Applications and Challenges of Text Mining with Patents

                Hidir Aras, René Hackl-Sommer, Michael Schwantner and Mustafa Sofean
                                                      FIZ Karlsruhe
                             Hermann-von-Helmholtz-Platz 1, D-76344 Eggenstein-Leopoldshafen
                                              firstname.lastname@fiz-karlsruhe.de


ABSTRACT                                                                    Boolean queries, the diligent usage of proximity operators,
This paper gives insight into our current research on three                 and vast lists of synonyms. New functionality, which helps
text mining tools for patents designed for information pro-                 them in searching and analysing the result set, is therefore
fessionals. The ﬁrst tool identiﬁes numeric properties in the               greatly appreciated. Tools and methods for ordinary docu-
patent text and normalises them, the second extracts a list                 ments are manifold, the challenge is to adapt or to re-design
of keywords that are relevant and reveal the invention in                   them in such a manner that they work with patents.
the patent text, and the third tool attempts to segment the
patent’s description into it’s sections. Our tools are used in              In this paper, we introduce three text mining tools speciﬁ-
the industry and could be applied in research as well.                      cally designed for patent texts we have implemented or are
                                                                            investigating on, respectively. Section 2 describes the nu-
1. INTRODUCTION                                                             meric property extraction, which allows for recognising num-
Patents are a very complex and diﬃcult to analyse type of                   bers, measurements, and intervals. This feature enables the
text. As described in [10], their linguistic structure diﬀers               user to integrate a search for numeric properties, e.g. for
very much from common language. Patents, as a corpus and                    temperature measurements ranging from 150K to 200K, into
as a single document, are both very heterogeneous. They be-                 his query to enhance the precision. Section 3 shows the chal-
long to subject areas as diverse as chemistry, pharmacology,                lenges of automatic keyword extraction with focus on the
mining and all areas of engineering, with the consequence                   invention, giving the user the opportunity to get a quicker
that all kinds of terminology can be found in a patent cor-                 overview of the content of a single document or an answer
pus. A patent corpus usually covers a long time span, often                 set. Section 4 outlines the patent description segmentation,
from the 1950s to the present. Patents from the princi-                     a tool for identifying the several parts which constitute a
pal patent authorities amount to more than 70 million pub-                  patent description. With that, the user can limit his search
lications. Typographical errors are not uncommon, since                     to speciﬁc parts of the description, again for a higher preci-
many patents in their machine-readable form are derived                     sion. Finally, we conclude this work with our main ﬁndings
from OCR-processing and machine-translation. Patents are                    and future work.
on the average two up to ﬁve times longer than scientiﬁc ar-
ticles. Their textual part is composed mainly of the detailed
description of the invention and the claims. The former is
                                                                            2.    NUMERIC PROPERTY EXTRACTION
often similar to scientiﬁc articles, whereas the latter is char-            In many technical ﬁelds, key information is provided in the
acterised by a legal language.                                              form of ﬁgures and units of measurement. However, when
                                                                            these data appear in full text, they are almost certainly lost
Users of patent information usually are information profes-                 for search and retrieval purposes. The reason for this is that
sionals, who cooperate with the research departments or the                 full text is indexed in a way that makes it searchable with
legal department of their companies. They have very high                    strings. In that manner, only the string representation of a
requirements on the correctness and completeness of the                     numeric property would be searchable, which is, of course,
data, on the eﬃciency of the search interface, and on the                   wholly unsatisfactory.
trustworthiness of the provider. The cause of their search
is normally business critical, the endeavour compares to a
search for a needle in a haystack. Their search strategy is by
                                                                            2.1   Related Work
                                                                            To date, some attempts have been made to extract such data
far diﬀerent from a typical Google search; it uses complex
                                                                            automatically from text. A tentative approach in GATE
                                                                            where the identiﬁcation of numeric properties from patents
                                                                            was addressed as a sub-task is described in [1]. [4] exam-
                                                                            ine the detection of units of measurement in English and
                                                                            Croatian newspaper articles over a small sample of 1745
Copyright l’ 2014 for the individual papers by the papers’ authors. Copy-   articles per language using NooJ. [9] investigate the issue
ing permitted for private and academic purposes. This volume is published   from a Belarussian/Russian perspective with many unique
and copyrighted by its editors. Published at Ceur-ws.org Proceedings of     language-related challenges relying on NooJ, too. These ap-
the First International Workshop on Patent Mining and Its Applications
(IPAMIN) 2014. Hildesheim. Oct. 7th. 2014. At KONVENS’14, Octo-             proaches lack either the generalisability to an extensive cor-
ber 8-10, 2014, Hildesheim, Germany.                                        pus or deal mainly with the Russian language. There is also
a commercial tool available from quantalyze1 , however, this      a common base unit for all units which describe the same
tool appears to identify a much more limited variety of units     physical property and convert all instances from the full text
than ours and it also lacks the identiﬁcation of enumerations,    to that base unit for indexing and searching. Therefore,
which are abundant in patents and therefore indispensable.        all instances of units from the full text are converted into
                                                                  their corresponding base units as they are deﬁned in the
2.2 Requirements and Tasks                                        International System of Units (SI).
The following sections describe the requirements and rele-
vant tasks in numeric property extraction.                        Identification of Intervals
                                                                  There are two main ways in which intervals can be con-
Identification of numbers                                         strued. One relies on context words, in which the words sur-
Clearly, a number consisting of digits only can be easily iden-   rounding numeric entities indicate an interval, e.g. between
tiﬁed. For numbers with decimal points we have observed           12 and 100 Watts. Another way comprises the use of sym-
that in our data both numbers following English as well as        bols, e.g. 5–6 mg or >12 hours. While there are only some
German convention are present. Numbers do also appear in          phrases that are often encountered which indicate intervals
scientiﬁc notation, and there is a range of characters that       with bounds on both sides, there are many more when it
is used to denote a multiplication or a exponentiation. We        comes to intervals unbounded on one side. The latter can
also note the use of the HTML sup-tag indicating super-           appear before or behind the numeric entities to which they
script. Examples of valid expressions therefore include:          refer, e.g. more than 200ml or 200ml or more. Negated for-
                                                                  mulations like not more than have to be taken into account
    1,300.5 (English convention); 1.300,5 (German);               as well. Frequently, there are also adverbs present which
    3.6 x 10-4; 10^5; 4.5x10sup"5; 8.44 x 10 sup* 10              add no speciﬁc information to the context, but just need to
Frequently, in patents numbers are spelled-out, as in ten         be ﬁltered out, e.g. about, around, roughly.
mg instead of 10 mg. These instances are recognised and
converted into their respective numerical values.                 Enumerations and Ratios
                                                                  Enumerations of numbers or even intervals are very com-
Identification of units of measurements                           mon in patents. They usually follow a comma-separated
This task, looking simple at ﬁrst sight, requires some at-        pattern: a thickness of 1, 2, 3, 4, or 5 mm. The identiﬁ-
tention with respect to spelling (in particular uppercase vs.     cation of enumerations is rather straightforward as there is
lowercase), spacing, and disambiguation.                          only a small number of variations that together cover >90%
     • Upper-/lower case: There are some instances, in which      of occurrences.
       capital letters and small letters refer to diﬀerent en-
       tities, e.g. S stands for Siemens, a unit for electric     Ratios are used to describe the proportionate relationship
       conductance, whereas s stands for second.                  between two or more entities from a common physical di-
                                                                  mension. A sample expression from an everyday background
     • Spacing: There is some diversity regarding blank char-     might be make sure the ratio between sugar and flour is 1:3.
       acters in spellings of units of measurement consisting     This being a simple example, the recognition of ratios is
       of more than one word, e.g. J per mol-K. Therefore,        actually a diﬃcult endeavour. The reason is the immense
       the longest possible sequence in a series of tokens has    heterogeneity in which ratios can be expressed. Simple ra-
       to be matched.                                             tio formulations are typically separated by colons or slashes.
                                                                  They take general forms like ”Number:Number” or ”Number-
     • Ambiguity: For a few units, their abbreviated spellings    to-Number”. An approach relying solely on these patterns
       might refer to diﬀerent entities, e.g. C might stand for   will invariably locate many false positives.
       Degrees Celsius or Coulomb; A might mean Ampere or
       Angström (cf. Noise Reduction).
                                                                  Noise Reduction
The vast majority of units appear after numbers; however,         The aim of noise reduction is to eliminate false positives.
there are some units that only appear before numbers, like        This is a critical task especially for units of measurements
the pH value or the refractive index.                             consisting of only one letter, the most frequent being the
                                                                  aforementioned A and C.
Unit normalisation
Many measurements of physical properties can be expressed
with various units. For example, 800W is equivalent to 800
                                                                  2.3   Implementation
                                                                  We are using the Apache UIMA framework for the pre-
Joules/second, and 180◦ C to 453 degrees Kelvin. For the
                                                                  sented analysis of data. It provides a robust infrastructure
measurement of pressure, the following non-exhaustive list
                                                                  for developing modular components and deploying them in
of units can be used: kg/m2; N/m2; Pa; Torr; atm; cm
                                                                  a production environment. Finite State Automata (FSA)
Hg; ounces per square yard. Additionally, a great number
                                                                  are used throughout for pattern matching. They perform
of preﬁxes like nano, µ, kilo, tera and their abbreviations
                                                                  much better than Java-patterns and regular expressions, and
have to be considered. Hence, to get a hit with standard
                                                                  even small improvements add up quickly when it comes to
indexing, a user would need to include all sorts of variations
                                                                  processing data in the terabyte range. For the identiﬁca-
in order to achieve even a modicum of accuracy and recall.
                                                                  tion of numbers, intervals, and enumerations valid sequences
Clearly, a superior way to address these issues is to deﬁne
                                                                  of phrase parts and type-related placeholders (both conﬁg-
1
    https://www.quantalyze.com/                                   urable) are expressed in a FSA-based grammar.
Adapted to the English language, our system currently recog-    3.1    Related Work
nises more than 15,000 unit variants belonging to 80 base       [2] used term frequency, phrase frequency and the frequency
units. Included are all commonly used dimensions like time,     of the head noun for identifying the relevant keywords from
temperature, or weight, but also many dimensions that are       a candidate set. The phrase candidates are sorted according
more relevant in professional use, e.g. dynamic viscosity,      to the head noun frequency. Afterwards additional statis-
solubility, or thermal conductivity. We are using a window-     tical ﬁlters are applied. [7] reported that technical terms
ing technique for ratio recognition. From any occurrence of     mainly consist of multi-words, e.g. noun phrases with a
the word ratio in the text, up to ﬁve words to the left and     noun, adjective and the preposition ”of” in English texts.
15 words to the right are evaluated. While this approach        Single words in general are less appropriate for represent-
manages to identify many valid ratios, many cases still re-     ing terminology. Most word combinations describing termi-
main in which ratios are not recognised, like ratios for more   nology are noun phrases with adjective-noun combinations.
than two entities or ratios in alternative formulations (e.g.   Experiments also indicate the impact of the term position,
10 parts carbon black and 4 to 6 parts oil extender ). These    e.g. in title or a special section. It was also shown that
will be dealt with in future versions.                          proper nouns rarely represent good keywords for represent-
                                                                ing terminology.
Conversion between units is a straightforward task. The
units, their variants and conversion rules are kept in a con-
ﬁguration ﬁle. Three more conﬁguration ﬁles are provided
                                                                3.2    Challenges and Tasks
                                                                One main challenge in keyword extraction is related to the
for rules to recognise intervals and for the noise reduction,
                                                                subjectivity of keywords for a particular user, whose exper-
respectively. By this means, changes or extensions can be
                                                                tise, common knowledge about the regarded technical do-
eﬀected without the need to change the source code and re-
                                                                mains and the focus of interest can vary with respect to man-
deploy the software. For the noise reduction task, two lists
                                                                ifold aspects. Besides that, patent full texts describe general
have been deﬁned. The ﬁrst list applies to all units. It con-
                                                                aspects, state of the art that experts are familiar with and
tains terms like figure or example. If one of those global
                                                                make use of expressions and terms that are rarely used in
terms precedes a numeric entity, that entity is judged as
                                                                classic texts (neologisms). Hence, separating the wheat from
noise and removed (examples: figure 1A or drawing 2C ).
                                                                the chaﬀ can be diﬃcult. Moreover, as the description part
The second list is speciﬁc to certain units only. If a term
                                                                of a patent can be very heterogeneous, mixed with tables,
contained therein follows a numeric entity, this text passage
                                                                ﬁgures, examples, mathematical or chemical formulas, etc.,
will be ignored as well (e.g. 13C NMR). Extracted and con-
                                                                identifying relevant sections that contain keywords that are
verted entities are added to our search engine indexes.
                                                                directly related to the invention can be a tricky task as well.
                                                                All these challenges call for deeper analysis of the content, in
Regarding evaluation, we followed an iterative development
                                                                order to better understand patent texts and improve search-
cycle with many intellectual assessments. In the process, we
                                                                ing speciﬁc aspects or entities in the patent texts.
have set up extensive JUnit-tests for software development
and continuous integration. When a test person or, later,
one of our customers found a speciﬁc piece of text that re-
quired improvement, we included it. As a result, given the
size of our data it has over time become increasingly diﬃcult
to ﬁnd text snippets that are not or faultily recognised. We
have not carried out extensive formal recall/precision evalu-
ations, because the eﬀort required building a gold standard
with signiﬁcant sample size and real world data (as opposed
to manually construed ”diﬃcult” data) is not oﬀset by the
projected gains. All our customer feedback indicates that
our results are very good.

                                                                Figure 1: Phrase pattern distribution of top key-
3. KEYWORD EXTRACTION                                           words from three experts (Analysis of EPO patents).
Keywords extracted from a document are of great beneﬁt for
search and content analysis. In the patent domain important     Analyses show that most of the relevant linguistic phrases in
keywords can be utilised for searching as well as getting       patent texts are noun-sequences and noun-adjective combi-
an overview of the topics and the focus of a single patent      nations (Figure 1). Despite this, depending on the domain
document or an answer set. In both cases they can avoid         of interest, complex noun phrases that are used to describe,
unnecessary time-consuming and costly analysis e.g. in prior    e.g. a process, chemical entity or formula, and verbal phrases
art or freedom to operate scenarios. Existing methods for       can be observed. The role of the verbal phrases seems to be
keyword extraction – be it automatic or supervised – use        debatable, as recent results [8] show.
either statistical features for detecting keywords based on
the distribution of words, sub-words and multi-words, or        Investigation of evaluation data from experts indicate that
exploit linguistic information (e.g. part-of-speech) over a     extracting phrases of length ≤ 5 is reasonable in case of
lexical, syntactic or discourse analysis. Furthermore, hybrid   linguistic technical terms, which might be diﬀerent when
approaches exist, which try to combine the various types of     considering also domain-speciﬁc entities from the chemical,
algorithms and apply additional heuristic rules, e.g. based     bio-pharma, or other domains. Figure 2 shows the frequency
on position, length or layout.                                  distribution of the phrase lengths up to 9 words in the an-
notated corpus. For example, in the descriptions part, the       account the respective parameters. In order to avoid loss of
experts annotated more than 350 times phrases consisting         information, a conservative method is preferred over utilising
of only two words. Focusing on automatic keyword extrac-         harsh frequency thresholds. Rather, the overall ranking is
tion, a further prerequisite is to deal with similar phrases     aﬀected by an elaborated weighting scheme considering be-
with diﬀerent morphological and syntactical structure. For       sides intra-section features also ﬁeld-based analysis for the
keyword search or for generating content overviews this syn-     sections title, abstract, claims and the descriptions text.
tactic variations [5] must be normalised and mapped to one
canonical form. For example: circular or rectangular pat-        3.3.1    Dataset and Evaluation
terns → circular pattern, rectangular pattern, method for        The implemented approach was evaluated based on a corpus
combating spam → spam combating method, etc.                     with 20.000 documents from several domains, e.g. chemical,
                                                                 bio-pharma as well as engineering, from the European patent
Another important task that also concerns patent search          database comprising granted patent documents having ti-
in general is semantic normalisation to aggregate semanti-       tle, abstract, claims and descriptions text. An expert-based
cally equivalent or similar phrases which can vary in wording    study served to create a test corpus of 70 patent documents
considerably. The recognition of speciﬁc entities – be it sim-   annotated with keywords in the aforementioned main sec-
ple or complex forms, identifying taxonomic relations, syn-      tions of the patent text. Therefore, the two participating
onyms, chemical entities, enumerations, etc. represent other     experts marked up to 20 most relevant keywords in a patent
challenges in the course of understanding a given patent text    document that characterise the topic and the focus of the de-
beyond general linguistic phrases or terms. In classic key-      scribed invention. The main textual sections comprising the
word extraction, keywords in title or abstract are automat-      combined title-abstract, claims and descriptions were eval-
ically regarded as important, while for patents a sophisti-      uated separately, i.e. keywords sets were not mixed. The
cated weighting scheme based on analysing keyword occur-         created (annotated) datasets were used for evaluating the
rence and co-occurrence with respect to diﬀerent sections is     keyword extraction. For evaluating the implemented base-
required. A further task is to decide how the ﬁnal keyword       lines based on the TF-iDF weighting scheme, the rank-based
set is presented to the user. While in classic keyword extrac-   evaluation metrics precision@k, recall@k and F-Score have
tion rarely more than 10 keywords are returned to the user,      been used.
in the patent domain information professionals indicate that
displaying 50, even 100 keywords would be desirable.             For the ﬁeld combination title-abstract, the exact keyword
                                                                 match results for precision varied between 34% for the top
                                                                 10 keywords and 20% up to 30% for the top 20. Looking
3.3 Implementation and Evaluation                                at recall considering a wider range of up to 50 keywords, a
A proof-of-concept prototype based on linguistic and sta-
                                                                 score around 40% was calculated. As exact match does not
tistical analysis was implemented in order to evaluate some
                                                                 consider syntactic variations for the extracted key phrases,
of the described tasks. The general procedure comprised
                                                                 a fuzzy matching method was applied as well. Depending
the steps for linguistic and statistical pre-processing, noun
                                                                 on the fuzziness parameter, false positives may also be re-
phrase extraction and analysis and phrase weighting based
                                                                 turned, which only can be detected by manual expert-based
on features such as length, position, TF-iDF weight or sec-
                                                                 inspection. The results after applying the fuzzy matching
tion. A typical linguistic pre-processing includes sentence
                                                                 method were much better for precision (˜75% for the top
detection, tokenisation, POS-tagging and noun phrase chunk-
                                                                 10 keywords and 46% for the top 20 keywords) and recall
ing. The noun phrase extraction allows to identify basic pat-
                                                                 (˜87%). For the claims the precision varied between 27%
terns of important noun phrase chunks, while applying a ﬁl-
                                                                 and 30% for the top 20 keywords in case of exact match,
tering method for removing irrelevant (stop-)words at start
                                                                 while again the recall for the extracted keywords increased
and end. As many syntactic variations of the extracted key-
                                                                 from 27% to approx. 46% when taking a wider range of
words may occur besides a syntactic normalisation method,
                                                                 up to 50 keywords. For fuzzy matching, a precision score
linguistic and statistical analysis must be applied in order
                                                                 above 75% for the top 10 keywords and 70% for the top 20
to reduce the candidate set for ranking. A candidate phrase
                                                                 was achieved. In claims, the recall for the top 50 keywords
is evaluated by means of a scoring formula that takes into
                                                                 was about 92%. Due to the heterogeneity and the amount
                                                                 of text present in the descriptions part, the challenges seem
                                                                 here much higher. For the TF-iDF baseline the exact match
                                                                 results for precision varied between 14%-15%, while the re-
                                                                 call for the top 10-50 keywords increases from 8% to 25%.
                                                                 Applying fuzzy matching, the scores for precision were again
                                                                 much better. Depending on the fuzziness parameter for the
                                                                 matching similarity that varied between 0.5 (50% match)
                                                                 and 0.9 (90% match), the precision score was between 80%
                                                                 and 50% for the top 50 keywords for the regarded dataset.

                                                                 4.   TEXT SEGMENTATION
                                                                 Patent documents are lengthy, abundant, and full of de-
                                                                 tails, such that it may hinder the topic analysis for humans
                                                                 and for machines as well. One of the text mining techniques
Figure 2: Phrase length distribution of top keywords             which can ease these intricacies is text segmentation [3]. The
for abstract, claims and descriptions.                           automatic structuring of patent texts into pre-deﬁned sec-
                              Table 1: A list of sections in description text of the patent.
                        Section Types          Example
                        Detailed Description   Best Mode of the Invention, Embodiments of the Invention
                        Background             Background of Invention, Prior Art
                        Summary                Summary of the Invention, Objectives of the Invention, Disclosures
                        Methods                Procedures, Operations, Experiments
                        Drawing and Figures    Detailed Description of the Drawing
                        Applicability          Industrial Applications, Applications of the Invention
                        Technical Field        Technical Field of the Invention, Field of Technology
                        Examples               Embodiment Example, Experimental Example
                        Sequences              List of Sequences, Numerical Sequence
                        References             List of References, Literatures
                        Statements             Statement of Government Rights, Acknowledgement


tions will serve as a pre-processing step to patent retrieval          numbers, math equations, and formulas) via a tokenisation
and information extraction, as well as enable the interested           process. Then, we created the positive-list which contains
people to understand easily the structure of a patent that             terms that appear more than ﬁve times in all headings of the
leads to fast, eﬃcient, and easy access to speciﬁc information         dataset, and the ﬁrst-token list which includes terms from
which they are looking for. Furthermore, noun phrases of               the headers which appear more than ﬁve times as the ﬁrst
important sections in the patent texts could be used as main           word of a header.
features for patent classiﬁcation and clustering to achieve a
good performance.                                                      4.2 Header Detection and Meaning
                                                                       In cooperation with a patent expert, we identiﬁed segmen-
The textual part of a patent contains title, abstract, claims,         tation guidelines. These guidelines help us to understand
and the detailed description (DetD) of the invention. The              the section types (Table 1) in the DetD. In order to discover
latter includes the summary, embodiment, and the descrip-              the headers inside the DetD, we need to get the boundary
tion of ﬁgures and drawings of the invention. As of the                of the headers. i.e., the header’s start and end. We call this
amount of information in DetD, there is a need for auto-               operation Header Detection. Then, we identify the text con-
mated tools, which can determine the document-level struc-             tent which is related to each header. The header meaning
ture of the DetD, identify the diﬀerent sections and map               on the other hand is represented by assigning the header to
them automatically to known section types. There has been              an appropriate section type (e.g.; summary, example, back-
previous work which showed that the semantic of the patent             ground, method, etc). Here, a rule-based approach is more
document structure is valuable in patent retrieval [6], but it         suitable because in the patent domain, there is no suﬃcient
only focused on structured patent text which is labelled by            training data for a machine learning algorithm to be success-
speciﬁc tags in the original text. The work in [1] presented           ful. To do so, we develop a rule-based algorithm to identify
a rule-based information extraction system to automatically            headers and their boundaries. The output consists of all
annotate patents with relevant metadata including section              headers and their positions inside the DetD. Our algorithm
titles. In this section, we describe our text segmentation             works as follows: As input we take the DetD as a sequence of
method which is used to recognise the structure of the DetD.           paraghraphs. Then, we test the following features to decide
                                                                       whether a paragraph is a header or not:
There are many challenges that arise in patent text segmen-
tation, for example measuring the similarity between the                  A. The number of words in the paragraph.
sentences is diﬃcult to use because there are a lot of iden-
                                                                          B. The number of characters in the paragraph.
tical terms in the sentences. Another challenge is that the
patent contains a lot of new technical terminologies which                C. True, if all letters in the current paragraph are in upper
are hard to collect when using a term matching technique.                    case; false otherwise.
To meet these challenges, we currently develop a patent text
segmentation tool which automatically segments the patent                 D. True, if all words in the paragraph start with an upper
text into semantic sections by discovering the headers inside                case letter; false otherwise.
the texts, identifying the text content which is related to
                                                                          E. True, if the current paragraph contains words from the
each header, and determining the meaning of the header.
                                                                             positive-list, false otherwise.
4.1 Dataset and Preprocessing                                             F. True, if in the current paragraph more words start with
Our dataset consists of a random sample of 139,233 patents                   a capital letter than with a lowercase; false otherwise.
from the European Patent Oﬃce (EPO) and converted by
FIZ Karlsruhe2 into a proprietary XML format with tagged                  G. True, if the current paragraph starts with a bullet;
paragraphs. Processing techniques have been applied to                       false otherwise.
understand the type, style, and format of headings inside                 H. True, if the previous or the next paragraph starts with
patent texts. We started by parsing XML ﬁles to get a list of                a bullet; false otherwise.
headings in the description part. The headers pass through
a cleansing process that is represented by removing unde-                  I. True, if the ﬁrst token in the paragraph appears in the
sired tokens in each header (e.g.; numbers, special charac-                   ﬁrst-token list; false otherwise.
ters, words containing special symbols, words starting with
                                                                          J. True, if the current text paragraph contains a simple
2
    http://www.ﬁz-karlsruhe.de                                               chemical text; false otherwise.
  K. The average header length in the dataset’s headers.             speciﬁc domain and use, e.g. treatment of diseases, medical
                                                                     substances, etc. than in an isolated manner. Possible en-
  L. The average number of words in the dataset’s headers.           hanced methods for keyword context analysis could rely on
We use these features on each input paragraph of the DetD            semantic analysis based on the co-occurrence method, (la-
to build decision rules for the header detection. Some of the        tent) semantic analysis or other dedicated semi-supervised
decision rules are listed below:                                     and unsupervised machine learning techniques. Further-
                                                                     more, a more enhanced method for semantic segmentation of
   i. C is true and G is false and A≥1 and J is false                patent text needs to deal with patents that do not have any
                                                                     heading inside their texts and address the overlap problem
  ii. D is true, E is true, A≥1, G is false, and J is false
                                                                     between section types. Our ﬁnal goal is to develop a ﬂexible,
  iii. G is true, H is false, A<L, J is false, B<K, and A≥1          scalable and automatic tool, which has the ability to facili-
                                                                     tate the reading of a patent, keyword extraction, summary
  iv. I is true, F is true, J is false, A≥1, and G is false.         extraction, and classiﬁcation and clustering of patent texts.
After detecting the headers, we identify the start and end
position of each header in the DetD. The detection of the            6.   REFERENCES
text belonging to the header is performed by identifying the          [1] M. Agatonovic, N. Aswani, K. Bontcheva,
paragraphs between the current header and the next header.                H. Cunningham, T. Heitz, Y. Li, I. Roberts, and
After the detection of headers and their boundaries, each                 V. Tablan. Large-scale, Parallel Automatic Patent
header should be assigned to one of the appropriate pre-                  Annotation. In Proceedings of the 1st ACM Workshop
deﬁned section types by using a prediction model from the                 on Patent Information Retrieval, PaIR ’08, pages 1–8,
machine learning step. This task was modelled as a classiﬁ-               New York, NY, USA, 2008. ACM.
cation task via constructing a training dataset by labelling          [2] K. Barker and N. Cornacchia. Using Noun Phrase
manually a representative sample of 1377 headers of sec-                  Heads to Extract Document Keyphrases. In
tion types that are shown in the Table 1. The labelling                   H. Hamilton, editor, Advances in Artificial
process is done by applying the guidelines created by the                 Intelligence, volume 1822 of LNCS, pages 40–52.
patent expert. Pre-processing steps were performed to re-                 Springer Berlin Heidelberg, 2000.
move undesired tokens like numbers, special symbols, and              [3] D. Beeferman, A. Berger, and J. Laﬀerty. Statistical
stopwords, as well as to compute the weight vector for the                Models for Text Segmentation. Machine Learning,
training set. We used Support Vector Machines (SVMs) as                   34(1-3):177–210, Feb. 1999.
a multi-classiﬁcation technique to train the dataset. The             [4] B. Bekavac, Z. Agic, K. Sojat, and M. Tadic. Detecting
evaluation was done by using 5-fold cross validation, and                 Measurement Expressions using NooJ. In Proceedings
the performance of the categorisation achieved up to 91%,                 of the Conference on NooJ, pages 121–127, 2009.
90%, and 91% of accuracy, recall, and precision respectively.
                                                                      [5] R. Bhagat and E. H. Hovy. What Is a Paraphrase?
                                                                          Computational Linguistics, 39(3):463–472, 2013.
5. CONCLUSION AND FUTURE WORK                                         [6] H.-Y. J. Jae-Ho Kim, Jin-Xia Huang and K.-S. Choi.
In this paper we presented our research on three text min-                Patent Document Retrieval and Classiﬁcation at
ing tools tailored to the singularities of patent documents.              KAIST. ”Proceedings of NTCIR-5 Workshop Meeting,
Though patents are very diﬀerent from normal texts in length,             December 6-9, Tokyo, Japan”.
structure, language, and terminology, though the require-             [7] J. S. Justeson and S. M. Katz. Technical Terminology:
ments of patent information searchers are much more strict                Some Linguistic Properties and an Algorithm for
than those of other users, and though no gold-standards for               Identiﬁcation in Text. Natural Language Engineering,
these tasks are available, which reﬂect a realistic retrieval sit-        1(1):9–27, 1995.
uation, we could show, that solutions exist which can cope
                                                                      [8] J. M. Schulz, D. Becks, C. Womser-Hacker, and
with these challenges. The results of our numeric entity
                                                                          T. Mandl. A Resource-light Approach to Phrase
extractor are since long available to our clients and are well
                                                                          Extraction for English and German Documents from
accepted. When designing functionality like keyword extrac-
                                                                          the Patent Domain and User Generated Content. In
tion or description segmentation, we seek at an early stage
                                                                          N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan,
the feedback of our customers. For the numeric property
                                                                          B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis,
extraction, there are still some areas of potential for further
                                                                          editors, LREC, pages 538–543. European Language
research. Disambiguation is one of them: the symbols A
                                                                          Resources Association (ELRA), 2012.
and C were already mentioned; in might be a preposition
or denote inch. Other topics concern the extraction of rela-          [9] A. Skopinava and Y. Hetsevich. Identiﬁcation of
tions. For instance, it might be useful to identify what kind             Expressions with Units of Measurement in Scientiﬁc,
of a temperature is discussed in a text. Is it a melting point            Technical & Legal Texts in Belarusian and Russian. In
or a boiling point? To which substance or process does the                Proceedings of the Integrating IR technologies for
temperature refer? Oftentimes in patents, whole recipe-like               Professional Search Workshop, pages 26–34, 2013.
paragraphs are available from which a lot of factual data            [10] J. M. Struss, D. Becks, T. Mandl, M. Schwantner, and
could be extracted. For keyword extraction, besides the                   C. Womser-Hacker. Patent Retrieval und Patent
challenges discussed before, learning keywords by consider-               Mining: Sind die Anforderungen eingelöst? In 3.
ing domain-speciﬁc knowledge from controlled vocabularies                 DGI-Konferenz. Informationsqualität und
is required to identify most relevant facts about an inven-               Wissensgenerierung., pages 25–36, 2014.
tion more precisely. It is also reasonable to extract keywords
rather on the basis of semantic information tailored for a