=Paper= {{Paper |id=Vol-1292/ipamin2014_paper11 |storemode=property |title=Developing Semantic Search for the Patent Domain |pdfUrl=https://ceur-ws.org/Vol-1292/ipamin2014_paper11.pdf |volume=Vol-1292 |dblpUrl=https://dblp.org/rec/conf/konvens/EisingerMS14 }} ==Developing Semantic Search for the Patent Domain== https://ceur-ws.org/Vol-1292/ipamin2014_paper11.pdf
         Developing Semantic Search for the Patent Domain

Daniel Eisinger, Jan Mönnich, Michael Schroeder
Technische Universität Dresden, BIOTEC
Tatzberg 47/49, 01307 Dresden, Germany
daniel.eisinger@biotec.tu-dresden.de, jan.moennich@biotec.tu-dresden.de, ms@biotec.tu-dresden.de

ABSTRACT
The patent domain is a very important source of scientific information that is currently not used to its full potential. Issues such as high numbers of patents, complicated language style and inconsistently used vocabulary make the task of searching for relevant patents extremely complex. While this is already a problem for patent professionals who have to invest a lot of time and effort into their search, it is even more problematic for academic scientists with little experience in this domain.

Semantic search functionality has been demonstrated to provide large advantages for document search in other domains. As an example, the search engine GoPubMed offers advanced search functionality for the biomedical domain based on annotating documents with relevant concepts from various ontologies. In this paper, we report on our efforts to provide comparable advances for the patent domain. We introduce the patent search prototype GoPatents, and we describe the experiments that we performed during its development in the areas of term extraction, term and IPC class co-occurrence analysis, automated patent categorization, and automated annotation with ontology concepts.

1.    INTRODUCTION
As evidenced by a growing number of reports about various high-profile patent trials in recent years, having the necessary information about all relevant competitor patents can be vital to a company’s interests. At the same time, patents can also be a valuable source for academic research, since current research results are often first published in a patent and only afterwards (or never) in a journal. Experts have estimated that only 10-15% of the patent content is also described in other publications, and that 80-90% of all scientific knowledge is contained in patents [2]. Despite that potential, most academic researchers are to our knowledge not using patents, presumably due to the high complexity of the domain.

The number of patent applications continues to rise, reaching 2.35 million worldwide in 2012 alone [11], only one year after surpassing two million for the first time ever in 2011 [10]. The number of patent grants is also at an all-time high, exceeding the one million mark for the first time in 2012 [11]. Additionally, the documents are not always available in English, which makes finding all relevant documents extremely difficult. But even for the documents with English-language versions, there are some unique challenges that separate the patent domain from most other document types.

While it is not unusual to rely mainly on keywords for searching most other document corpora, this approach does not return satisfactory results for many patent search tasks. Different sections of the patent text are written in completely different styles, patent authors don’t always use standard terminology (or it may not even exist), and many patents are written in very unspecific language. The problem has been summarized by the European Patent Office (EPO) in the following way, using the term “patentese” for the unconventional language style that is typically only used in patents: “Newcomers to intellectual property are often surprised or even shocked at the way words or phrases familiar in everyday language are used very differently in the world of patents. Grammatical constructions that would be unthinkable in everyday speech or writing are used routinely in patentese. Patentese has words which do not even exist in ordinary languages. Furthermore patentese exists in every conceivable natural language version” [1].

As a result of these problems, professional patent searches usually don’t rely exclusively on keywords. The most important way to improve pure keyword searches is through the use of the classification information that is provided by the patent offices. This information can also be used to filter or expand search results, but in order to make the most of these possibilities, the searcher must have detailed knowledge about the classification system. Unfortunately, this is not the case for many academic researchers. Even for professional patent searchers, the process of constructing and refining patent queries is quite complicated and time-consuming.

Consequently, it is desirable to offer a system that provides an easier option for scientists to perform high-quality patent searches and assists patent professionals in completing and refining their initial queries. In order to provide such assistance, it is important to have a clear understanding of the properties of patent classification systems. We therefore carry out an in-depth investigation of the most common patent classification system, the International Patent Classification (IPC). Since the benefit of using existing annotations for semantic search has already been demonstrated in the biomedical domain, we use the controlled vocabulary “Medical Subject Headings” (MeSH) that is used to annotate all document abstracts on the biomedical literature database PubMed as a point of comparison. Following this analysis, we give a detailed description of multiple approaches we are proposing to improve patent search, and we introduce the patent retrieval prototype GoPatents that incorporates some of these proposals.

Copyright © 2014 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org.
Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014. At KONVENS’14, October 8–10, 2014, Hildesheim, Germany.
2.      COMPARATIVE ANALYSIS OF MESH AND IPC
Our analysis of MeSH and IPC can be divided into three parts: The first two parts concern the respective hierarchies and terms of the systems themselves, while the third part examines their usage for document classification. We analyzed the latter by collecting classification information from a large patent corpus as well as the annotations to all PubMed documents published by early 2011. Table 1 summarizes some core results of our analysis.

     Property                                             MeSH        IPC
     number of hierarchy entries                          54095       69487
     number of unique entries                             26581       69487
     number of main trees                                 16          8
     number of hierarchy levels                           13          14
     occurrence of class labels in text                   frequent    very rare
     average number of annotations per document           9           2
     proportion of documents with multiple annotations    86%         53%
     proportion of documents with related annotations     81%         46%
     (i.e., same hierarchy tree)

Table 1: Comparative analysis MeSH vs. IPC. The hierarchical structures are similar, but MeSH terms are shorter and more likely to occur in text. The number of MeSH annotations per document far surpasses the number of classes per patent.

The number of unique MeSH entries is considerably smaller than the number for IPC, but since the hierarchy tree of MeSH allows for the same heading to appear more than once, the sizes are comparable, as are the hierarchies (cf. Figure 1).

Figure 1: IPC vs. MeSH - Terms/classes per hierarchy level. Both hierarchies expand in similar ways.

The comparison of the terms on the other hand shows some major differences. IPC is focused on alphanumeric class codes while MeSH emphasizes terms; IPC definitions are longer, more complicated and less self-contained than MeSH headings, and are therefore much less likely to appear in the text. As Figure 2 shows, there are also large differences between the numbers of MeSH annotations per document and the numbers of IPC annotations per patent: While most patents have fewer than five assigned classes, PubMed documents have around nine on average and often even considerably more. Additionally, we were able to show that the annotation sets for patents are much less diverse than for PubMed, leading us to question the completeness of the existing assignments.

Figure 2: Percentage of documents with number of annotations. The average number of MeSH annotations per PubMed document is much higher than the number of IPC classes per patent.

We therefore believe that the use of IPC for patent search comes with two serious disadvantages: First, the complexity of the system causes significant problems for non-professional patent searchers since it is very difficult to find the complete set of IPC classes that are relevant for the search task at hand. Second, the low number of class assignments may lead to unexpectedly low recall for classification-based patent searches that are often performed by professional searchers in order to overcome the problems of keyword search.

3.   SEMANTIC SEARCH FOR THE PATENT DOMAIN
This section describes our attempts to solve the problems caused for patent search by incomplete class assignments and complex patent text. We automatically assign additional classes, expand initial queries, and we annotate patent documents to make faceted search functionality possible.
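
The incompleteness of class assignments was quantified in Table 1 through per-document statistics. The following is a minimal sketch of how such figures could be computed; it is not the analysis code used for the paper. The function names, the toy data, and two interpretations are assumptions: the main tree of an IPC code is taken to be its section letter, and a document counts as having "related annotations" when at least two of its classes share a main tree.

```python
from collections import Counter


def main_tree(code: str) -> str:
    # Assumption: the main tree of an IPC code is its section letter,
    # i.e. the first character of the code (one of A-H).
    return code.strip()[0].upper()


def annotation_stats(doc_annotations):
    """Compute Table-1-style statistics.

    doc_annotations maps a document id to the class codes assigned to it.
    Returns the average number of annotations per document and the
    proportions of documents with multiple and with related annotations.
    """
    n_docs = len(doc_annotations)
    if n_docs == 0:
        raise ValueError("no documents given")

    total = 0      # sum of distinct annotations over all documents
    multiple = 0   # documents with more than one annotation
    related = 0    # documents with >= 2 annotations in the same main tree

    for codes in doc_annotations.values():
        unique = set(codes)
        total += len(unique)
        if len(unique) > 1:
            multiple += 1
            per_tree = Counter(main_tree(c) for c in unique)
            if max(per_tree.values()) > 1:
                related += 1

    return {
        "average": total / n_docs,
        "multiple": multiple / n_docs,
        "related": related / n_docs,
    }


# Hypothetical toy corpus; the document ids are made up for illustration.
example_patents = {
    "doc1": ["A61B 5/00", "G01N 33/50"],               # multiple, unrelated
    "doc2": ["A61B 17/00", "A61B 17/70"],              # multiple, related
    "doc3": ["A61M 25/00"],                            # single class
    "doc4": ["A61F 13/15", "A61B 5/00", "G01N 33/50"], # multiple, related
}

stats = annotation_stats(example_patents)
print(stats)
```

For MeSH, the same function could be reused by replacing `main_tree` with a lookup of the tree a heading belongs to; since a MeSH heading can appear in several trees, that lookup would need a different design than the single-letter shortcut used here.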
3.1    Patent Categorization
The most straightforward way of dealing with the problem of incomplete class assignments would be the assignment of additional classes, but due to the high number of patents as well as the high complexity of the classification system, this can only be done automatically. Depending on the accuracy of the automatic assignment of relevant classes, the method can be useful for two related but different ways of dealing with the low number of assigned patent classes:

  1. Given a class, find documents for this class.
     If the user knows that a particular class is highly relevant for their search, the automatic class assignments can be used to discover additional patents that should have been assigned to the class. The recall of the search can therefore be improved considerably.

  2. Given a document, find classes for this document.
     If the user has already collected a small set of relevant documents, the automatically assigned classes for these documents can help them find the classes that are related to these documents, even if there is no classification data available or if there are missing assignments. These additional classes again enable them to refine their initial search query.

Previous approaches to automated patent categorization were usually restricted to higher levels of the hierarchy (e.g., [7, 9, 6]). The only prior effort to classify patents down to the lowest level of the IPC involved a complicated three-phase algorithm that is not well suited for application on a large corpus; in addition, it already removes large parts of the hierarchy in the first step, which we believe makes it too restrictive for our goal of finding new relevant but potentially very different classes that were not previously assigned [3]. We therefore based our system on an approach that has been used successfully for the automated assignment of MeSH terms to PubMed documents by Tsatsaronis et al. [8]. It is based on training a series of Maximum Entropy classifiers (one for each class) on existing class assignments and applying them to each document that is supposed to get additional class assignments.

In order to evaluate the results of our categorization efforts, we constructed two training corpora from the EPO dataset that was also the basis of our previous analysis. The first corpus (C73) follows strict quality requirements and contains 73 classes while the second one (C1205) has more relaxed requirements and is therefore much larger with 1205 classes. This size difference in connection with the expected higher quality of the documents due to the constraints we mentioned above should lead to better categorization results for C73 than for C1205. With our initial evaluation, we tested our method’s ability to retrieve the classes that were actually assigned to the patents. Therefore, all of these classes were considered correct while everything else was considered wrong. While this approach cannot evaluate our method’s suitability for our objective of assigning new classes, it is nevertheless valuable for determining the quality of the classifiers by comparing their results with the categorization decisions made by the experts at the patent offices.

Table 2 shows the macro-average scores (precision, recall and F1-measure) of all classifiers using 10-fold cross-validation for the confidence threshold 0.5. The results are for the most part encouraging, with most values approaching 0.9. For the purpose of our first task, this means that we can retrieve additional documents with high confidence. The second task, finding additional classes for a given document, is more problematic however. Since we apply all classification models to all documents, a precision score of ≈ 0.9 leads to a high number of incorrect assignments. While using higher values for the confidence threshold has a positive effect on precision, it is accompanied by a severe drop in recall and therefore leads to a significantly lower F1-measure. This problem is caused by slower precision growth for the individual classifiers compared to the situation for PubMed/MeSH, making additional steps necessary. We propose two filtering options: Since most patent queries also include a keyword component, many of the incorrect assignments are filtered out automatically since they don’t include the required keywords. Additionally, we implemented a filter that accepts additional class assignments only if there is an existing patent that was assigned a similar combination of classes. The filter has multiple possible settings, from very restrictive (only allow classes that have previously co-occurred directly) to much less so (allow pairs of classes if their respective ancestors of a certain hierarchy level have been co-assigned). For a small set of example patents, this filter had the desired effect of filtering out unrelated classes while accepting related ones.

     Corpus     Precision    Recall    F1-measure
     C73        0.88         0.90      0.89
     C1205      0.88         0.84      0.86

Table 2: Evaluation results for confidence threshold 0.5. The precision values are identical for both corpora, but recall is considerably higher for the smaller corpus.

     IPC code      Class definition (abbrev.)              Features 1 to 5
     A61B 5/00     Measurement for diag. purposes          light, sensor, blood, patient, tissue
     A61B 17/00    Surgical instruments                    tissue, suture, end, surgical, closure
     A61B 17/70    Spinal positioners                      rod, bone, portion, member, screw
     A61F 13/15    Absorbent pads                          absorbent, material, napkin, web, diaper
     A61M 25/00    Catheter                                catheter, distal, end, tube, lumen
     G01N 33/50    Chemical analysis of biol. materials    sample, test, cell, specimen, light

Table 3: Most influential positive classifier features. Features were extracted from binary Maximum-Entropy classifiers trained on IPC classes with biomedical significance. The positive features for the classifiers in the list are useful for identifying patents that belong to the class.

The quality of the trained classifiers can also intuitively be judged by looking at the features that make the largest difference in categorizing documents. Table 3 shows the five most influential positive features from binary Maximum-
Entropy classifiers for a subset of IPC classes with biomedical significance, i.e., the features that were assigned the highest positive values by the Maximum Entropy method. The occurrence of these words in a document that is supposed to be classified increases the likelihood of positive classification; in other words, the document is more likely to be assigned the category represented by the classifier. Almost all features listed in the table appear to be well suited to making this distinction, since they are representative of their respective class. Although some of the class definitions are closely related, there is very little overlap in the most influential features. As an example, the five top features are completely disjoint for class A61B 17/00 about surgical instruments and its descendant A61B 17/70 about spinal positioners.

3.2    Guided Patent Search
The second part of our approach to address the problem of low numbers of patent class assignments and simplify patent search combines multiple systems intended to guide the user towards quickly and easily formulating patent queries that are as complete as possible. An initial user query is used to determine additional relevant query components. Since professional patent search queries are a combination of keywords and class codes in most cases, we investigated ways to expand both of these components. The discovered terms and classes are recommended to the user so they can decide which of the proposals should be included in the final query.

We demonstrated that additional relevant keywords can be extracted from a variety of sources including IPC class definitions and external resources such as MeSH. Most importantly, we extract keywords from existing patents using established natural language processing techniques after an initial evaluation showed the validity of this approach. Our method is based on analyzing patents from an IPC class that has been identified as relevant by the user. Since significant numbers of documents are available for most patent classes, this approach is able to deliver large numbers of keyword suggestions that are characteristic for the respective class. In a way, extracting relevant words from class patents is an expansion of our categorization efforts. Table 3 shows that this approach is able to discover useful keywords for search. Since we are also interested in relevant multi-word terms, we performed a more in-depth examination of different ranking algorithms for such extracted term candidates. Additionally, we investigated the influence of the background corpus on the result quality. The evaluation of the resulting term rankings was performed manually by four information professionals from the Scientific & Business Information Services department of Roche Diagnostics Penzberg. Interestingly, the experts disagreed often about the relevance of a term, indicating the high complexity of the problem.

We evaluated the established statistical term extraction measure tf-idf as well as previously published measures wf-idf and Log-Likelihood Ratio (LLR), and we introduced two new variants of tf-idf and wf-idf. In order to judge the quality of the resulting term lists based on the scores given by our experts, we calculated different quality measures such as the average “discounted cumulative gain” (DCG) of the different rankings. Figure 3 shows clear differences between the ranking methods we investigated: The frequently used tf-idf measure was clearly the worst option for the task, and wf-idf as well as LLR were consistently the best. The two new measures we proposed, majority-tf-idf and majority-wf-idf, were unable to reach the scores that were achieved by wf-idf and LLR, but they were also considerably better than tf-idf.

Figure 3: Influence of different ranking measures on the DCG value of extracted terms. Measure wf-idf performs best, followed by LLR and our proposed measures majority-tf-idf and majority-wf-idf. The DCG value is the lowest by far for tf-idf.

We experimented with background corpora that were either closely (“diagnostics”) or distantly (“pharma”) related to the class that we extracted the terms from, as well as a general corpus with no direct relation. Figure 4 shows the average term scores for the first 50 term ranks, demonstrating that for our purpose of extracting relevant terms for a very specific domain, there is a clear benefit from choosing a background corpus that is closely related to the domain: The scores are highest for the diagnostics background corpus, followed by the pharma corpus and the general corpus.

Figure 4: Influence of different background corpora on the average scores of extracted terms. On average, the extracted terms score highest with the closely related corpus (diagnostics) and lowest with the most distant corpus (general patents).

The identified terms that are relevant for certain classes can also be used in the opposite direction, for proposing classification components to add to keyword queries. If the user enters a keyword that has been mapped to an IPC class, this class can be suggested to the user for expanding their query. Consequently, even users unfamiliar with the IPC can profit from classification information without investing too much effort into getting to know the classification system. This is especially true for the biomedical domain, since the availability of detailed domain ontologies leads to very precise class suggestions.

Apart from mapping keywords to classes and vice versa as
shown in the previous paragraphs, it is also possible to use the co-occurrence of either to retrieve more relevant components of the same type for the query. For keywords, we have already presented various possible sources for co-occurrence statistics; for patent classes, the existing patent data represents a more direct source. In order to find closely related classes to suggest to the user, we analyzed the class co-assignments in our patent corpus. We collected all pairs of classes that were assigned to the same patent and ranked them both on the absolute number of co-assignments and the relative number in the form of their Jaccard-Index. We hypothesize that pairs of classes with high ranks in either ranking are related closely enough that many searches for one of the classes will also have additional relevant results in the second class. Figure 5 shows one example of such a pair of classes, including their definition hierarchy. Although the left class is clearly more application-oriented than the right one, we argue that many searchers interested in patents from one class will also find relevant patents in the other one. For these example classes, searching for only the first class leads to over 50% missed possible results, and searching only for the second still leads to 25% missed results.

Figure 5: Example for semantically related IPC classes without any hierarchical relation, detected using co-assignment information.

3.3     Annotation of Patent Documents with Gene/Protein Names
The biomedical search engine GoPubMed¹ offers its users faceted browsing of search results using the terms from Medical Subject Headings (MeSH) and Gene Ontology (GO) as well as a protein database. This means that the resulting documents can be filtered according to their annotation terms, allowing the user to quickly and easily reach a result set with very high relevance. This is especially useful if the annotation systems are hierarchically organized, since this adds the possibility of choosing more specific or more general filter terms in reaction to the results of the search.

¹ http://gopubmed.com/web/gopubmed/

In order to provide patent searchers with similar functionality, we need a system that can annotate patent documents with the relevant concepts from the ontological resources we intend to use. The protein/gene annotator that is used for GoPubMed provides excellent performance for the types of text it was developed for, namely biomedical abstracts. Its quality has been demonstrated at the BioCreative workshop, where it was the best-performing system for the task of gene normalization [5]. However, due to the special properties of patent text it is by no means trivial to transfer existing text mining systems to the patent domain. We therefore developed a new version of the annotator for patent text, based on the original pipeline described in [4].

In order to help us test the performance of our new annotator, professional patent searchers collected a small set of patents related to neoplasms and made it available to us. The set consisted of 50 patents in total, including a large number of USPTO patents and smaller numbers of WIPO and EPO patents. A team of master students with expertise in the field manually listed all genes and proteins mentioned in the text. Our gold standard was then created in two further steps in a semi-automated fashion, by first matching these lists to the patent text automatically and then manually curating the result of this process.

In order to evaluate our new gene annotator for patent text, we used it to assign gene names to this manually annotated test corpus of neoplasm patents. The results showed a very large variation between individual patents, as had to be expected from the equally large variation of text styles and structures of the patents. On average, we reached a somewhat satisfactory precision of 0.75, while the recall still shows a lot of room for improvement at 0.39. These values correspond to an F1 measure of 0.51. Although these results aren’t nearly as good as the ones achieved by the original BioCreative annotator, we believe that they represent a promising starting point given the inherent complexity of the patent domain. We hope that an analysis of common annotation errors will help us further adapt the system to these special requirements, leading to clear improvements especially concerning the recall of the method. Further analysis of patents with particularly good or particularly bad annotation results may also help in this process. The current version of the annotator is however already able to provide clear improvements for patent search. In preparation for the patent search prototype GoPatents, it has been applied to an EPO corpus of 1.8 million patents, to which it assigned 157 million annotations. The complex and long texts also result in high processing requirements; assigning the annotations to the aforementioned EPO corpus took approximately 6000 CPU hours.

While our corpus cannot be considered a representative sample, our analysis of its documents led to some interesting observations. With the publication years of our patents spread between 2001 and 2011, we were able to observe a significant growth in the average number of annotations per patent beginning in 2006. The highest number of annotations to a single patent surpassed 2,500 gene names. We hypothesize that the development and more wide-spread application of high-throughput techniques is at least partially responsible for this increase. We also kept track of which part of the patents individual annotations were assigned to. Unsurprisingly, the Description section was responsible for the largest number of annotations. However, a very large number of annotations is also contained in tables, which can cause problems for some automated extraction methods.

3.4    GoPatents - A Semantic Patent Search Prototype
In order to give a demonstration of some of our proposals, we implemented the patent retrieval prototype GoPatents
that enables the user to filter the resulting patent documents using terms from MeSH, Gene Ontology and a protein database. This functionality is brought over from GoPubMed, but we added the possibility of using IPC classes for the same purpose. The user interface is divided into two columns, a main window on the right and a side column on the left; an overview is given in Figure 6, showing the following main components of the system:

   • The term hierarchies (left column, second from top)
     GoPatents enables the user to refine their search using relevant concepts from different sources. The complete hierarchies of all annotation systems we used are shown continuously, with an indication of how many of the retrieved documents were annotated with each term. The user can expand lower levels of the hierarchies for more precise information. Since the IPC class codes are not informative for users without patent search experience, hovering the mouse over a code opens a pop-up window with the complete definition hierarchy of the class.

   • The additional filtering options (left column, third to fifth from top)
     GoPatents offers additional possibilities for faceted browsing: search queries can be refined further to filter for specific applicants or publication dates.

   • The search field for entering queries (main window, top)
     Queries can consist of keywords, IPC classes, terms from the different included hierarchies as well as the previously described additional filtering options.

   • The search results (main window, bottom)
     Snippets of the patents that fit the initial query as well as any additional requirements made by including or excluding other facets are displayed in the main part of the window, providing links to the full patents.

Figure 6: Overview of GoPatents patent retrieval system prototype. The query is entered in the box on top, result documents are shown below, and the faceted browsing functionality is available in the left column.

In addition to the described functionality, the user’s search history is made available, and the hierarchies can be searched for relevant concepts. Result statistics are calculated automatically and can be accessed instantly by the user as soon as the result set has been retrieved. These statistics cover multiple aspects of the result set, including the most frequently assigned terms from the different hierarchies (MeSH, GO and proteins), the most frequent patent classes and the top applicants.

4.   CONCLUSION
We presented our approaches to some of the problems that have to be faced by patent searchers, e.g., complicated text, inconsistent vocabulary and incomplete class assignments. Our suggestions include the use of automated categorization for adding assignments and improving recall, different guided patent search strategies that help the user refine their queries, and the use of automated annotators to make faceted browsing possible in the patent domain. Our prototype GoPatents demonstrates some of the potential that semantic search can bring to the patent domain.

5.   REFERENCES
 [1] K. H. Atkinson. Toward a more rational patent search paradigm. In Proceedings of the 1st ACM workshop on Patent information retrieval, PaIR ’08, pages 37–40. ACM, 2008.
 [2] S. Brügmann. PATEXPERT project deliverable 8.1 - state of the art in patent processing, 2006.
 [3] Y.-L. Chen and Y.-C. Chang. A three-phase method for patent classification. Information Processing & Management, 48(6):1017–1030, 2012.
 [4] J. Hakenberg, C. Plake, R. Leaman, M. Schroeder, and G. Gonzalez. Inter-species normalization of gene mentions with GNAT. Bioinformatics, 24(16), 2008.
 [5] A. A. Morgan, Z. Lu, X. Wang, A. M. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, et al. Overview of BioCreative II gene normalization. Genome biology, 9(Suppl 2):S3, 2008.
 [6] D. Tikk, G. Biró, and A. Törcsvári. A hierarchical online classifier for patent categorization. Emerging Technologies of Text Mining: Techniques and Applications. IGI Global, 2008.
 [7] A. Trappey, F. Hsu, C. Trappey, and C. Lin. Development of a patent document classification and search platform using a back-propagation network. Expert Systems with Applications, 31(4):755–765, 2006.
 [8] G. Tsatsaronis, N. Macari, S. Torge, H. Dietze, and M. Schroeder. A maximum-entropy approach for accurate document annotation in the biomedical domain. Journal of Biomedical Semantics, 3:S2, 2012.
 [9] S. Verberne, M. Vogel, and E. D’hondt. Patent classification experiments with the linguistic classification system LCS. In Proceedings of CLEF 2010, CLEF-IP Workshop, 2010.
[10] World Intellectual Property Organization. World intellectual property indicators - 2012 edition, 2012.
[11] World Intellectual Property Organization. World intellectual property indicators - 2013 edition, 2013.
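
To illustrate the faceted browsing and result statistics described for GoPatents in Section 3.4, the following minimal Python sketch shows one way hierarchical facet counts, facet filtering and result statistics could be computed. It is not the GoPatents implementation: the toy hierarchy, the patent records, and all function names (with_ancestors, expanded_terms, facet_counts, filter_by_term, result_statistics) are hypothetical and chosen only for illustration. The sketch assumes that a document annotated with a specific term also counts towards all more general terms above it.

```python
from collections import Counter

# Hypothetical toy hierarchy: child term -> parent term (None marks a root).
PARENT = {
    "protein binding": "binding",
    "binding": "molecular function",
    "molecular function": None,
    "kinase activity": "catalytic activity",
    "catalytic activity": "molecular function",
}

# Hypothetical annotated result set; identifiers, terms and applicants are made up.
RESULTS = [
    {"id": "P1", "terms": {"protein binding"}, "ipc": "C12N", "applicant": "A Corp"},
    {"id": "P2", "terms": {"kinase activity"}, "ipc": "C12N", "applicant": "B Inc"},
    {"id": "P3", "terms": {"protein binding", "kinase activity"},
     "ipc": "A61K", "applicant": "A Corp"},
]


def with_ancestors(term):
    """Return the term together with all of its ancestors in the hierarchy."""
    closure = set()
    while term is not None:
        closure.add(term)
        term = PARENT.get(term)
    return closure


def expanded_terms(doc):
    """All terms a document matches, directly or via a more specific term."""
    expanded = set()
    for term in doc["terms"]:
        expanded |= with_ancestors(term)
    return expanded


def facet_counts(docs):
    """Count, per term, the documents annotated with it or with a descendant."""
    counts = Counter()
    for doc in docs:
        counts.update(expanded_terms(doc))
    return counts


def filter_by_term(docs, term):
    """Keep only documents annotated with the term or a more specific one."""
    return [doc for doc in docs if term in expanded_terms(doc)]


def result_statistics(docs, top_n=3):
    """Most frequent directly assigned terms, patent classes and applicants."""
    return {
        "terms": Counter(t for d in docs for t in d["terms"]).most_common(top_n),
        "ipc": Counter(d["ipc"] for d in docs).most_common(top_n),
        "applicants": Counter(d["applicant"] for d in docs).most_common(top_n),
    }
```

Selecting a general term such as "binding" with filter_by_term narrows the result set to every document annotated with that term or any term below it, which mirrors the option of choosing more general or more specific filter terms. Recomputing facet_counts and result_statistics on the filtered list then yields the counts for the refined result set.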