<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Developing Semantic Search for the Patent Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Eisinger</string-name>
          <email>daniel.eisinger@biotec.tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Mönnich</string-name>
          <email>jan.moennich@biotec.tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Schroeder</string-name>
          <email>ms@biotec.tu-dresden.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Dresden, BIOTEC</institution>
          ,
          <addr-line>Tatzberg 47/49, 01307 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Dresden, BIOTEC</institution>
          ,
          <addr-line>Tatzberg 47/49, 01307 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The patent domain is a very important source of scientific information that is currently not used to its full potential. Issues such as high numbers of patents, complicated language style and inconsistently used vocabulary make the task of searching for relevant patents extremely complex. While this is already a problem for patent professionals who have to invest a lot of time and effort into their search, it is even more problematic for academic scientists with little experience in this domain. Semantic search functionality has been demonstrated to provide large advantages for document search in other domains. As an example, the search engine GoPubMed offers advanced search functionality for the biomedical domain based on annotating documents with relevant concepts from various ontologies. In this paper, we report on our efforts to provide comparable advances for the patent domain. We introduce the patent search prototype GoPatents, and we describe the experiments that we performed during its development in the areas of term extraction, term and IPC class co-occurrence analysis, automated patent categorization, and automated annotation with ontology concepts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        As evidenced by a growing number of reports about various
high-profile patent trials in recent years, having the
necessary information about all relevant competitor patents can
be vital to a company's interests. At the same time, patents
can also be a valuable source for academic research, since
current research results are often first published in a patent
and only afterwards (or never) in a journal. Experts have
estimated that only 10-15% of the patent content is also
described in other publications, and that 80-90% of all
scientific knowledge is contained in patents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Despite that
potential, most academic researchers are to our knowledge
not using patents, presumably due to the high complexity
of the domain.
      </p>
      <p>Copyright © 2014 for the individual papers by the papers’ authors.
Copying permitted for private and academic purposes.</p>
      <p>This volume is published and copyrighted by its editors.</p>
      <p>Published at Ceur-ws.org.
Proceedings of the First International Workshop on Patent Mining and Its
Applications (IPAMIN) 2014, Hildesheim, Oct. 7th, 2014,
at KONVENS’14, October 8–10, 2014, Hildesheim, Germany.</p>
      <p>
        The number of patent applications continues to rise,
reaching 2.35 million worldwide in 2012 alone [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], only one year
after surpassing two million for the first time in 2011
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The number of patent grants is also at an all-time high,
exceeding the one million mark for the first time in 2012 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Additionally, the documents are not always available in
English, which makes finding all relevant documents extremely
difficult. But even for the documents with English-language
versions, there are some unique challenges that separate the
patent domain from most other document types.
While it is not unusual to rely mainly on keywords for
searching most other document corpora, this approach does not
return satisfactory results for many patent search tasks.
Different sections of the patent text are written in completely
different styles, patent authors don't always use standard
terminology (or it may not even exist), and many patents
are written in very unspecific language. The problem has
been summarized by the European Patent Office (EPO) in
the following way, using the term "patentese" for the
unconventional language style that is typically only used in
patents: "Newcomers to intellectual property are often
surprised or even shocked at the way words or phrases familiar
in everyday language are used very differently in the world
of patents. Grammatical constructions that would be
unthinkable in everyday speech or writing are used routinely
in patentese. Patentese has words which do not even exist in
ordinary languages. Furthermore patentese exists in every
conceivable natural language version" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>As a result of these problems, professional patent searches
usually don't rely exclusively on keywords. The most
important way to improve pure keyword searches is through
the use of the classification information that is provided by
the patent offices. This information can also be used to
filter or expand search results, but in order to make the
most of these possibilities, the searcher must have detailed
knowledge about the classification system. Unfortunately,
this is not the case for many academic researchers. Even
for professional patent searchers, the process of constructing
and refining patent queries is quite complicated and
time-consuming.</p>
      <p>Consequently, it is desirable to offer a system that provides
an easier option for scientists to perform high-quality patent
searches and assists patent professionals in completing and
refining their initial queries. In order to provide such
assistance, it is important to have a clear understanding of
the properties of patent classification systems. We
therefore carry out an in-depth investigation of the most
common patent classification system, the International Patent
Classification (IPC). Since the benefit of using existing
annotations for semantic search has already been demonstrated
in the biomedical domain, we use the controlled vocabulary
"Medical Subject Headings" (MeSH), which is used to annotate
all document abstracts in the biomedical literature database
PubMed, as a point of comparison. Following this analysis,
we give a detailed description of multiple approaches we are
proposing to improve patent search, and we introduce the
patent retrieval prototype GoPatents that incorporates some
of these proposals.</p>
    </sec>
    <sec id="sec-2">
      <title>2. COMPARATIVE ANALYSIS OF MESH AND IPC</title>
      <p>Our analysis of MeSH and IPC can be divided into three
parts: The first two parts concern the respective hierarchies
and terms of the systems themselves, while the third part
examines their usage for document classification. We analyzed
the latter by collecting classification information from a large
patent corpus as well as the annotations to all PubMed
documents published by early 2011. Table 1 summarizes some
core results of our analysis.</p>
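      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Property</th><th>MeSH</th><th>IPC</th></tr>
          </thead>
          <tbody>
            <tr><td>number of hierarchy entries</td><td>54095</td><td>69487</td></tr>
            <tr><td>number of unique entries</td><td>26581</td><td>69487</td></tr>
            <tr><td>number of main trees</td><td>16</td><td>8</td></tr>
            <tr><td>number of hierarchy levels</td><td>13</td><td>14</td></tr>
            <tr><td>occurrence of class labels in text</td><td>frequent</td><td>very rare</td></tr>
            <tr><td>average number of annotations per document</td><td>9</td><td>2</td></tr>
            <tr><td>proportion of documents with multiple annotations</td><td>86%</td><td>53%</td></tr>
            <tr><td>proportion of documents with related annotations (i.e., same hierarchy tree)</td><td>81%</td><td>46%</td></tr>
          </tbody>
        </table>
      </table-wrap>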
      <p>The number of unique MeSH entries is considerably smaller
than the number for IPC, but since the hierarchy tree of
MeSH allows for the same heading to appear more than
once, the sizes are comparable, as are the hierarchies (cf.
Figure 1).</p>
      <p>The comparison of the terms, on the other hand, shows some
major differences. IPC is focused on alphanumeric class
codes while MeSH emphasizes terms; IPC definitions are
longer, more complicated and less self-contained than MeSH
headings, and are therefore much less likely to appear in the
text. As Figure 2 shows, there are also large differences
between the numbers of MeSH annotations per document and
the numbers of IPC annotations per patent: While most
patents have fewer than five assigned classes, PubMed
documents have around nine on average and often
considerably more. Additionally, we were able to show that
the annotation sets for patents are much less diverse than
for PubMed, leading us to question the completeness of the
existing assignments.</p>
      <p>We therefore believe that the use of IPC for patent search
comes with two serious disadvantages: First, the complexity
of the system causes significant problems for non-professional
patent searchers since it is very difficult to find the complete
set of IPC classes that are relevant for the search task at
hand. Second, the low number of class assignments may lead
to unexpectedly low recall for classification-based patent
searches that are often performed by professional searchers
in order to overcome the problems of keyword search.</p>
    </sec>
    <sec id="sec-3">
      <title>3. SEMANTIC SEARCH FOR THE PATENT DOMAIN</title>
    </sec>
    <sec id="sec-4">
      <p>This section describes our attempts to solve the problems
caused for patent search by incomplete class assignments
and complex patent text. We automatically assign
additional classes, expand initial queries, and we annotate patent
documents to make faceted search functionality possible.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Patent Categorization</title>
      <p>The most straightforward way of dealing with the problem
of incomplete class assignments would be the assignment of
additional classes, but due to the high number of patents as
well as the high complexity of the classification system, this
can only be done automatically. Depending on the accuracy
of the automatic assignment of relevant classes, the method
can be useful for two related but different ways of dealing
with the low number of assigned patent classes:</p>
      <sec id="sec-5-1">
        <list list-type="order">
          <list-item>
            <p>Given a class, find documents for this class.</p>
            <p>If the user knows that a particular class is highly
relevant for their search, the automatic class assignments
can be used to discover additional patents that should
have been assigned to the class. The recall of the
search can therefore be improved considerably.</p>
          </list-item>
          <list-item>
            <p>Given a document, find classes for this document.</p>
            <p>If the user has already collected a small set of
relevant documents, the automatically assigned classes for
these documents can help them find the classes that are
related to these documents, even if there is no
classification data available or if there are missing
assignments. These additional classes again enable them to
refine their initial search query.</p>
          </list-item>
        </list>
        <p>
          Previous approaches to automated patent categorization were
usually restricted to higher levels of the hierarchy (e.g., [
          <xref ref-type="bibr" rid="ref6 ref7 ref9">7,
9, 6</xref>
          ]). The only prior effort to classify patents down to the
lowest level of the IPC involved a complicated three-phase
algorithm that is not well suited for application on a large
corpus; in addition, it already removes large parts of the
hierarchy in the first step, which we believe makes it too
restrictive for our goal of finding new relevant but
potentially very different classes that were not previously assigned
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We therefore based our system on an approach that
has been used successfully for the automated assignment of
MeSH terms to PubMed documents by Tsatsaronis et al.
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It is based on training a series of Maximum
Entropy classifiers (one for each class) on existing class assignments
and applying them to each document that is supposed to
get additional class assignments.
        </p>
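        <p>The one-classifier-per-class idea can be illustrated with a minimal sketch. It is not the implementation used for the experiments: the binary Maximum Entropy model is approximated here by a small logistic regression trained with stochastic gradient ascent on word-presence features, and the toy corpus, the function names and all hyperparameters are assumptions chosen for illustration only.</p>
        <preformat><![CDATA[
```python
import math
from collections import defaultdict

# Hypothetical toy corpus: (tokens, assigned IPC classes) per document.
TOY_CORPUS = [
    (["catheter", "distal", "tube", "lumen"], {"A61M 25/00"}),
    (["rod", "bone", "screw", "member"], {"A61B 17/70"}),
    (["catheter", "tube", "end"], {"A61M 25/00"}),
    (["bone", "screw", "portion"], {"A61B 17/70"}),
]


def train_binary_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit a binary logistic regression (two-class Maximum Entropy model)
    on word-presence features using plain stochastic gradient ascent."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for tokens, y in zip(docs, labels):
            features = set(tokens)
            z = bias + sum(weights[t] for t in features)
            p = 1.0 / (1.0 + math.exp(-z))
            err = y - p  # gradient of the log-likelihood for this document
            bias += lr * err
            for t in features:
                weights[t] += lr * err
    return dict(weights), bias


def predict_proba(model, tokens):
    """Return the model's confidence that the document belongs to its class."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) for t in set(tokens))
    return 1.0 / (1.0 + math.exp(-z))


def train_per_class_models(corpus):
    """Train one binary classifier per class on the existing assignments."""
    all_classes = set().union(*(classes for _, classes in corpus))
    docs = [tokens for tokens, _ in corpus]
    models = {}
    for ipc in all_classes:
        labels = [1 if ipc in classes else 0 for _, classes in corpus]
        models[ipc] = train_binary_maxent(docs, labels)
    return models


def assign_classes(models, tokens, threshold=0.5):
    """Apply every classifier and keep the classes above the threshold."""
    return {ipc for ipc, model in models.items()
            if predict_proba(model, tokens) >= threshold}


models = train_per_class_models(TOY_CORPUS)
print(assign_classes(models, ["bone", "screw", "rod"]))
```
]]></preformat>
        <p>Because every classifier is applied to every document, the cost of this scheme grows with the number of classes, which is why the choice of the confidence threshold matters for the number of additional assignments.</p>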
        <p>In order to evaluate the results of our categorization efforts,
we constructed two training corpora from the EPO dataset
that was also the basis of our previous analysis. The first
corpus (C73) follows strict quality requirements and contains
73 classes while the second one (C1205) has more relaxed
requirements and is therefore much larger with 1205 classes.
This size difference in connection with the expected higher
quality of the documents due to the constraints we
mentioned above should lead to better categorization results for
C73 than for C1205. With our initial evaluation, we tested
our method's ability to retrieve the classes that were
actually assigned to the patents. Therefore, all of these classes
were considered correct while everything else was considered
wrong. While this approach cannot evaluate our method's
suitability for our objective of assigning new classes, it is
nevertheless valuable for determining the quality of the
classifiers by comparing their results with the categorization
decisions made by the experts at the patent offices.
Table 2 shows the macro-average scores (precision, recall and
F1-measure) of all classifiers using 10-fold cross-validation
for the confidence threshold 0.5.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Corpus</title>
        <p>The results are for the most
part encouraging, with most values approaching 0.9. For the
purpose of our first task, this means that we can retrieve
additional documents with high confidence. The second task,
finding additional classes for a given document, is more
problematic, however. Since we apply all classification models to
all documents, a precision score of 0.9 leads to a high
number of incorrect assignments. While using higher values for
the confidence threshold has a positive effect on precision,
it is accompanied by a severe drop in recall and therefore
leads to a significantly lower F1-measure. This problem is
caused by slower precision growth for the individual
classifiers compared to the situation for PubMed/MeSH, making
additional steps necessary. We propose two filtering options.
First, since most patent queries also include a keyword component,
many of the incorrect assignments are filtered out
automatically since they don't include the required keywords.
Second, we implemented a filter that accepts additional
class assignments only if there is an existing patent that
was assigned a similar combination of classes. The filter has
multiple possible settings, from very restrictive (only allow
classes that have previously co-occurred directly) to much
less so (allow pairs of classes if their respective ancestors of
a certain hierarchy level have been co-assigned). For a small
set of example patents, this filter had the desired effect of
filtering out unrelated classes while accepting related ones.</p>
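        <p>A co-assignment filter of this kind could be sketched as follows. The sketch rests on several assumptions that are not taken from our implementation: ancestors are obtained by simply truncating the class symbol (which ignores the dot-based subgroup hierarchy of the IPC), a candidate that shares its ancestor with an existing class is accepted, and the level names, function names and toy patents are hypothetical.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def ancestor(ipc_code, level):
    """Truncate an IPC symbol such as 'A61B 17/70' to a coarser level.

    Assumed level names: 'section' (A), 'class' (A61), 'subclass' (A61B),
    'main_group' (A61B 17) and 'full' (unchanged). This is a string-based
    approximation of the hierarchy, not a lookup in the real IPC scheme.
    """
    code = ipc_code.strip()
    if level == "section":
        return code[:1]
    if level == "class":
        return code[:3]
    if level == "subclass":
        return code[:4]
    if level == "main_group":
        return code.split("/")[0]
    if level == "full":
        return code
    raise ValueError(f"unknown level: {level}")


def build_cooccurrence(patent_classes, level):
    """Collect all unordered ancestor pairs co-assigned to a single patent."""
    pairs = set()
    for classes in patent_classes:
        truncated = sorted({ancestor(c, level) for c in classes})
        pairs.update(combinations(truncated, 2))
    return pairs


def filter_candidates(existing, candidates, cooccurring, level):
    """Keep a candidate class only if it (or its ancestor) has been
    co-assigned with one of the patent's existing classes."""
    accepted = set()
    for cand in candidates:
        cand_anc = ancestor(cand, level)
        for cls in existing:
            cls_anc = ancestor(cls, level)
            pair = tuple(sorted((cand_anc, cls_anc)))
            if cand_anc == cls_anc or pair in cooccurring:
                accepted.add(cand)
                break
    return accepted


# Hypothetical class assignments of four existing patents.
TOY_PATENTS = [
    {"A61B 5/00", "G01N 33/50"},
    {"A61B 17/00", "A61M 25/00"},
    {"A61B 17/70"},
    {"A61F 13/15"},
]
EXISTING = {"A61B 17/70"}
CANDIDATES = {"A61M 25/00", "G01N 33/50", "A61F 13/15"}

strict = build_cooccurrence(TOY_PATENTS, "full")
relaxed = build_cooccurrence(TOY_PATENTS, "subclass")
print(filter_candidates(EXISTING, CANDIDATES, strict, "full"))
print(filter_candidates(EXISTING, CANDIDATES, relaxed, "subclass"))
```
]]></preformat>
        <p>In this toy setting the strict setting rejects all three candidates, whereas the subclass-level setting accepts the two candidates whose subclasses have been co-assigned with A61B and still rejects A61F 13/15.</p>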
      </sec>
      <sec id="sec-5-7">
        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>IPC code</th><th>Class definition (abbrev.)</th><th>Features 1 to 5</th></tr>
            </thead>
            <tbody>
              <tr><td>A61B 5/00</td><td>Measurement for diag. purposes</td><td>light, sensor, blood, patient, tissue</td></tr>
              <tr><td>A61B 17/00</td><td>Surgical instruments</td><td>tissue, suture, end, surgical, closure</td></tr>
              <tr><td>A61B 17/70</td><td>Spinal positioners</td><td>rod, bone, portion, member, screw</td></tr>
              <tr><td>A61F 13/15</td><td>Absorbent pads</td><td>absorbent, material, napkin, web, diaper</td></tr>
              <tr><td>A61M 25/00</td><td>Catheter</td><td>catheter, distal, end, tube, lumen</td></tr>
              <tr><td>G01N 33/50</td><td>Chemical analysis of biol. materials</td><td>sample, test, cell, specimen, light</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The quality of the trained classifiers can also intuitively be
judged by looking at the features that make the largest
difference in categorizing documents. Table 3 shows the five
most influential positive features from binary
Maximum Entropy classifiers for a subset of IPC classes with
biomedical significance, i.e., the features that were assigned the
highest positive values by the Maximum Entropy method. The
occurrence of these words in a document that is supposed to
be classified increases the likelihood of positive classification;
in other words, the document is more likely to be assigned
the category represented by the classifier. Almost all
features listed in the table appear to be well suited to making
this distinction, since they are representative of their
respective class. Although some of the class definitions are closely
related, there is very little overlap in the most influential
features. As an example, the five top features are completely
disjoint for class A61B 17/00 about surgical instruments
and its descendant A61B 17/70 about spinal positioners.</p>
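        <p>The overlap statement can be checked directly on the feature lists of Table 3. The following short sketch copies the table into a dictionary and lists every pair of classes that shares at least one top feature; the variable and function names are chosen for illustration only.</p>
        <preformat><![CDATA[
```python
from itertools import combinations

# Top five positive features per class, copied from Table 3.
TOP_FEATURES = {
    "A61B 5/00": ["light", "sensor", "blood", "patient", "tissue"],
    "A61B 17/00": ["tissue", "suture", "end", "surgical", "closure"],
    "A61B 17/70": ["rod", "bone", "portion", "member", "screw"],
    "A61F 13/15": ["absorbent", "material", "napkin", "web", "diaper"],
    "A61M 25/00": ["catheter", "distal", "end", "tube", "lumen"],
    "G01N 33/50": ["sample", "test", "cell", "specimen", "light"],
}


def shared_features(top_features):
    """Map each pair of classes to the top features they have in common,
    omitting pairs without any shared feature."""
    overlaps = {}
    for (a, feats_a), (b, feats_b) in combinations(top_features.items(), 2):
        common = set(feats_a) & set(feats_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps


for pair, common in shared_features(TOP_FEATURES).items():
    print(pair, sorted(common))
```
]]></preformat>
        <p>Of the fifteen class pairs, only three share a feature (one word each), and the pair A61B 17/00 and A61B 17/70 is not among them.</p>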
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.2 Guided Patent Search</title>
      <p>The second part of our approach to address the problem of
low numbers of patent class assignments and simplify patent
search combines multiple systems intended to guide the user
towards quickly and easily formulating patent queries that
are as complete as possible. An initial user query is used
to determine additional relevant query components. Since
professional patent search queries are a combination of
keywords and class codes in most cases, we investigated ways
to expand both of these components. The discovered terms
and classes are recommended to the user so they can decide
which of the proposals should be included in the final query.</p>
      <p>We demonstrated that additional relevant keywords can be
extracted from a variety of sources including IPC class
definitions and external resources such as MeSH. Most
importantly, we extract keywords from existing patents using
established natural language processing techniques after an
initial evaluation showed the validity of this approach. Our
method is based on analyzing patents from an IPC class
that has been identified as relevant by the user. Since
significant numbers of documents are available for most patent
classes, this approach is able to deliver large numbers of
keyword suggestions that are characteristic for the
respective class. In a way, extracting relevant words from class
patents is an expansion of our categorization efforts. Table
3 shows that this approach is able to discover useful
keywords for search. Since we are also interested in relevant
multi-word terms, we performed a more in-depth
examination of different ranking algorithms for such extracted term
candidates. Additionally, we investigated the influence of
the background corpus on the result quality. The
evaluation of the resulting term rankings was performed manually
by four information professionals from the Scientific &amp;
Business Information Services department of Roche Diagnostics
Penzberg. Interestingly, the experts often disagreed about
the relevance of a term, indicating the high complexity of
the problem.</p>
      <p>We evaluated the established statistical term extraction
measure tf-idf as well as the previously published measures wf-idf
and Log-Likelihood Ratio (LLR), and we introduced two new
variants of tf-idf and wf-idf. In order to judge the quality
of the resulting term lists based on the scores given by our
experts, we calculated different quality measures such as the
average "discounted cumulative gain" (DCG) of the
different rankings. Figure 3 shows clear differences between the
ranking methods we investigated: The frequently used tf-idf
measure was clearly the worst option for the task, and
wf-idf as well as LLR were consistently the best. The two new
measures we proposed, majority-tf-idf and majority-wf-idf,
were unable to reach the scores that were achieved by wf-idf
and LLR, but they were also considerably better than tf-idf.</p>
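      <p>For readers unfamiliar with DCG, the following sketch shows one common formulation, in which the expert score of the term at rank i is divided by log2(i + 1). The exact DCG variant and the relevance scale used in the evaluation are not specified here, so both the formula and the example scores below are assumptions.</p>
      <preformat><![CDATA[
```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores,
    using the common rel_i / log2(i + 1) discount (ranks start at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def average_dcg(rankings):
    """Mean DCG over several rankings, e.g. one per expert or per class."""
    return sum(dcg(r) for r in rankings) / len(rankings)


# Hypothetical expert scores for the first four terms of two rankings.
good_ranking = [3, 2, 1, 0]   # relevant terms ranked first
poor_ranking = [0, 1, 2, 3]   # relevant terms ranked last
print(dcg(good_ranking), dcg(poor_ranking))
print(average_dcg([good_ranking, poor_ranking]))
```
]]></preformat>
      <p>A ranking method that places the terms judged relevant near the top obtains a higher DCG than one that returns the same terms in a worse order, which is the property that makes the measure suitable for comparing term rankings.</p>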
      <p>We experimented with background corpora that were either
closely ("diagnostics") or distantly ("pharma") related to the
class that we extracted the terms from, as well as a general
corpus with no direct relation. Figure 4 shows the average
term scores for the first 50 term ranks, demonstrating that
for our purpose of extracting relevant terms for a very
specific domain, there is a clear benefit from choosing a
background corpus that is closely related to the domain: The
scores are highest for the diagnostics background corpus,
followed by the pharma corpus and the general corpus.</p>
      <p>The identified terms that are relevant for certain classes can
also be used in the opposite direction, for proposing
classification components to add to keyword queries. If the user
enters a keyword that has been mapped to an IPC class, this
class can be suggested to the user for expanding their query.
Consequently, even users unfamiliar with the IPC can profit
from classification information without investing too much
effort into getting to know the classification system. This
is especially true for the biomedical domain, since the
availability of detailed domain ontologies leads to very precise
class suggestions.</p>
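      <p>The keyword-to-class direction can be sketched as a simple inverted index. The example below reuses three rows of Table 3 as class-characteristic terms; the ranking by number of matching keywords and all names are assumptions made for illustration and do not describe the prototype's actual suggestion logic.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Class-characteristic terms, taken from three rows of Table 3.
CLASS_TERMS = {
    "A61B 5/00": ["light", "sensor", "blood", "patient", "tissue"],
    "A61B 17/00": ["tissue", "suture", "end", "surgical", "closure"],
    "A61M 25/00": ["catheter", "distal", "end", "tube", "lumen"],
}


def build_keyword_index(class_terms):
    """Invert the class-to-terms mapping into a keyword-to-classes index."""
    index = defaultdict(set)
    for ipc, terms in class_terms.items():
        for term in terms:
            index[term.lower()].add(ipc)
    return index


def suggest_classes(query, index):
    """Return candidate IPC classes for a keyword query, ordered by the
    number of matching keywords (ties broken alphabetically)."""
    matches = defaultdict(set)
    for keyword in query.lower().split():
        for ipc in index.get(keyword, ()):
            matches[ipc].add(keyword)
    return sorted(matches, key=lambda ipc: (-len(matches[ipc]), ipc))


index = build_keyword_index(CLASS_TERMS)
print(suggest_classes("catheter lumen tissue", index))
```
]]></preformat>
      <p>The suggested classes would then be shown to the user, who decides which of them to add to the query.</p>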
      <p>Apart from mapping keywords to classes and vice versa as
shown in the previous paragraphs, it is also possible to use
the co-occurrence of either to retrieve more relevant
components of the same type for the query. For keywords, we have
already presented various possible sources for co-occurrence
statistics; for patent classes, the existing patent data
represents a more direct source. In order to find closely
related classes to suggest to the user, we analyzed the class
co-assignments in our patent corpus. We collected all pairs
of classes that were assigned to the same patent and ranked
them both on the absolute number of co-assignments and
the relative number in the form of their Jaccard index. We
hypothesize that pairs of classes with high ranks in either
ranking are related closely enough that many searches for
one of the classes will also have additional relevant results in
the second class. Figure 5 shows one example of such a pair
of classes, including their definition hierarchy. Although the
left class is clearly more application-oriented than the right
one, we argue that many searchers interested in patents from
one class will also find relevant patents in the other one. For
these example classes, searching for only the first class leads
to over 50% missed possible results, and searching only for
the second still leads to 25% missed results.</p>
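      <p>The two rankings can be computed in a single pass over the class assignments. In the sketch below, the Jaccard index of two classes is the number of patents assigned to both divided by the number of patents assigned to at least one of them; the toy assignments and the function name are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def rank_class_pairs(patent_classes):
    """Rank co-assigned class pairs by absolute count and by Jaccard index.

    Returns two lists of (class_a, class_b, co_assignments, jaccard) tuples.
    """
    class_counts = Counter()
    pair_counts = Counter()
    for classes in patent_classes:
        unique = sorted(set(classes))
        class_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rows = []
    for (a, b), both in pair_counts.items():
        either = class_counts[a] + class_counts[b] - both  # size of the union
        rows.append((a, b, both, both / either))

    by_count = sorted(rows, key=lambda r: (-r[2], r[0], r[1]))
    by_jaccard = sorted(rows, key=lambda r: (-r[3], r[0], r[1]))
    return by_count, by_jaccard


# Hypothetical class assignments of five patents.
TOY_ASSIGNMENTS = [
    {"A61B 5/00", "G01N 33/50"},
    {"A61B 5/00", "G01N 33/50"},
    {"A61B 5/00"},
    {"A61B 5/00", "A61M 25/00"},
    {"A61M 25/00"},
]

by_count, by_jaccard = rank_class_pairs(TOY_ASSIGNMENTS)
for row in by_jaccard:
    print(row)
```
]]></preformat>
      <p>The absolute count favors pairs of frequently assigned classes, while the Jaccard index also surfaces pairs of rare classes that almost always occur together, which is why both rankings are considered.</p>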
    </sec>
    <sec id="sec-7">
      <title>3.3 Annotation of Patent Documents with Gene/Protein Names</title>
      <p>
        The biomedical search engine GoPubMed
(http://gopubmed.com/web/gopubmed/) offers its users
faceted browsing of search results using the terms from
Medical Subject Headings (MeSH) and Gene Ontology (GO) as
well as a protein database. This means that the
resulting documents can be filtered according to their annotation
terms, allowing the user to quickly and easily reach a
result set with very high relevance. This is especially useful
if the annotation systems are hierarchically organized, since
this adds the possibility of choosing more specific or more
general filter terms in reaction to the results of the search.
In order to provide patent searchers with similar
functionality, we need a system that can annotate patent documents
with the relevant concepts from the ontological resources we
intend to use. The protein/gene annotator that is used for
GoPubMed provides excellent performance for the types of
text it was developed for, namely biomedical abstracts. Its
quality has been demonstrated at the BioCreative workshop,
where it was the best-performing system for the task of gene
normalization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, due to the special properties of
patent text it is by no means trivial to transfer existing text
mining systems to the patent domain. We therefore
developed a new version of the annotator for patent text, based
on the original pipeline described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In order to help us test the performance of our new
annotator, professional patent searchers collected a small set of
patents related to neoplasms and made it available to us.
The set consisted of 50 patents in total, including a large
number of USPTO patents and smaller numbers of WIPO
and EPO patents. A team of master students with expertise
in the field manually listed all genes and proteins mentioned
in the text. Our gold standard was then created in two
further steps in a semi-automated fashion, by first matching
these lists to the patent text automatically and then
manually curating the result of this process.</p>
      <p>In order to evaluate our new gene annotator for patent text,
we used it to assign gene names to this manually
annotated test corpus of neoplasm patents. The results showed
a very large variation between individual patents, as had to
be expected from the equally large variation of text styles
and structures of the patents. On average, we reached a
somewhat satisfactory precision of 0.75, while the recall still
shows a lot of room for improvement at 0.39. These values
correspond to an F1 measure of 0.51. Although these
results aren't nearly as good as the ones achieved by the
original BioCreative annotator, we believe that they represent a
promising starting point given the inherent complexity of the
patent domain. We hope that an analysis of common
annotation errors will help us further adapt the system to these
special requirements, leading to clear improvements
especially concerning the recall of the method. Further analysis
of patents with particularly good or particularly bad
annotation results may also help in this process. The current
version of the annotator is however already able to provide
clear improvements for patent search. In preparation for the
patent search prototype GoPatents, it has been applied to an
EPO corpus of 1.8 million patents, to which it assigned 157
million annotations. The complex and long texts also result
in high processing requirements; assigning the annotations
to the aforementioned EPO corpus took approximately 6000
CPU hours.</p>
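      <p>The reported figures can be related to each other with a few lines of arithmetic. The sketch below recomputes the F1 measure as the harmonic mean of the average precision and recall and derives rough per-patent averages from the corpus totals; since the CPU time is only given approximately, the derived values are estimates, and the variable names are chosen for illustration.</p>
      <preformat><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall on the neoplasm test corpus, as reported.
f1 = f1_score(0.75, 0.39)

# Totals reported for the EPO corpus.
patents = 1.8e6
annotations = 157e6
cpu_hours = 6000  # approximate value

annotations_per_patent = annotations / patents
cpu_seconds_per_patent = cpu_hours * 3600 / patents

print(round(f1, 2))
print(round(annotations_per_patent, 1))
print(round(cpu_seconds_per_patent, 1))
```
]]></preformat>
      <p>Under these figures, the annotator assigns on the order of 87 annotations per patent and needs roughly 12 CPU seconds per document.</p>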
      <p>While our corpus cannot be considered a representative
sample, our analysis of its documents led to some interesting
observations. With the publication years of our patents spread
between 2001 and 2011, we were able to observe a significant
growth in the average number of annotations per patent
beginning in 2006. The highest number of annotations to a
single patent surpassed 2,500 gene names. We hypothesize
that the development and more widespread application of
high-throughput techniques is at least partially responsible
for this increase. We also kept track of which part of the
patents individual annotations were assigned to.
Unsurprisingly, the Description section was responsible for the largest
number of annotations. However, a very large number of
annotations is also contained in tables, which can cause
problems for some automated extraction methods.</p>
    </sec>
    <sec id="sec-8">
      <title>3.4 GoPatents - A Semantic Patent Search Prototype</title>
    </sec>
    <sec id="sec-9">
      <p>In order to give a demonstration of some of our proposals,
we implemented the patent retrieval prototype GoPatents
that enables the user to filter the resulting patent
documents using terms from MeSH, Gene Ontology and a protein
database. This functionality is brought over from
GoPubMed, but we added the possibility of using IPC classes for
the same purpose. The user interface is divided into two
columns, a main window on the right and a side column
on the left; an overview is given in Figure 6, showing the
following main components of the system:</p>
      <def-list>
        <def-item>
          <term>The term hierarchies (left column, second from top)</term>
          <def>
            <p>GoPatents enables the user to refine their search
using relevant concepts from different sources. The
complete hierarchies of all annotation systems we used are
shown continuously, with an indication of how many of
the retrieved documents were annotated with each concept. The
user can expand lower levels of the hierarchies for more
precise information. Since the IPC class codes are not
informative for users without patent search experience,
hovering the mouse over a code opens a pop-up window
with the complete definition hierarchy of the class.</p>
          </def>
        </def-item>
        <def-item>
          <term>The additional filtering options (left column, third to fifth from top)</term>
          <def>
            <p>GoPatents offers additional possibilities for faceted
browsing: Search queries can be refined further to filter for
specific applicants or publication dates.</p>
          </def>
        </def-item>
        <def-item>
          <term>The search field for entering queries (main window, top)</term>
          <def>
            <p>Queries can consist of keywords, IPC classes, terms
from the different included hierarchies as well as the
previously described additional filtering options.</p>
          </def>
        </def-item>
        <def-item>
          <term>The search results (main window, bottom)</term>
          <def>
            <p>Snippets of the patents that fit the initial query as well
as any additional requirements made by including or
excluding other facets are displayed in the main part
of the window, providing links to the full patents.</p>
          </def>
        </def-item>
      </def-list>
      <p>In addition to the described functionality, the user's search
history is made available, and the hierarchies can be searched
for relevant concepts. Result statistics are calculated
automatically and can be accessed instantly by the user as soon
as the result set has been retrieved. These statistics cover
multiple aspects of the result set, including the most
frequently assigned terms from the different hierarchies (MeSH,
GO and proteins), the most frequent patent classes and the
top applicants.</p>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSION</title>
      <p>We presented our approaches to some of the problems that
have to be faced by patent searchers, e.g., complicated text,
inconsistent vocabulary and incomplete class assignments.
Our suggestions include the use of automated
categorization for adding assignments and improving recall,
different guided patent search strategies that help the user refine
their queries, and the use of automated annotators to make
faceted browsing possible in the patent domain. Our
prototype GoPatents demonstrates some of the potential that
semantic search can bring to the patent domain.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          .
          <article-title>Toward a more rational patent search paradigm</article-title>
          .
          <source>In Proceedings of the 1st ACM workshop on Patent information retrieval</source>
          ,
          <source>PaIR '08</source>
          , pages
          <fpage>37</fpage>
          –
          <lpage>40</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brügmann</surname>
          </string-name>
          .
          <source>PATEXPERT project deliverable 8.1 - state of the art in patent processing</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>A three-phase method for patent classification</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>48</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1017</fpage>
          –
          <lpage>1030</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hakenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Plake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <article-title>Inter-species normalization of gene mentions with GNAT</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>16</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Divoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fundel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hakenberg</surname>
          </string-name>
          , et al.
          <article-title>Overview of BioCreative II gene normalization</article-title>
          .
          <source>Genome biology</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 2</issue>
          ):
          <fpage>S3</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Biró</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Törcsvári</surname>
          </string-name>
          .
          <article-title>A hierarchical online classifier for patent categorization</article-title>
          .
          <source>Emerging Technologies of Text Mining: Techniques and Applications</source>
          .
          <publisher-name>IGI Global</publisher-name>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trappey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trappey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Development of a patent document classification and search platform using a back-propagation network</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>31</volume>
          (
          <issue>4</issue>
          ):
          <fpage>755</fpage>
          –
          <lpage>765</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Macari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Torge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dietze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          .
          <article-title>A maximum-entropy approach for accurate document annotation in the biomedical domain</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          ,
          <volume>3</volume>
          :
          <fpage>S2</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vogel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>D'hondt</surname>
          </string-name>
          .
          <article-title>Patent classification experiments with the linguistic classification system LCS</article-title>
          .
          <source>In Proceedings of CLEF 2010, CLEF-IP Workshop</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] World Intellectual Property Organization.
          <source>World intellectual property indicators - 2012 edition</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] World Intellectual Property Organization.
          <source>World intellectual property indicators - 2013 edition</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>