<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Generation of Timestamped Patent Abstracts at Scale to Outsmart Patent-Trolls</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Hamborg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moustafa Elmaghraby</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corinna Breitinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bela Gipp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer and Information Science, University of Konstanz</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>s for any patent category and achieves high diversity in content and structure of the resulting abstracts. Furthermore, we timestamp the generated abstracts using a decentralized timestamping service so that users can prove that a generated abstract existed at a certain point in time. In a survey, we found that the quality of the generated abstracts, using criteria defined by the European Patent Office, was 6% higher compared to prior art.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Generation</kwd>
        <kwd>Timestamping</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A patent grants the inventor the right to prevent other parties from producing, using,
importing, or selling an invention without approval. Non-practicing entities (NPEs),
commonly known as patent trolls, use patents as a means of profit. Instead of
conducting research to advance products or methods, NPEs buy patents from other companies or
file obvious patents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that are worded in such a way that many use cases or approaches
are covered by the patent. NPEs then use such patents to litigate alleged infringements.
Usually, NPEs threaten a company with a costly lawsuit unless the company
agrees to pay a settlement or a licensing fee. Companies threatened with lawsuits often
choose to settle with an NPE, even if they did not (intentionally) infringe, because patent
litigation is extremely expensive. Median attorneys’ fees range from $0.3m to $12.5m
per lawsuit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The actions of NPEs can drain companies’ resources [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as they attempt
to defend themselves against NPE litigation. In some cases, the costs of these proceedings can
amount to millions of dollars [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. Such acts have harmed companies of all sizes,
ranging from startups to huge corporations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Our main research question is whether an automated approach can
help prevent NPEs from pursuing their damaging practice of filing
obvious patents. This motivates our goal of implementing a system that automatically
generates obvious patent abstracts at scale. Such abstracts must additionally be
syntactically and grammatically sound. While a patent consists of multiple components, such
as a classification into categories, figures, and so-called claims that define the limits of
what is protected by the patent, we chose to generate patent abstracts because they
summarize the invention.</p>
      <p>
        In Section 2, we provide an overview of state-of-the-art systems capable of
generating patents and the techniques they use. In Section 3, we describe our abstract generation
method. In Section 4, we evaluate the performance of our approach in a survey using
criteria defined by the European Patent Office (EPO) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Previous work</title>
      <p>
        Existing approaches generate grammatically accurate patent abstracts given a suitable
learning dataset. However, we identified the following shortcomings of existing
solutions: (1) high specialization: patents can typically only be generated for a single
category [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; (2) lack of diversity: the sentence structure features no variation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; (3)
limited accessibility: existing solutions are not open source or not free of charge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; (4) missing timestamping:
no secure mechanism is provided for later proving the time of existence of generated
patents [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]–[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; and (5) nonsensical semantics [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        All Prior Art (APA) uses an approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that generates patent abstracts using an
algorithm that merges existing abstracts to create new obvious patent
abstracts. These abstracts are then published under a Creative Commons license, which
is intended to prevent similar obvious ideas from being filed as patents. The generated patents feature no
trusted timestamp that could verify their precedence. Also, the abstracts are not matched
against later-filed patents. The generated texts are syntactically correct, but their semantic
quality is lacking, which makes them nonsensical.
      </p>
      <p>
        Cloem is a company [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
that creates variants of patent claims, called cloems. The generated claims can be
published to keep potential competitors from attempting to file similar patent claims. This
is achieved through multiple specialized parsers for patent claims. In addition, Cloem
uses proprietary dictionaries created with the aid of WordNet, Wikipedia, and data
derived from the analysis of 70m patents. The details of the algorithms are
undisclosed. Cloem timestamps and publishes the generated patents on its website. We
found the semantic quality of the generated patents to be higher than that of APA;
however, Cloem is a paid service.
      </p>
      <p>
        “Transform any text into a patent application” is an
open-source method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] that transforms a given text into a patent application. The idea
is to find common grammatical structures in patent applications and then to extract
sentences containing similar structures from the input text. This is done by analyzing
the sequences of part-of-speech (POS) tags in patents and then searching for the most similar
sequences in the input text. The system produces titles, abstracts, and descriptions with
correct grammar. However, the system only accepts text with a specific POS structure;
otherwise, it cannot generate a patent. Also, all generated patents are structurally similar.
      </p>
      <p>The currently available approaches generate grammatically correct patent abstracts
but suffer from practical limitations: they are tailored to a single category or not free
of charge, and the generated abstracts suffer from poor language quality or poor semantics.
Hence, we identify the following requirements to improve the state of the art in
generating patent abstracts. First, the approach must be generic, i.e., the workflow should be
able to generate patents for any given category, instead of being tailored to only one
category. Second, the approach must generate grammatically and semantically correct
patents of sufficiently high quality. Third, the resulting patents must be unique, i.e., the
patent abstracts must be sufficiently different from their sources.</p>
    </sec>
    <sec id="sec-3">
      <title>Patent abstract generation</title>
      <p>We describe our method for patent abstract generation following the workflow shown
in Figure 1. The patent abstract generation process starts with the user requesting a
patent category. The second task, abstract generation, generates a patent abstract for
the requested category. Finally, the system exports the generated abstracts to a database
and timestamps them. We describe the process in more detail later in this section.</p>
      <p>Fig. 1. Overall workflow. Figure labels: One-time process (Web Crawling; Preprocessing: Data Extraction, POS Tagging; Dataset) and Patent abstract generation (Patent category; Abstract Generation: POS-based Replacement, Grammar Correction, Replacement Rules; Export: Timestamping, Store in DB).</p>
      <sec id="sec-3-2">
        <title>Patent abstract generation</title>
        <sec id="sec-3-2-4">
          <title>Export</title>
          <p>
            In a one-time or regularly repeated process, the system performs web crawling to
gather patents from a patent office, which are later used to generate new patents. We
utilize patents filed with the USPTO [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] because its database contains over 2.3m
patents, which can be crawled at no cost. The next task is preprocessing the patents. For
each patent’s URL, we extract the title, abstract, category, publication date, and
inventors from the HTML data. We then perform POS tagging using Stanford CoreNLP.
          </p>
          <p>The abstract generation consists of three subtasks: (1) POS-based replacement to
generate new abstracts, (2) grammar correction, and (3) further replacement
rules to improve the language quality of the texts. Our method randomly selects one
patent abstract of the requested category from our dataset, called the template abstract. All
other patents in the dataset belonging to the same category are called patent candidates.</p>
          <p>The POS-based replacement task replaces all nouns and verbs of the template
abstract, hereafter called tokens, with nouns and verbs from the patent candidates.
Specifically, for each token in the template abstract that shall be replaced, we determine the
token’s relative frequency within the patent candidates. We then replace the token from
the template abstract with a token retrieved from the patent candidates that has the same
or most similar relative frequency. This way, we improve the semantic soundness of
the resulting abstract, since such tokens are more likely to be interchangeable. If there are
multiple candidate tokens with the same frequency, we sample one randomly.</p>
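          <p>The frequency-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names relative_frequencies and pick_replacement are hypothetical, the candidate tokens are assumed to be already restricted to one POS class, the template token itself is assumed to be excluded from the candidates, and a template token that never occurs in the candidates is assumed to have a relative frequency of zero.</p>
          <preformat>
```python
import random
from collections import Counter


def relative_frequencies(tokens):
    """Map each distinct token to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def pick_replacement(token, candidate_freqs, rng):
    """Return the candidate whose relative frequency is closest to that of `token`."""
    # Relative frequency of the template token within the candidates
    # (assumption: 0.0 if the token never occurs there).
    target = candidate_freqs.get(token, 0.0)
    # Assumption: the token itself is excluded so that it is actually replaced.
    others = {t: f for t, f in candidate_freqs.items() if t != token}
    if not others:
        return token  # no alternative available, keep the original token
    best_gap = min(abs(f - target) for f in others.values())
    # Several candidates may be equally close; sample one of them at random.
    ties = sorted(t for t, f in others.items() if abs(f - target) == best_gap)
    return rng.choice(ties)


# Toy candidate nouns, made up for illustration (not taken from the dataset).
candidate_nouns = ["system", "system", "method", "method", "device", "signal"]
noun_freqs = relative_frequencies(candidate_nouns)
replacement = pick_replacement("device", noun_freqs, random.Random(0))
```
          </preformat>
          <p>In this toy example, "device" and "signal" each account for one sixth of the candidate nouns, so "signal" is the only candidate with the same relative frequency and is selected. Passing a seeded random generator keeps the tie-breaking reproducible.</p>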
          <p>
            The grammar correction task fixes the tense of the replacing verbs and the plurality
of nouns. We use SimpleNLG, which is a natural language generation (NLG) library
that comes with a default lexicon covering many commonly used English words [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
However, our experiments with medical patents showed that the default lexicon is
insufficient to cover the wide range of nouns and verbs used in medical patents. Thus, we
additionally use the Specialist Lexicon [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which covers general English terms and
medical terminology. We adjust the tense of the replacing verb to match the tense of the
replaced verb and, using rules we devised, do the same for the plurality of nouns.
          </p>
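          <p>The idea of matching the inflection of the replacing word to that of the replaced word can be illustrated with a small sketch. It does not use SimpleNLG or the Specialist Lexicon; instead, it swaps in naive regular-inflection rules that are assumptions made for illustration. The function name inflect is hypothetical, the tags are assumed to be Penn Treebank POS tags, and irregular forms and consonant doubling are deliberately not handled.</p>
          <preformat>
```python
def inflect(lemma, tag):
    """Inflect a base-form word to match the Penn Treebank tag of the word it replaces."""

    def add_s(word):
        # Plural nouns and third-person singular verbs share the regular -s/-es rule.
        if word.endswith(("s", "x", "z", "ch", "sh")):
            return word + "es"
        if word.endswith("y") and word[-2:-1] not in "aeiou":
            return word[:-1] + "ies"
        return word + "s"

    if tag in ("NNS", "VBZ"):
        # Plural noun or third-person singular present verb.
        return add_s(lemma)
    if tag == "VBG":
        # Gerund: drop a single trailing "e" (store -> storing), keep "ee".
        if lemma.endswith("e") and not lemma.endswith("ee"):
            return lemma[:-1] + "ing"
        return lemma + "ing"
    if tag in ("VBD", "VBN"):
        # Regular past tense or past participle.
        if lemma.endswith("e"):
            return lemma + "d"
        if lemma.endswith("y") and lemma[-2:-1] not in "aeiou":
            return lemma[:-1] + "ied"
        return lemma + "ed"
    # Singular nouns (NN) and base-form verbs (VB, VBP) stay unchanged.
    return lemma


plural_noun = inflect("category", "NNS")
gerund = inflect("store", "VBG")
```
          </preformat>
          <p>A realisation engine backed by a lexicon avoids the errors that such naive rules make for irregular words, which is why the approach described above relies on one.</p>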
          <p>We apply further replacement rules to improve the language quality. We found that
almost all abstracts start with a sentence containing a type-defining noun, such as
“[Techniques, methods, an apparatus] [are, is] disclosed for […]”. We observed that
replacing the first noun with another noun decreases the semantic quality of the
generated patent abstract, so we chose not to replace the first noun, since it best fits the
template abstract. As we will show in Section 4, this functionality is one reason
why our approach achieves better semantic quality than the reviewed
approaches. Also, we do not replace auxiliary verbs in the first sentence, since they
accompany the main verb and are not category-specific. To ensure semantic soundness,
our method always replaces a word that occurs multiple times with the same replacement word.</p>
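          <p>The three rules above (keep the first noun, keep auxiliary verbs in the first sentence, and replace repeated words consistently) can be sketched as follows. The function name apply_rules, the list of auxiliary verbs, and the choose callback are hypothetical and introduced only for illustration. The sketch assumes Penn Treebank POS tags, and it assumes that later occurrences of the kept first noun are also left unchanged.</p>
          <preformat>
```python
from itertools import count

# Assumed, non-exhaustive list of auxiliary verbs (not taken from the paper).
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being",
               "has", "have", "had", "do", "does", "did"}


def apply_rules(tagged_sentences, choose):
    """Replace nouns and verbs in sentences given as lists of (word, tag) pairs."""
    mapping = {}  # lower-cased original word -> replacement, for consistency
    first_noun_kept = False
    result = []
    for index, sentence in enumerate(tagged_sentences):
        rewritten = []
        for word, tag in sentence:
            key = word.lower()
            is_noun = tag.startswith("NN")
            is_verb = tag.startswith("VB")
            if is_noun and not first_noun_kept:
                # Rule 1: keep the type-defining first noun of the abstract.
                first_noun_kept = True
                mapping[key] = word
            elif is_verb and index == 0 and key in AUXILIARIES:
                # Rule 2: keep auxiliary verbs in the first sentence.
                rewritten.append(word)
                continue
            if is_noun or is_verb:
                # Rule 3: a repeated word always gets the same replacement.
                if key not in mapping:
                    mapping[key] = choose(word)
                rewritten.append(mapping[key])
            else:
                rewritten.append(word)
        result.append(rewritten)
    return result


# Made-up input and a placeholder chooser that returns X1, X2, ...
sentences = [
    [("Techniques", "NNS"), ("are", "VBP"), ("disclosed", "VBN"),
     ("for", "IN"), ("storing", "VBG"), ("data", "NNS")],
    [("The", "DT"), ("data", "NNS"), ("are", "VBP"), ("stored", "VBN")],
]
counter = count(1)
rewritten = apply_rules(sentences, lambda word: "X" + str(next(counter)))
```
          </preformat>
          <p>In a full pipeline, the choose callback would be the frequency-based selection, followed by grammar correction. The placeholder chooser here only makes it visible that both occurrences of "data" receive the same replacement.</p>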
          <p>
            Finally, our method timestamps the abstract using OriginStamp [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], which is a
trusted timestamping service that runs on the Bitcoin blockchain. Trusted timestamping
is the process of keeping a tamper-proof and permanent record of the creation time of
documents. OriginStamp allows its users to prove that their timestamped data existed
at a certain point in time in a certain state by submitting a SHA-256 hash of the data to
the service. Users can then retrieve and verify the timestamps that have been committed
to the blockchain. Timestamping is a key component of our project, since it is the means
for proving the time of existence of the generated patent abstracts.
          </p>
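          <p>The hashing step that precedes timestamping can be sketched with Python's standard hashlib module. The sketch only computes and checks the SHA-256 digest locally. It does not call OriginStamp, and the function names abstract_digest and matches_digest as well as the choice of UTF-8 encoding are assumptions made for illustration.</p>
          <preformat>
```python
import hashlib


def abstract_digest(abstract):
    """Return the hex-encoded SHA-256 digest of the UTF-8 encoded abstract text."""
    return hashlib.sha256(abstract.encode("utf-8")).hexdigest()


def matches_digest(abstract, recorded_digest):
    """Check whether `abstract` is exactly the text that produced `recorded_digest`."""
    return abstract_digest(abstract) == recorded_digest


# Made-up example text, not an abstract generated by the system.
generated = "Techniques are disclosed for storing data."
digest = abstract_digest(generated)
```
          </preformat>
          <p>Because only the digest is submitted, the abstract itself does not have to be disclosed to the timestamping service. Any later change to the text, even a single added space, yields a different digest and therefore no longer matches the timestamped record.</p>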
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and discussion</title>
      <p>
        We conducted a survey to evaluate our method using the criteria for patent applications
defined by the EPO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and common NLG criteria [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. To this end, we randomly
sampled three abstracts from patents filed at the USPTO in January 2017, as well as three abstracts
generated by our method and three generated by APA. All abstracts belonged to the
category “data processing systems”, since APA only generates abstracts in this category. We
asked the participants, ten computer science students aged between 20 and 30, to first
read an introduction that explained the evaluation criteria. Participants were not told
that they were rating abstracts from different sources or that some of the abstracts
were automatically generated. The experiment was not time-constrained. Participants
were shown one abstract at a time and asked to rate each criterion on a Likert scale from
one (lowest quality) to six (highest quality). The NLG criteria were readability (Read),
accuracy (Acc), and usefulness (Use). The EPO criteria were inventiveness (Inv), i.e., the
degree of invention, application (App), i.e., whether the invention can be applied
industrially, novelty (Nov), i.e., whether the idea is new, and inventive step (InvS), i.e.,
how non-obvious the idea is. The setup does not allow a realistic assessment of novelty,
inventiveness, and inventive step, since a comprehensive study of prior art would be
required. However, we were still interested in these criteria to get insights on how
inventive the abstracts appeared to the participants.
      </p>
      <p>Table 1 shows that our method outperforms APA by 0.16 (6%) in the average total.
Its average score was also higher for all criteria except readability: the average
readability score of the APA patent abstracts (3.07) is slightly
higher than that of the abstracts generated by our system (2.93), a margin of 0.14. As
expected, the quality of real patent abstracts was rated higher than that of both generation
methods, specifically 0.56 (20%) higher than that of our method.</p>
      <p>To evaluate the consistency of the abstracts across all criteria, we also calculated the
variance of the scores given by study participants. Our system showed more consistent
performance than APA for readability, usefulness, inventiveness, application, novelty,
and inventive step. The variance was particularly low for usefulness (0.09) and
application (0.09). However, the variance for accuracy (0.21) is higher than that of APA (0.04).</p>
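      <p>The per-criterion aggregation can be sketched as follows. The ratings in the example are hypothetical values invented purely for illustration and are not the survey data. The function name summarize is also hypothetical, and the use of the population variance is an assumption, since the text does not state which variance estimator was used.</p>
      <preformat>
```python
from statistics import mean, pvariance


def summarize(ratings_by_criterion):
    """Return {criterion: (mean, population variance)} for lists of Likert ratings."""
    return {
        criterion: (mean(ratings), pvariance(ratings))
        for criterion, ratings in ratings_by_criterion.items()
    }


# Hypothetical ratings on the one-to-six scale, invented for illustration only;
# they are NOT the ratings collected in the survey.
ratings = {"Read": [2, 3, 4], "Use": [3, 3, 3]}
summary = summarize(ratings)
```
      </preformat>
      <p>A lower variance for a criterion means that the ratings were more consistent, independently of whether the average rating was high or low.</p>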
      <p>By manually inspecting random samples of the generated patents, we observed
that the semantic quality of our generated patent abstracts could vary widely,
depending on the length of the generated abstract. We also noticed a limited number of
grammatical mistakes involving specialized scientific or rarely occurring words. We
deduce that the main cause is the accuracy of the POS tagger. The diversity can be
improved by using more patent sources beyond the USPTO.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>We proposed a method that generates patent abstracts to address the problem of
non-practicing entities (NPEs) filing obvious patents. Our system introduces four main
improvements to the current state of the art. First, our system can generate abstracts for
any patent category. Second, the method performs trusted timestamping so that users
can prove that a generated abstract existed at a certain point in time. Third, the
generated abstracts score better overall than those of APA on the criteria for patent applications
defined by the European Patent Office. Fourth, the abstracts also score better overall than those of APA
on the criteria for natural language generation. We believe that the proposed
system is a first step towards limiting the high cost of NPEs abusing the patent system.</p>
      <p>
        Future improvements to our proposed system include publishing the generated
abstracts in a publicly available archive. Then, we will devise and implement a search
engine that captures obvious patent abstracts by measuring their similarity to previously
generated and published abstracts. We plan to measure the similarity between two
abstracts using semantic similarity measures [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Finally, the system should inform the
authors of detected obvious patents. We also plan to further investigate how we can
improve the semantic quality of the generated abstracts.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Fischer</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Henkel</surname>
          </string-name>
          , “
          <article-title>Patent trolls on markets for technology - An empirical analysis of NPEs' patent acquisitions</article-title>
          ,”
          <source>Res. Policy</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>1519</fpage>
          -
          <lpage>1533</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Barry</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Arad</surname>
          </string-name>
          , “
          <article-title>2016 Patent Litigation Study: Are we at an inflection point?</article-title>
          ,”
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Bessen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Meurer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Ford</surname>
          </string-name>
          , “
          <article-title>The Private and Social Costs of Patent Trolls</article-title>
          ,”
          <source>SSRN Electron. J.</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Muellin</surname>
          </string-name>
          , “
          <article-title>Famous patent 'troll's' lawsuit against Google booted out of East Texas</article-title>
          ,”
          <year>2017</year>
          . [Online]. Available: https://arstechnica.com/tech-policy/2017/02/famous-patent-trollslawsuit-against-google-booted-out-of-east-texas/. [Accessed: 06-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <collab>European Patent Office</collab>
          , “
          <article-title>Guidelines for Examination in the European Patent Office</article-title>
          ,”
          <year>2016</year>
          . [Online]. Available: http://www.epo.org/law-practice/legaltexts/html/guidelines/e/g_i_1.htm. [Accessed: 15-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Reben</surname>
          </string-name>
          , “
          <article-title>All Prior Art - Algorithmically generated prior art</article-title>
          .” [Online]. Available: http://allpriorart.com/. [Accessed: 01-Jan-2017].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lavigne</surname>
          </string-name>
          , “
          <article-title>Transform any text into a patent application</article-title>
          .” [Online]. Available: http://lav.io/2014/05/transform-any-text-into-a-patent-application/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <collab>Cloem S.A.S.U.</collab>
          , “
          <article-title>Cloem - reinventing creativity</article-title>
          ,”
          <year>2017</year>
          . [Online]. Available: https://www.cloem.com/. [Accessed: 06-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <collab>United States Patent and Trademark Office</collab>
          , “
          <article-title>Patents</article-title>
          .” [Online]. Available: https://www.uspto.gov/patent. [Accessed: 15-May-2017].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          , “
          <article-title>SimpleNLG: a realisation engine for practical applications</article-title>
          ,”
          <source>Proceedings of the 12th European Workshop on Natural Language Generation. Association for Computational Linguistics</source>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Browne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCray</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          , “
          <article-title>The specialist lexicon</article-title>
          ,”
          <source>Natl. Libr. Med. Tech. Reports</source>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Meuschke</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gernandt</surname>
          </string-name>
          , “
          <article-title>Decentralized Trusted Timestamping using the Crypto Currency Bitcoin</article-title>
          ,”
          <source>iConference 2015</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          , “
          <article-title>Task-based evaluation of NLG systems: Control vs real-world context</article-title>
          ,”
          <source>Proc. UCNLG+Eval Lang. Gener. Eval. Work.</source>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>F.</given-names>
            <surname>Hamborg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Meuschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          , “
          <article-title>Identification and Analysis of Media Bias in News Articles</article-title>
          ,” in
          <source>Proceedings of the 15th International Symposium of Information Science</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>