<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The UniGe Experiments on the Search for Earlier Patents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacques Guyot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilles Falquet</string-name>
          <email>gilles.falquet@unige.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karim Benzineb</string-name>
          <email>karim@simple-shift.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Center, University of Geneva - Route de Drize 7</institution>
          ,
          <addr-line>1227 Carouge</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Our goal was to retrieve all the patents related to a given patent (the topic) in a corpus of about two million patents. Each patent had, on average, fewer than 10 citations to find. In this experiment we used the classical cosine-based approach to compute similarity. From the original corpus we extracted the following elements for each patent: the Applicant and Inventor fields; the invention Title in English, French, and German (where available); the invention Abstract in the three languages (where available); and the invention Claims in the three languages (where available).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>As for the topics (i.e. the patents whose citations we were looking for), we kept all the fields.
Although the corpus was multilingual, we evaluated our tools in English only and in the XL mode (10,000 test
patents). We performed five runs and experimented with several term-weighting methods for the cosine-based
approach (TF*IDF, OKAPI, FAST). We also tested a filter on document length (since some
documents were very short) and a filter on the patent class.</p>
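      <p>As a minimal sketch of the cosine-based ranking with TF*IDF weights and a document-length filter, the following Python code may help fix ideas. It is not the implementation used in the experiments: the smoothed IDF formula, the function names (tfidf_vectors, cosine, rank), the min_length threshold, and the toy corpus are all assumptions made for illustration, and the OKAPI and FAST weightings are not shown.</p>
      <preformat>
```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF*IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch), always strictly positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(topic_tokens, corpus, min_length=3):
    """Rank corpus documents by decreasing cosine similarity to the topic.

    corpus maps a document id to its list of tokens. Documents with fewer
    than min_length tokens are dropped first (hypothetical length filter).
    """
    kept = {d: toks for d, toks in corpus.items() if len(toks) >= min_length}
    ids = list(kept)
    vectors, idf = tfidf_vectors([kept[d] for d in ids])
    # Topic terms unknown to the corpus cannot match anything and are ignored.
    topic_vec = {t: tf * idf[t] for t, tf in Counter(topic_tokens).items() if t in idf}
    scored = [(d, cosine(topic_vec, vec)) for d, vec in zip(ids, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus with made-up identifiers and tokens.
corpus = {
    "A": ["rotor", "blade", "wind", "turbine"],
    "B": ["chemical", "compound", "synthesis", "acid"],
    "C": ["rotor"],  # too short, removed by the length filter
}
ranked = rank(["wind", "turbine", "rotor", "blade", "design"], corpus)
```
      </preformat>
      <p>In this toy example, document A shares all of its terms with the topic and is ranked first, document B shares none, and document C is discarded by the length filter before scoring.</p>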
      <p>For the class filtering, we used an automated supervised classifier to assign one or several IPC categories (at the
“subclass” level) to the topic, and we built a catalog of the categories assigned to each corpus
document. The documents which were retrieved by the cosine method but which did not share any
category with the topic were filtered out. We obtained the following results:</p>
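      <p>To illustrate the class filter described in this section (independently of the reported results), a minimal Python sketch is given here. The classifier itself is not shown: the topic categories are assumed to be already predicted, and the function name class_filter, the document identifiers, the scores, and the catalog contents are hypothetical.</p>
      <preformat>
```python
def class_filter(ranked, topic_classes, catalog):
    """Keep only ranked documents sharing an IPC subclass with the topic.

    ranked: list of (doc_id, score) pairs, e.g. from a cosine ranking.
    topic_classes: IPC subclasses assigned to the topic by a classifier.
    catalog: dict mapping each doc_id to its set of IPC subclasses.
    """
    topic_classes = set(topic_classes)
    return [
        (doc_id, score)
        for doc_id, score in ranked
        # Documents missing from the catalog have no category and are dropped.
        if not topic_classes.isdisjoint(catalog.get(doc_id, set()))
    ]


# Hypothetical ranking and catalog; the subclass codes are only examples.
ranked = [("EP1", 0.9), ("EP2", 0.7), ("EP3", 0.4)]
catalog = {
    "EP1": {"F03D"},
    "EP2": {"C07C"},
    "EP3": {"F03D", "H02K"},
}
filtered = class_filter(ranked, {"F03D"}, catalog)
# filtered == [("EP1", 0.9), ("EP3", 0.4)]
```
      </preformat>
      <p>The filter preserves the order of the cosine ranking and only removes candidates, so it can raise precision but cannot recover a relevant document that the classifier or the catalog places in a different subclass.</p>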
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>