
The UniGe Experiments on the Search for Earlier Patents

Jacques Guyot

Gilles Falquet

gilles.falquet@unige.ch

Karim Benzineb

karim@simple-shift.com

Computer Science Center, University of Geneva, Route de Drize 7, 1227 Carouge, Switzerland

Our goal was to retrieve all the patents related to a given patent (the topic) in a corpus of about two million patents. Each patent had on average fewer than 10 citations to find. In this experiment we used the classical cosine-based approach to calculate similarity. From the original corpus we extracted the following elements for each patent: the Applicant and Inventor fields; the invention Title in English, French, and German (where available); the invention Abstract in the three languages (where available); and the invention Claims in the three languages (where available).
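As an illustration of the cosine-based approach, the following minimal sketch ranks corpus patents against a topic using TF*IDF weights, one of the weighting schemes tried in these experiments. It is not the system used for the runs: the whitespace tokenizer, the smoothed IDF formula, and the names `tokenize`, `build_idf`, `tfidf_vector`, `cosine`, and `rank` are all assumptions made for the example, and a corpus of two million patents would need an inverted index rather than a linear scan.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a naive lowercase/whitespace tokenizer, for illustration only.
    return text.lower().split()


def build_idf(tokenized_docs):
    """Compute a smoothed IDF value for every term in the corpus."""
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Build a sparse TF*IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(topic_text, corpus, top_k=10):
    """Return the top_k (patent_id, score) pairs, highest score first.

    corpus maps a patent id to the concatenated text of its extracted fields.
    """
    tokenized = {pid: tokenize(text) for pid, text in corpus.items()}
    idf = build_idf(list(tokenized.values()))
    query = tfidf_vector(tokenize(topic_text), idf)
    scores = [(pid, cosine(query, tfidf_vector(tokens, idf)))
              for pid, tokens in tokenized.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

In this sketch, each corpus entry would hold the concatenation of the extracted fields (applicant, inventor, title, abstract, claims), while the topic text would be built from all the fields of the topic patent.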

As for the topics (i.e. the patents whose citations we were looking for), we kept all the fields. Although the corpus was multilingual, we evaluated our tools in English only, in the XL mode (10,000 test patents). We performed five runs and experimented with several term-weighting methods for the cosine-based approach (TF*IDF, OKAPI, FAST). We also tested a filter on document length (since some documents were very short) and a filter on the patent class.
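The two filters can be pictured as post-processing steps applied to a ranked candidate list. The sketch below is only an assumption about how such filters might look: the function names, the `min_tokens` threshold, and the data layout (a list of `(patent_id, score)` pairs, a dict of document lengths, and a catalog mapping each patent id to its categories) are hypothetical and not taken from the experiments. The class filter here keeps a candidate only if it shares at least one category with the topic.

```python
def length_filter(candidates, doc_lengths, min_tokens=20):
    """Drop candidates whose indexed text is shorter than min_tokens.

    Assumption: very short documents are simply discarded; the threshold
    value is hypothetical.
    """
    return [(pid, score) for pid, score in candidates
            if doc_lengths.get(pid, 0) >= min_tokens]


def class_filter(candidates, topic_categories, catalog):
    """Keep candidates that share at least one category with the topic.

    catalog maps a patent id to the categories assigned to that document;
    a document missing from the catalog is treated as having no category.
    """
    topic_set = set(topic_categories)
    return [(pid, score) for pid, score in candidates
            if not topic_set.isdisjoint(catalog.get(pid, ()))]
```

A small usage example with IPC subclass codes:

```python
candidates = [("A", 0.9), ("B", 0.5), ("C", 0.4)]
catalog = {"A": {"F03D"}, "B": {"F03D", "H02K"}, "C": {"A61K"}}
kept = class_filter(candidates, ["F03D"], catalog)  # keeps A and B, drops C
```

Both functions preserve the order of the input list, so the ranking produced by the similarity step is left unchanged for the candidates that survive.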

For the class filtering, we used an automated supervised classifier to assign one or several IPC categories (at the “subclass” level) to the topic, and we built a catalog of the categories assigned to each corpus document. Documents that were retrieved by the cosine method but shared no category with the topic were filtered out. We obtained the following results: