Experiments with citation mining and key-term extraction for Prior Art Search

Patrice Lopez and Laurent Romary
INRIA - Humboldt Universität zu Berlin - Institut für Deutsche Sprache und Linguistik
patrice.lopez@hotmail.com, laurent.romary@inria.fr

Abstract

This technical note presents the system built for the IP track of CLEF 2010, based on PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS), the modular search infrastructure initially realized for CLEF IP 2009. We largely reused the system of the previous CLEF IP, at a smaller scale, with three main components improved:

• A new citation mining tool based on Conditional Random Fields (CRF).
• A key-term extraction module developed for technical and scientific documents and adapted to patent document structures, using a vast range of metrics, features and a bagged decision tree.
• An improved version of our multi-domain terminological database, GRISP.

We used the Okapi BM25 and Indri retrieval models for the prior art task, and a KNN model for the automatic classification task under the IPC subclasses. In both tasks, specific final re-ranking techniques were used, including multiple regression models based on SVM. Although the Prior Art task was more challenging and we used a more limited number of retrieval models, we obtained results similar to last year's. We performed poorly, however, at the classification task, and we consider that an instance-based KNN algorithm is not competitive with standard classifiers based on preliminary large-scale training.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Patent, Prior Art Search, Citation mining, Key-term extraction, Regression models, Re-ranking, Automatic classification

1 From CLEF IP 2009 to CLEF IP 2010

Our main motivations for participating in CLEF IP are to advance the understanding of scientific and technical information and documents at large, to develop new solutions for managing the data deluge and the information overload in science, and to facilitate the exploitation and dissemination of patent information. CLEF IP is one of the rare evaluation events that makes it possible to tackle these problems. We focused our efforts this year on two main aspects: the quality of citation mining from the patent documents, and the extraction of key-terms in order to capture human-understandable descriptions of the main concepts of a patent. In addition, we further extended and consolidated our multilingual terminological database (GRISP, General Research Insight in Scientific and technical Publications) by integrating more knowledge sources and by driving the merging of concepts from the different sources with machine learning techniques. Regarding the overall architecture, we reused the framework developed for CLEF IP 2009, called PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS), with a more limited number of indexes. This note mainly describes the novel aspects of our work compared to last year's system. For a detailed description of the system, the reader is invited to consult our technical note of CLEF IP 2009 [Lopez and Romary, 2009]. In the following description, the collection refers to the data collection of approx.
2.6 million documents corresponding to 1.3 million European patents. This collection represents the prior art. The training set refers this year to the 200 training topic documents provided with judgements (the relevant patents to be retrieved). The prior art (PA) patent topics are the 2,000 patents for which the prior art search is done, and the classification (CL) patent topics are the 2,000 patents to classify.

1.1 Prior Art Searches

Following the first CLEF IP in 2009, the prior art task has been revised this year to coincide more closely with the actual prior art searches performed by patent examiners. The PA patent topics are normal unexamined applications (i.e. A1 or A2 publications) in only one language and without amendments of the description. The description of a granted patent publication, in contrast, often includes an acknowledgement of the most important prior art documents identified during the search phase. The topic documents are thus more challenging than last year's, because they offer less multilingual information and fewer document citations. A fully automated prior art search based on the existing search reports produced by the patent offices has inherent limitations related to patent families, to the influence of procedural aspects, to the limited search tools available to patent examiners, and to the absence of non-patent literature [Lopez and Romary, 2009]. We noted, however, two issues that could be addressed in a future edition of the evaluation forum:

• The missing patent application content for some PCT applications entering the European phase: the European Patent Office does not re-publish patent applications coming from the PCT phase, and it is thus more difficult to retrieve these documents than for a patent examiner, who typically searches the full application documents from the WO patent publications.

• The designation of the expected documents: the expected results this year were expressed as a list of patent publications (i.e. with a kind code) rather than simply as references to patent applications. Since an A publication, for instance, is always at least as relevant as the corresponding B publication (the scope of the B publication is always included in that of the initial application document), the other publications of the same patent application also needed to be considered relevant. We view this way of building the expected results as problematic, because a patent with many publications is repeated more often in the expected results than a patent with only one publication, and thus has a stronger positive impact on the retrieval score. More generally, this distinction between publications appears artificial, because the final choice of a particular publication as citation by a patent examiner is very subjective.

We would like to thank the organizers for the remarkable progress toward a realistic prior art task, which is very beneficial for the participants. The developed systems could already be profitable to the actual search work of thousands of patent examiners and patent information specialists. This evaluation framework has also started to offer a sound basis for analyzing experimentally the impact of particular techniques on patent collections.

1.2 Automatic classification

CLEF IP this year introduced an automatic classification task. A set of 2,000 patent documents had to be classified under one or several IPC subclasses (i.e. the first four characters of the IPC classification). The number of IPC subclasses is approx.
600. This classification task corresponds to what is usually called pre-classification [Krier and Zacc, 2002], where a patent application is routed to the appropriate general technical domain so that it can be processed by the technically competent examiners. Classification at deeper levels is significantly more challenging, as the complete IPC contains more than 60,000 subdivisions.

2 Advanced citation mining

2.1 The visible citation network

We observed last year a very strong impact of the interrelated cited patents on the retrieval results. Citation relations between patents through time are manifestations of technological improvements and evolutions. These relations can be exploited for connecting a new patent application to a potentially relevant subset of the patent collection. The first kind of citations are those present in the search reports established by the patent examiners. This information is immediately exploitable because it is fully specified in the MAREC format (i.e. the XML format of the patent documents used in CLEF IP). Table 1 presents an overview of the citations available in the search reports. Only the subset of EP citations corresponds to documents present in the collection. From a citation to a non-European patent, it is possible to obtain the possible European version using patent family information. A patent family gathers all the different versions of a patent application among the different geographical areas. The EPO offers a web service (Open Patent Service, OPS) giving access to the INPADOC database, which makes it possible to retrieve the possible European application of a patent family given a non-European patent number. This service is, however, slow and limited by a fair use agreement. While its use cannot be envisaged for the large number of patent references present in the collection, we carried out a family lookup for the patent topic set.

Source                 Authority   #
Search Report          all         4 198 873
                       EP          898 206
Description            all         6 257 511
                       EP          890 754
Description + Family   EP          not processed

Table 1: Overview of citation relations in the patent collection.

2.2 Increasing the citation density

A scientific and technical work is often a contribution to previously existing works. Acknowledging and referring to previous realizations and documents is therefore an inherent characteristic of any scientific or technical document, including patent documents, and appears important to address. Following the EPO's statistics, independently of the citations present in the search reports, the description body of a patent application contains on average 9 citations introduced at the initiative of the applicant: 7.5 references to other patents and 1.5 references to non-patent literature. These citations correspond to the applicant's view of the state of the art and answer a legal constraint (Rule 27(b) of the EPC, European Patent Convention). It is thus important for a patent examiner to evaluate these documents and possibly cite some of them in the search report. A patent document can contain several hundred such references, while the number of citations in a search report is rarely more than ten. Accurately extracting these references can provide useful information for starting a search and for understanding the key aspects of an application. The difficulty of this extraction task lies in the strong variability of contexts and patterns. Last year, we used a basic set of regular expressions for extracting patent citations in patent text bodies.
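For illustration, patterns of the following kind can capture some frequent surface forms of patent citations; these are hypothetical reconstructions in Python, not the actual rules of the 2009 system:

```python
import re

# Hypothetical examples of hand-written citation patterns; each one targets a
# common surface form of a patent reference in a description body.
US_SERIAL = re.compile(r"U\.?S\.?\s+Ser(?:ial)?\.?\s+No\.?\s*(\d{2}/\d{3},?\d{3})")
US_PATENT = re.compile(r"U\.?S\.?\s+Pat(?:ent)?\.?\s*(?:No\.?)?\s*([\d,]{7,11})")
EP_PUB    = re.compile(r"\bEP[- ]?(\d{6,7})\s*([AB]\d)?")

text = ("A general method that assesses the gross increase of HCV virus in "
        "culture is disclosed in U.S. Ser. No. 08/221,816 to Miles et al.")
for pattern in (US_SERIAL, US_PATENT, EP_PUB):
    for match in pattern.finditer(text):
        print(match.group(0))
```

Such rules are brittle: every unforeseen variation (abbreviations, reformulations, OCR noise) produces a miss, which motivates the move to sequence labelling described below.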
The regular expressions were created based on a set of approx. 50 patterns of patent citations. Some analysis showed that we were missing at least 40% of the citations and that more advanced techniques were necessary.

  "Compounds can exhibit anti-hepatitis C activity by inhibiting viral and host
  cell targets required in the replication cycle. A number of assays have been
  published to assess these activities. A general method that assesses the gross
  increase of HCV virus in culture is disclosed in U.S. Ser. No. 08/221,816 to
  Miles et al. In vitro assays have been reported in Lohmann et al, J. of Biol.
  Chem., 274:10807-10815, 1999. A cell line, [...]"

  1. Extract:        "U.S. Ser. No. 08/221,816"
  2. Parse:          type: application, issuing auth.: US, number: 08/221816
  3. Consolidate:    type: patent, issuing auth.: US, number: 5738985
  4. Family lookup:  type: patent, issuing auth.: EP, number: 0693126

Figure 1: Example of patent reference extraction, parsing, consolidation and family lookup.

The new patent reference extraction module performs the following processing steps, as illustrated by Figure 1:

1. Identification of reference strings: the text body is first extracted from the patent document. The patent reference blocks are then identified in the text body by a specific linear-chain CRF model.

2. Parsing and normalization of the extracted reference strings: the reference text is parsed and normalized in order to obtain a set of bibliographical attributes. References to patents are parsed and normalized in one step by a Finite State Transducer (FST), which identifies (i) whether the patent is referred to as a patent application or a patent publication, (ii) a country code, (iii) a number and (iv) a kind code.

3. Consolidation with online bibliographical services: different online bibliographical services are then accessed to validate and enrich the identified reference. For patent references, we use OPS (Open Patent Service, http://ops.espacenet.com), a web service provided by the EPO for accessing the Espacenet patent databases. This step makes it possible, for instance, to retrieve the patent number corresponding to a cited patent application number.

4. Family lookup: for the citations extracted from the patent topics, in case the citation is a non-European patent, we access OPS for patent family information and try to identify the corresponding European patent.

The CRF model has been trained on 200 patent documents corresponding to approximately 2,000 patent citations. In [Lopez, 2010], we evaluated the f-score of the extraction of patent reference blocks at 0.9540 based on a manually annotated corpus of patents from different sources, while the previous state of the art was around 0.75. In 97.2% of the cases, we were then able to correctly parse the citation block and identify the correct patent attributes. The tool is also able to extract non-patent literature references with a specific CRF model, to parse the extracted references in order to identify a set of 12 bibliographical attributes (author, title, journal, date, etc.), and to consolidate the result with an access to CrossRef. Although potentially very relevant to the Prior Art task, in particular in technical domains such as biotechnology, computing and chemistry, this functionality has not been used in the present work because of time and processing power constraints.
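Our tool is described in [Lopez, 2010]; the following is only a minimal, self-contained sketch of how a linear-chain CRF tagger for patent reference blocks can be set up, here with the third-party sklearn-crfsuite library (not the toolkit we used) and a toy BIO-annotated example:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    """Features for one token; a small subset of what a real model would use."""
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_digit": t.isdigit(),
        "is_all_caps": t.isupper(),
        # Word shape, e.g. "08/221,816" -> "dd/ddd,ddd"
        "shape": "".join("d" if c.isdigit() else "X" if c.isupper() else c
                         if not c.isalpha() else "x" for c in t),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Toy training data; B-REF/I-REF mark the patent reference block.
sents = [["disclosed", "in", "U.S.", "Ser.", "No.", "08/221,816", "to", "Miles"]]
labels = [["O", "O", "B-REF", "I-REF", "I-REF", "I-REF", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```

Unlike the fixed patterns, the CRF scores whole label sequences from contextual features, which is what allows it to generalize over the variability of citation contexts.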
The results of these extractions are presented in Table 1 for the collection and in Table 2 for the set of topic patents.

Source                 Authority   #
Search Report          all         0
                       EP          0
Description            all         18 876
                       EP          2 946
Description + Family   EP          7 706

Table 2: Overview of citation relations in the set of topic patents.

3 Key-term extraction of patent documents

3.1 Approach

Key terms (or keyphrases, or keywords) provide general information about the content of a document. Key-terms constitute good topic descriptions of documents, which can be used in particular for information retrieval, automatic document clustering and classification. Among the terms extracted for a given scientific document in a given collection, which key terms best characterize this document? Our work is based on the system realized for Semeval 2010, task 5, Automatic Keyphrase Extraction from Scientific Articles [Lopez and Romary, 2010b]. Candidate phrases up to 5-grams are extracted from the textual content of the document. Phrases beginning or ending with a stopword are discarded. The ability of a candidate phrase to be considered a key-term is estimated in a supervised manner by a bagged decision tree, based on the key-terms selected by the authors and the readers of the training documents. The advantage of using examples annotated by authors and readers for selecting the key-terms is that the resulting extracted topic description remains comprehensible for a human. The machine learning algorithm uses three sets of features:

• a first set of structural features characterizing, for each candidate, the position of the term with respect to the document structure: presence in the title, in the abstract, in the introduction, in at least one section title, in the conclusion, etc.; the relative position of the candidate phrase in the document is also used,

• a second set of content features which try to capture the distributional properties of a term relative to the overall textual content of the document where it appears, or to the collection. For this we use a set of metrics: the Generalized Dice Coefficient (GDC) introduced by [Park et al., 2002], TF-IDF, and the frequency with which the candidate phrase is selected as a key-term in the global corpus,

• finally, a set of lexical/semantic features produced by exploiting our multilingual terminological database GRISP and Wikipedia.

We further applied a post-ranking based on the statistics observed on the HAL research archive (Hyper Article en Ligne, the French institutional repository for research publications, http://hal.archives-ouvertes.fr). HAL contains approx. 139,000 full-text articles described by a rich set of metadata, often including the authors' keywords. In Semeval 2010, we achieved an f-score of 27.5 for the top 15 key-terms. This level of performance must be appreciated knowing that the expected key-terms used for the evaluation were a relatively small and subjective selection made by the authors and the readers.

The features have been adapted from this initial implementation for scientific articles to patent publications. The structural features were changed to use the available structural tags of the MAREC XML format. TF-IDF scores were computed on the whole patent collection. Finally, a set of 120 patent documents with annotated keywords was used to retrain the bagged decision tree.
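To make the candidate-generation step concrete, here is a minimal sketch (not the production code) of the extraction of candidate phrases up to 5-grams with stopword filtering at the boundaries, together with the TF-IDF content feature; in the actual system, this score is only one feature among the structural, content and lexical/semantic features fed to the bagged decision tree:

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "is", "by", "with"}

def candidates(text, max_len=5):
    """Yield all 1- to 5-gram phrases that neither start nor end with a stopword."""
    tokens = re.findall(r"[a-z][a-z-]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                yield " ".join(gram)

def tf_idf(doc, collection):
    """TF-IDF of each candidate phrase of `doc` w.r.t. a document collection."""
    tf = Counter(candidates(doc))
    df = Counter()
    for other in collection:
        df.update(set(candidates(other)))
    n = len(collection)
    return {p: f * math.log(n / (1 + df[p])) for p, f in tf.items()}
```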
3.2 Extraction results

Table 3 gives an example of the key-term extraction for the patent publication EP0381288A1. The score associated with a phrase estimates the extent to which the phrase can be viewed as a key-term for the document. We can see that this extraction corresponds at the same time to a topic model of the document and to a human-understandable summary of its key content, close to the usual keywords attributed to a scientific or technical article; it can also be viewed as a set of synthetic queries for which the document itself is relevant.

word index                  0.298805    structural ambiguity                 0.0611575
syntactic interpretations   0.226326    word-specific ambiguity              0.0611575
governor                    0.21987     interpretation index                 0.060762
parsing process             0.193613    parsing analyses                     0.0565435
word sequence               0.192174    representing the rank                0.0550731
choice point                0.173721    multiple syntactic interpretations   0.0525496
intermediate nodes          0.1588      natural language word                0.0520414
tree structures             0.152594    syntactic relation                   0.0501529
multiple analyses           0.152235    tree                                 0.0494087
index                       0.149352    interpretation                       0.0492798
syntactic network           0.147376    word in the sequence                 0.0492702
maximum phrases             0.143614    phrases                              0.0488179
dependency grammar          0.142967    dependency                           0.0470071
top node                    0.131585    parsing algorithm                    0.0467525
consistency check           0.129689    words of the sequence                0.0460244
word                        0.126933    analyses                             0.0428058
language word               0.125774    sentence                             0.0417408
grammar                     0.122467    programming language                 0.0415976
parser                      0.11987     natural language                     0.0403096
syntactic                   0.0943424   multiple syntactic                   0.0394109
parsing                     0.0787452   dependent                            0.0384868
pointer node                0.0656035   choice                               0.038086
unambiguously coding        0.0645593   ambiguity                            0.0371675

Table 3: Example of key-term extraction for document EP0381288A1.

This processing scaled well to the whole collection of documents, since the extraction took on average 0.7 seconds per patent application on low-end hardware.

4 Extension of GRISP

GRISP (General Research Insight in Scientific and technical Publications) is a multilingual terminological database based on the principles of ISO 16642 (TMF, Terminological Markup Framework) [Romary, 2001], a generic onomasiological (concept-to-word) model. This conceptual framework facilitates the combination of heterogeneous specialist resources in different languages. [Lopez and Romary, 2010a] presents the overall framework, the different technical and scientific resources which have been combined, and the use of a machine learning approach for deciding when to merge two concepts coming from different resources into a single, enriched concept. Compared to the GRISP version used in 2009, ChEBI (http://www.ebi.ac.uk/chebi) has been integrated. ChEBI is a freely available dictionary of molecular entities developed at the European Bioinformatics Institute [Degtyarenko et al., 2008]. ChEBI is a valuable source of chemical vocabulary, with approx. 42,000 concepts, 97,000 terms, 28,000 semantic relations and multilingual terms in 5 languages. In addition, we updated the partial Wikipedia resources with the latest 2010 XML dumps.

5 Overall Description of the Prior Art System

5.1 System architecture

Figure 2 gives an overview of the realized system. The system is close to the one realized for CLEF IP 2009, but with a more limited number of retrieval models and a redefined phrase retrieval model based on the key-terms extracted during the preprocessing of the whole collection and of the topic patents.
[Figure 2: System architecture overview of PATATRAS, "CLEF IP 2010" vintage, for query processing. The topic patents are tokenized, POS-tagged, key-term extracted and concept-tagged in order to build lemma queries (en, fr, de), phrase queries (en) and concept queries; these are run with the Okapi BM25 (Lemur 4.9) and Indri 2.9 retrieval models against the lemma (en, fr, de) and concept indexes, with the patent metadata stored in a relational database; the five ranked result lists, restricted to the initial working sets, are merged and post-ranked to produce the final ranked results. The arrows represent the main data flow from the patent topic to the final set of ranked results.]

5.2 Document preprocessing

The document preprocessing is similar to the previous year's, with two differences: the addition of the new citation mining and key-term extraction described in sections 2 and 3, and the absence of a systematic extraction of all phrases. The preprocessing results in particular in a database storing all the metadata of the collection, including the newly extracted citations and the key-terms. A few metadata fields were normalized: inventor and applicant names, as last year, and a particular effort was made this year on cleaning and normalizing the IPC and ECLA classes. The concept tagging based on the controlled terminology of GRISP is similar to last year's. The concept disambiguation was still realized on the basis of the ECLA classes (or, by default, the IPC classes) of the processed patent.

5.3 Indexes

The four following indexes were built using the Lemur toolkit [lem, 2001-2010] (version 4.9):

• For each of the three languages (English, French, German), we built a full index at the lemma level.

• A crosslingual concept index was built using the list of concepts identified in the textual material for all three languages.

As last year, we did not index the collection document by document, but considered a "meta-document" corresponding to all the publications related to a patent application.

5.4 Retrieval models

We used the two following well-known retrieval models:

• The Okapi weighting function BM25 (K1 = 1.5, b = 1.5, K3 = 3).

• Indri.

Although KL-divergence with Jelinek-Mercer smoothing (λ = 0.4) was the best performing retrieval model last year, it is also the most time- and resource-consuming retrieval algorithm. As our development timeframe was relatively limited this year, we did not submit runs including the results of this retrieval model.

These models were applied to the four indexes above, resulting in the production of five lists of retrieval results for each topic patent: BM25 on the three lemma indexes and on the concept index, and Indri on the English lemma index with phrase queries. As last year, the queries for the lemma and concept representations were built from all the available textual data of a topic patent.

There are many possibilities for exploiting a topic representation based on key-term extraction. For instance, in the context of language model information retrieval, [Zhou et al., 2007] uses a set of extracted keyphrases for building a topic signature language model used in a semantic smoothing method. We applied in this work a much simpler approach, which can be viewed as a baseline. We used the Indri retrieval model applied to the English lemma index and built queries mixing phrases and single-word terms. Due to the limit on the number of phrases in a query that can be processed in reasonable time, we limit the number of multi-word key-terms to a constant N and add the remaining phrases as individual words. For instance, for a list of n key-terms (t_p, s_p), where s_p is the score associated with the term t_p and each term is formed of words t_p = (w_p,i)_i, we build the query as follows:

  #weight( s_0 #1(t_0)  s_1 #1(t_1)  ...  s_N #1(t_N)  ...  s_p w_p,i ...  s_n w_n,i )

that is, the N best-scored key-terms are kept as exact phrases, using Indri's #1 (ordered window) operator, and each remaining key-term contributes its individual words, all weighted by the score of the corresponding key-term. In our work, we limited the number N of phrases present in the Indri query to 4. Following this construction, an Indri query takes approximately 15 seconds to be processed.
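The exact query-generation code of PATATRAS is not published; the following sketch shows one straightforward way to assemble such a query string, assuming the key-terms are given as (term, score) pairs:

```python
def build_indri_query(key_terms, n_phrases=4):
    """Assemble the #weight query described above: the n_phrases best-scored
    multi-word key-terms are kept as exact phrases via Indri's #1 operator,
    the remaining key-terms contribute their individual words, each weighted
    by the score of its key-term (for single words, #1(w) and w coincide)."""
    parts = []
    ranked = sorted(key_terms, key=lambda ts: ts[1], reverse=True)
    for rank, (term, score) in enumerate(ranked):
        if rank < n_phrases and " " in term:
            parts.append(f"{score} #1({term})")
        else:
            parts.extend(f"{score} {word}" for word in term.split())
    return "#weight( " + " ".join(parts) + " )"

# Example with the top key-terms of Table 3:
print(build_indri_query([("word index", 0.298805),
                         ("syntactic interpretations", 0.226326),
                         ("governor", 0.21987)]))
```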
The baseline results of the different indexes and retrieval models are presented in Table 4, column (1). Given that this year a patent topic contains text in only one language (the main language of the application), the results presented in this table are restricted to the set of topics having text in the corresponding language, i.e. only 134 queries for French, 519 for German and 1,959 for English, out of the total of 2,000 patent topics. This restriction explains the high MAP results for the French and German indexes.

Model   Index     Language   (1)      (2)
BM25    lemma     en         0.0842   0.1628
BM25    lemma     fr         0.124    0.2185
BM25    lemma     de         0.1081   0.1869
Indri   phrase    en         0.0758   0.1597
BM25    concept   all        0.0655   0.1529
KL      lemma     en         0.0911   0.171
KL      lemma     fr         0.1309   0.2244
KL      lemma     de         0.1085   0.1887

Table 4: MAP results of the retrieval models, Prior Art task. (1) Base MAP, normal set (2,000 queries). (2) MAP with initial working sets, normal set (2,000 queries). The results for a language-dependent index are produced only for the patent topics having text in this language. The KL results were not part of the submitted runs; they were produced while preparing this technical note and are reported here for information.

The initial working sets have been created via an iterative process similar to last year's, exploiting the cited documents and the whole range of available metadata. The process could this time benefit from a larger number of citations extracted from the descriptions to seed the sets. Using these working sets reduces the search space, while the sets still contain approx. 75% of the expected documents. As one can see in Table 4, column (2), the initial working sets provide a significant improvement in retrieval precision, larger than the one observed last year. The working sets remain, however, slow to build, are based on manual and intuitive rules, and appear difficult to improve in terms of recall. We plan to replace the current algorithm by a machine learning approach which could drive the selection of interesting patent documents in a monotonic process rather than iteratively.

5.5 Merging of results

The merging of the five result sets was realized, as last year, with an SVM model using a set of 4,631 training patents. Due to lack of time, we did not exploit the additional topic set of last year (10,000 topics) and did not rebuild a specific model this year. As a result, the combination was not as effective as last year, but it still provided an improvement over the individual result sets.

5.6 Post-ranking and final results

We reused the final re-ranking model built for CLEF IP 2009. This re-ranking makes it possible, in particular, to boost the scores of the patents cited in the description of the topic patent and to exploit the ECLA classes, resulting in a significant improvement. The regression model was trained using the set of 4,631 training patents compiled for CLEF IP 2009.

Measure       small    large
MAP           0.2731   0.2645
Prec. at 5    0.4244   0.4209
Prec. at 10   0.3625   0.3482

Table 5: Evaluation of the official runs for the small (400 topic patents) and large (2,000 topic patents) topic sets.
The final results are presented in Table 5 and show accuracy comparable to last year's. Given that this year's prior art task was more challenging, as the topic patents were real application documents, and that we reduced the number of retrieval models and did not update our regression models for result merging and re-ranking, this result shows the positive impact of a high-quality extraction of the applicant's citations in the patent descriptions, and the potential of key-term extraction.

6 Automatic Classification task

As we started to prepare the classification task very late, we could not experiment with any algorithm requiring training on the document collection. We thus opted for an instance-based approach, more particularly a KNN algorithm, simply reusing the existing system built for the prior art task. We used the existing prior art search system to provide a list of ranked results for a given patent topic to be classified, together with the KNN implementation of WEKA [Witten and Frank, 2005], with N = 25 (a sketch of this approach is given at the end of this section). Such an algorithm could be developed and put in production in just a few hours.

Run              Metric       Score
patatras         MAP          0.5083
                 Prec. at 1   0.56
                 Prec. at 5   0.252
ssft CEC0 run7   MAP          0.7951
                 Prec. at 1   0.835
                 Prec. at 5   0.3662

Table 6: Evaluation of the official runs for the classification task, compared with the best system (Simple Shift).

Unfortunately, our system suffered from several implementation errors, which makes the interpretation of the results difficult. The final results are presented in Table 6, together with the best run. The difference between the two systems is very large. Even after correcting the implementation errors, we consider that an instance-based KNN algorithm is not competitive with state-of-the-art classifiers based on preliminary large-scale training, and a fortiori with the advanced system realized by Simple Shift.
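For reference, the instance-based approach amounts to the following voting scheme; this is a sketch of the idea, not our implementation (our run used WEKA's KNN), and `search` and `ipc_subclasses` are hypothetical stand-ins for the prior art engine and the metadata database:

```python
from collections import defaultdict

def knn_classify(ranked_results, subclass_lookup, k=25):
    """Instance-based IPC pre-classification: the k nearest neighbours are the
    top-k patents returned by the prior art engine for the topic document;
    each neighbour votes, weighted by its retrieval score, for its own IPC
    subclasses (the first four characters of its IPC symbols)."""
    votes = defaultdict(float)
    for patent_id, score in ranked_results[:k]:
        for subclass in subclass_lookup(patent_id):
            votes[subclass] += score
    # IPC subclasses ranked by accumulated, score-weighted votes.
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage:
# predictions = knn_classify(search(topic_document), ipc_subclasses, k=25)
```

The quality of such a classifier is bounded by the quality of the underlying retrieval: any topic for which the prior art search fails is also misclassified, which is consistent with the gap observed in Table 6.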
7 Future Work

We plan to focus our future efforts on the automatic recognition and exploitation of the structures of patent documents. The main goal is to improve the formulation of the queries and to build more specialized indexing processes. The recognition of entities of special interest, such as non-patent references and numerical values, is a second axis of future work which appears promising in certain technical domains such as biotechnology, chemistry and computer science.

References

[lem, 2001-2010] The Lemur Project, 2001-2010. University of Massachusetts and Carnegie Mellon University.

[Degtyarenko et al., 2008] Degtyarenko, K. et al., 2008. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res., 36:344-350.

[Krier and Zacc, 2002] Krier, M. and F. Zacc, 2002. Automatic categorisation applications at the European Patent Office. World Patent Information, 24:187-196.

[Lopez, 2010] Lopez, P., 2010. Automatic Extraction and Resolution of Bibliographical References in Patent Documents. In H. Cunningham, A. Hanbury, and S. Rüger (eds.), First Information Retrieval Facility Conference (IRFC). Vienna, Austria: Springer, Heidelberg.

[Lopez and Romary, 2009] Lopez, P. and L. Romary, 2009. Multiple retrieval models and regression models for prior art search. In CLEF 2009 Workshop, Technical Notes. Corfu, Greece. http://hal.archives-ouvertes.fr/hal-00411835.

[Lopez and Romary, 2010a] Lopez, P. and L. Romary, 2010a. GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains. In Seventh International Conference on Language Resources and Evaluation (LREC 2010). Valletta, Malta. Available at http://hal.inria.fr/inria-00490312.

[Lopez and Romary, 2010b] Lopez, P. and L. Romary, 2010b. HUMB: Automatic Key Term Extraction from Scientific Articles in GROBID. In SemEval 2010 Workshop. Uppsala, Sweden. Available at http://hal.archives-ouvertes.fr/inria-00493437.

[Park et al., 2002] Park, Y., R.J. Byrd, and B.K. Boguraev, 2002. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics.

[Romary, 2001] Romary, L., 2001. An abstract model for the representation of multilingual terminological data: TMF, Terminological Markup Framework. In TAMA (Terminology in Advanced Microcomputer Applications). Antwerp, Belgium. Available at http://hal.inria.fr/inria-00100405.

[Witten and Frank, 2005] Witten, I.H. and E. Frank, 2005. Data Mining: Practical Machine Learning Tools and Techniques. San Francisco: Morgan Kaufmann, 2nd edition.

[Zhou et al., 2007] Zhou, X., X. Hu, and X. Zhang, 2007. Topic signature language models for ad hoc retrieval. IEEE Transactions on Knowledge and Data Engineering, 1276-1287.