Chemical and Biological Entity
                  Recognition System from Patent Documents
           Hongchang Lai                                             Shuo Xu                                     Lijun Zhu
       Information Technology                            Information Technology                        Information Technology
          Supporting Center,                                Supporting Center,                            Supporting Center,
Institute of Scientific and Technical                    Institute of Scientific and                   Institute of Scientific and
         Information of China                        Technical Information of China                Technical Information of China
No. 15 Fuxing Rd., Haidian District,                No. 15 Fuxing Rd., Haidian District,          No. 15 Fuxing Rd., Haidian District,
     Beijing 100038, P.R. China                        Beijing 100038, P.R. China                    Beijing 100038, P.R. China
          +86 10 5888 2447                                  +86 10 5888 2447                              +86 10 5888 2447
      laihc2013@istic.ac.cn                                   xush@istic.ac.cn                             zhulj@istic.ac.cn


Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
Published at Ceur-ws.org
Proceedings of the Second International Workshop on Patent Mining and its Applications (IPAMIN). May 27–28, 2015, Beijing, China.

ABSTRACT
It is crucial to explore the chemical and biological space covered by patent documents. To recognize chemical and biological entities, a recognition system is developed on the basis of open-source machine learning and natural language processing (NLP) toolkits. The processing pipeline consists of three major components: pre-processing (sentence detection and tokenization), recognition (a conditional random field (CRF) based approach), and post-processing (a rule-based approach). The paper introduces each part in detail. Finally, extensive experiments on the annotated chemical patent corpus are conducted, and the balanced F-measure is 69.20% with 10-fold cross validation. The results indicate that the performance on patent documents is slightly lower than that of counterpart systems on paper and news corpora.

Keywords
Conditional Random Field (CRF); Chemical and Biological Entity Recognition; Patent Mining; Cross Validation

1. INTRODUCTION
It is crucial to explore the chemical and biological space covered by patent documents. For example, it can help speed up early-stage medicinal chemistry activities [1] [2]. Although patent documents contain many valuable chemical and biological entities, such as chemical compounds, genes, proteins and drugs, automatic recognition systems for patent documents are still very limited.

For paper and news documents, in contrast, many identification approaches have been proposed and the resulting systems have been developed. In our opinion, the reasons are two-fold: (a) annotated patent corpora are not available to the public; (b) patents are complex legal documents which are very difficult to understand. But the situation will keep improving, since there is an increasing interest in patent mining, as shown by BIOINFORMATICS [3], BioCreative [4], JNLPBA [5] and iPaMin [6].

An annotated chemical patent corpus [8] was published by Akhondi et al., which enables the development of chemical and biological entity recognition systems. Even so, it is still a rather challenging task to automatically recognize chemical and biological entities in unstructured documents, especially patents [7], since patents are complex legal documents that can run to hundreds of pages.

In this paper we explore a chemical and biological entity recognition system for patent documents using approaches similar to those in [11]. Thus, one can see whether it is feasible to simply borrow methods developed for other kinds of documents. The rest of the article is organized as follows. Section 2 gives an overview of the annotated patent corpus. Section 3 introduces the recognition system and the methods we used. Section 4 describes the annotated corpus we used and the details of our experiments.

2. DATASETS OVERVIEW
Akhondi et al. have produced a gold standard chemical patent corpus in which 47 patents have been annotated by at least three annotators. The full-text patents and annotated entities are publicly accessible at www.biosemantics.org.

We analyzed the training and harmonized datasets and found some nested chemical and biological entities in the harmonized set. In our system, CRF++ is adopted as the actual implementation for the sequence labeling problem. Since CRF++ cannot identify nested entities, we simply omit the entities with the shorter spans. There are 1239 entities of the types "OCRERRORSPELL" and "OCRERRORLINE" in the original annotation corpus, and some of them are nested. In the end, we reduced the number of entities from 37,776 to 37,288 by removing 488 nested entities. The harmonized set was produced from the 47 common patents and includes a total of fourteen classes, 9857 unique terms and 37,288 annotated terms (see Table 1).
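The removal of nested entities described in Section 2 can be sketched as follows. This is only a minimal illustration, not the authors' code: the `Entity` record, the `drop_nested` helper and the `EXAMPLE_0001` annotations are hypothetical, and the sketch assumes that a "less spanned" entity is one whose character span lies entirely inside a longer annotation of the same file. When two annotations cover exactly the same span, the sketch keeps whichever comes first in the input, which is also an assumption.

```python
from collections import namedtuple

# Hypothetical record for one annotation: file, character offsets, class label, text.
Entity = namedtuple("Entity", ["file_id", "start", "end", "label", "text"])


def drop_nested(entities):
    """Keep only annotations that are not contained in an already kept one."""
    kept = []
    # Visit longer spans first so that a container is kept before its nested parts.
    # sorted() is stable, so annotations with identical spans keep their input order.
    ordered = sorted(entities, key=lambda e: (e.file_id, -(e.end - e.start), e.start))
    for ent in ordered:
        contained = any(
            k.file_id == ent.file_id and k.start <= ent.start and ent.end <= k.end
            for k in kept
        )
        if not contained:
            kept.append(ent)
    return sorted(kept, key=lambda e: (e.file_id, e.start))


annotations = [
    # One span carrying two labels (offsets as printed in the corpus example).
    Entity("US20050222261_0003", 4726, 4738, "D", "siruvastatin"),
    Entity("US20050222261_0003", 4726, 4738, "OCRERRORSPELL", "siruvastatin"),
    # Made-up file and offsets: a short entity nested inside a longer one.
    Entity("EXAMPLE_0001", 10, 27, "T", "dopamine receptor"),
    Entity("EXAMPLE_0001", 10, 18, "G", "dopamine"),
]

cleaned = drop_nested(annotations)
for ent in cleaned:
    print(ent.file_id, ent.start, ent.end, ent.label, ent.text)
```

The quadratic containment check is acceptable for a corpus of this size; an interval index would only be needed for much larger annotation sets.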
[Figure 1: flowchart. Annotated Chemical Patent Corpus -> Sentence Detection, Tokenization -> CRF-based Approach -> Rule-Based Approach -> Results]

Figure 1 The simplified system processing modules. The pipeline includes three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach) and post-processing (rule-based approach).
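The three-stage structure in Figure 1 can be sketched as a chain of small functions. The sketch below is a toy stand-in under stated assumptions, not the system itself: all function names are hypothetical, the sentence splitter and tokenizer are simple regular expressions rather than the OpenNLP components, and the recognition stage is replaced by a lexicon lookup where the real system uses a CRF tagger. The post-processing rule shown (extending a span over a missing closing parenthesis) is one example of the kind of rule the paper mentions.

```python
import re


def detect_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when an upper-case letter follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Fine-grained tokens: runs of letters, runs of digits, or single other characters."""
    return re.findall(r"[^\W\d_]+|\d+|\S", sentence)


def recognize(sentence, lexicon):
    """Placeholder for the CRF tagger: return (start, end) spans of lexicon hits."""
    spans = []
    for term in sorted(lexicon):
        for match in re.finditer(re.escape(term), sentence):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def post_process(sentence, spans):
    """Rule-based clean-up: extend a span over a missing closing parenthesis."""
    fixed = []
    for start, end in spans:
        mention = sentence[start:end]
        if mention.count("(") > mention.count(")") and sentence[end:end + 1] == ")":
            end += 1
        fixed.append((start, end))
    return fixed


def run_pipeline(text, lexicon):
    """Pre-processing, recognition and post-processing, applied sentence by sentence."""
    results = []
    for sentence in detect_sentences(text):
        tokens = tokenize(sentence)  # in a real system these tokens would feed the CRF
        spans = post_process(sentence, recognize(sentence, lexicon))
        entities = [sentence[s:e] for s, e in spans]
        results.append({"sentence": sentence, "tokens": tokens, "entities": entities})
    return results


sample = "The salt MgCl2 was added. Then poly(ethylene glycol) was used."
for row in run_pipeline(sample, {"MgCl2", "poly(ethylene glycol"}):
    print(row["entities"], row["tokens"])
```

Keeping each stage behind its own function means the lexicon lookup could be swapped for a trained tagger without touching the pre- or post-processing code.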


The results indicate that IUPAC (International Union of Pure and Applied Chemistry) names and generic names have been annotated far more often than any other chemical type. On the other hand, InChI (International Chemical Identifier) strings, CAS (Chemical Abstracts Service) registry numbers and SMILES (Simplified Molecular Input Line Entry Specification) strings appear rarely in the chemical patents. Since we removed the extra label tags of entities that have two or more tags, the counts differ slightly from those in [8].

Table 1 Number of annotated terms and unique terms in the harmonized set of the gold standard corpus after removing nested entities

Label            Description              Annotated Terms   Unique Terms
M                IUPAC                    13943             4592
I                SMILES                   20                20
Y                InChI                    0                 0
D                Trademark                2355              897
B                Abbreviation             2087              146
C                CAS number               6                 5
F                Formula                  1115              160
R                Registry Number          140               95
G                Generic                  8381              811
T                Target                   3221              654
Disease          Disease                  3765              1205
MOA              MOA                      1016              197
OCRERRORSPELL    Spelling error           1189              1029
OCRERRORLINE     Spurious line break      50                46
                 Total                    37288             9857

3. SYSTEMS DESCRIPTION AND METHODS
Based on a summary of the principal methods used in the MUC-4 (Fourth Message Understanding Conference) systems, Hobbs proposed a generic information extraction system [10] which consists of ten modules. It provides the theoretical basis, grounded in a large amount of practice, for our system. In addition, we refer to the chemical entity recognition system published in [11].

As shown in Figure 1, our system is a serial pipeline consisting of three major components. First, sentence boundaries are detected in the annotated chemical patents; the detected sentences are separated by tabs ("\t"), and each detected sentence is then split into tokens. Second, chemical and biological entities are extracted from the corpus with a CRF-based approach. A 10-fold cross validation is adopted to evaluate the performance of our recognition system. Finally, the post-processing step applies a rule-based approach. Each step is outlined in detail in the following subsections.

3.1 Pre-process: sentence detection, and tokenization
There are two kinds of document for each patent in the annotated chemical patent corpus: the original patent text and the entity annotations for each part. In the corpus, each patent is divided into several partitions, each containing a different part of the patent document. In general, the lines of each subdocument are irregular: a line may or may not be a single sentence. For example, in the document named US4659716_0001 of the training_set, some lines are metadata fields of the patent such as the title, abstract and inventors, while other lines contain two or more sentences.

In the system, the OpenNLP sentence detector is used. Detecting sentence boundaries is challenging because punctuation marks are ambiguous. To improve sentence boundary detection, we collected in advance the abbreviations used in the corpus, such as "var.", "e.g." and "sp.". The problem is especially acute in the annotated documents, where entities themselves contain full stops, for example "EC 3.4.24.11" or "MgCl2.6H2O". We then wrote several rules; for instance, if the current sentence ends with one of these abbreviations or with a comma, the current and subsequent sentences are merged into a new one. For the metadata fields mentioned above, each line is regarded as a sentence, because the metadata fields are shorter than other sentences and carry less information about the entities.

In the end, all the sentences were combined into one large document. Each line of the document is a subpart of the patent. The line format is as follows:
fileID    sentence. sentence.

Each line begins with the file ID of the source of the sentences, followed by one tab. Sentences are separated by a space (" ").

The tokenizer in the system is based on the OpenNLP toolkit. It can divide the sentences above into the reasonable tokens that
we need. However, the original tokenizer gives poor results and cannot be applied directly to the sequence labeling problem. We therefore made some improvements to obtain much more fine-grained tokens. For example, the entity "(S)-(-)-α,α-diphenyl-2-pyrrolidinemethanol" in US5650521_0003 has the entity type "M", which means "IUPAC".

Table 2 Examples of chemical component entity labels

token     …    (     S    )     -     (     -
label     O    B     I    I     I     I     I

token     )    -     α    ,     α     -     diphenyl
label     I    I     I    I     I     I     I

token     -    2     -    pyrrolidinemethanol     …
label     I    I     I    E                       O

As shown in Table 2, punctuation marks (brackets, dashes, etc.), Greek symbols and numbers are regarded as isolated tokens.

In the annotation documents, the types "OCRERRORSPELL" and "OCRERRORLINE" are marked at the end of each document. Meanwhile, the entities of these two types also carry their proper entity types, as in US20050222261_0003:
T109     D                4726 4738    siruvastatin
T343     OCRERRORSPELL    4726 4738    siruvastatin

However, some OCRERRORSPELL entities have only one type. That is, some of them are nested in other entity types, while others have a unique type label. For consistency, a single type label is given to each entity to get rid of nested types. There are 1239 entities of the types "OCRERRORSPELL" and "OCRERRORLINE" in the original annotation corpus, and some of them are nested. In the end, we reduce the number of entities from 37,776 to 37,288 by removing 488 nested entities.

3.2 Recognition: CRF-based approach
As mentioned above, the chemical and biological entity recognition problem is treated as a sequence labeling problem (Table 2). Conditional random fields, a framework for building probabilistic models to segment and label sequence data [13], avoid a fundamental limitation of MEMMs (maximum entropy Markov models) and other discriminative Markov models based on directed graphical models, and offer several advantages over hidden Markov models and stochastic grammars. A CRF can take context into account; e.g., the linear-chain CRF used in natural language processing predicts sequences of labels for sequences of input samples. Given observations X and random variables Y, the random variables Y are conditioned on X, and the conditional distribution p(Y|X) is then modeled. Since the resulting equations lend themselves to Newton-type methods, CRF++ adopts the L-BFGS (limited-memory BFGS (Broyden–Fletcher–Goldfarb–Shanno)) method for the unconstrained optimization in parameter estimation. In addition, CRF++ uses line search to compute the step size of the unconstrained optimization problem.

The annotated entities in the patent corpus can be classified into one of the fourteen classes (see Table 1).

A 4-tag scheme (B, I, E, O) is used to label the chemical entities; the tags mean "beginning of the entity", "word in the entity", "end of the entity" and "the other words", respectively. The nested annotated entities mentioned in Section 3.1 are unified to a single type, because CRF++ cannot process nested entities. The harmonized set, merged from the annotations of the 47 patents annotated by more than three groups, is used as the training set with different entity types (chemicals and their sub-entities, diseases, MOAs, and targets).

3.3 The features for CRF
Our system exploits four different types of features:

General linguistic features. Our system includes the original tokens, as well as stemmed tokens, as features, using the Porter stemmer from Stanford CoreNLP.

Characteristic features. Since many entities contain numbers, Greek letters, Roman numbers, amino acids, chemical elements and special characters, our system calculates several statistics as features for each token, including its number of digits, number of upper- and lower-case letters, total number of characters, and the presence or absence of specific characters, Greek letters, Roman numbers, amino acids or chemical elements.

Case pattern features. Similar to [12], each upper-case letter, lower-case letter and digit (0-9) is replaced by 'A', 'a' and '0', respectively. Moreover, our system also merges consecutive letters and numbers and generates additional single-letter 'a' and single-number '0' features.

Contextual features. For each token, our system includes the linguistic features of the two neighboring tokens on each side.

Table 3 gives an example of the features of an entity.

Table 3 An example for entity features

Stemmer          Amino Acid       Element                      Symbol
Lymphocyte       true             true                         false

Roman            Num Of Digits    Num Of Upper Case Letters    Num Of Lower Case Letters
False            0                0                            11

Length           case Pattern     brown                        label tag

3.4 Post-processing: rule-based approach
On closer examination, we find that the results of the CRF approach include some false positive chemical and biological entities. So,
we developed several additional rules to remove them. In addition, our post-processing step also helps adjust the text spans of entities, for example by adding a missing closing parenthesis.

We also found some questionable cases that affect our results. For example, in the file EP1481667_0004, the entity "dopamine receptor" occurs twice but is annotated only once. In our opinion, this violates the first rule of the annotation guideline in [8]: when an entity is nested in or overlaps with another entity, the more specific and informative entity should be annotated. In US20050222261_0004, "ACE inhibitors" was annotated as two entities, but in WO2004000294_0004 it was regarded as a single entity. Some entities such as "AMcAMP", "IcAMP" (Abbreviation), "amino acids", "agonist" and "methane sulfonic acid" were not annotated in some documents. "BMS- 204352" and "methyl testosterone" were not annotated in EP1481667_0004, but our system recognizes them as entities. These cases influence the results to some degree.

4. EXPERIMENTS
The patent corpus is available in 3 different sets: 1-Harmonized_set; 2-Full_set; 3-Training_set. We analyzed the training and harmonized datasets, and found some nested entities in the harmonized set, as discussed in Section 3.1. Since CRF++ cannot identify nested entities, we simply omit the entities with the shorter spans. Then, we insert the original text of the patents and the annotated entities into a MySQL database for the experiments. Each document is saved as a record in the database, with its sentences separated by a space (" "). Each term is stored in another table together with its class, offsets, fileID and so on.

The dataset is split into ten folds for the 10-fold cross validation. The training set in each round contains about 12,000 sentences and 500,000 features.

In CRF++, there are 4 major parameters ("-a", "-c", "-f" and "-p") that control the training conditions. The "-f" option sets the cut-off threshold: CRF++ uses only the features that occur no less than NUM times in the given training data. "-p" is the number of threads. In our submitted predictions, the parameters "-a", "-f" and "-p" are set to the default (CRF-L2), 2 and 4, respectively. The option "-c" trades off over-fitting against under-fitting. The predicted results are significantly influenced by this parameter, so it is better to find an optimal value by cross validation. Due to the constraints of experimental time, we only tried the values {2^-2, 2^-1, 2^0, 2^1, 2^2} for the "-c" option. Our 5 submitted runs correspond to different values of the "-c" option.

We also use Brown clustering [14] to improve recognition performance. Brown clustering is an agglomerative, bottom-up form of clustering that groups words into a binary tree of classes, using a merging criterion based on the log-probability of a text under a class-based language model. Our system uses the cluster memberships of words resulting from Brown clustering as features of each entity. Finally, we run the experiment 5 times in different settings: without Brown clustering, and with 500, 1,000, 1,500 and 2,000 clusters. The runs with Brown clusters have one more feature, "brown tokens", in the CRF++ template file than the run without Brown clusters.

However, our results are not as good as we expected (Table 4). In the analogous experiment, the entity subtask of the BioCreative IV CHEMDNER competition, the official scores are higher than ours. The average precision, recall and F1 score there are about 89.21%, 66.41% and 76.11%, respectively (see footnote 1).

In addition to reasons within our own system, some other factors may affect the results. Research on paper corpora often experiments with only the title, abstract and keywords of each paper, which contain less noisy data. In contrast, we use the patent corpus with full text. Patents focus on the protection of intellectual property rights, while papers focus on knowledge dissemination and sharing. To protect intellectual property rights and innovation, patent documents are written in a special way. The author of a paper, on the contrary, can choose the way of writing that is easiest for readers to understand.

5. CONCLUSIONS
We develop a chemical and biological entity recognition system and use the annotated chemical patent corpus to run experiments with it. In our recognition system, we treat the task as a sequence labeling problem instead of extracting the whole entity at once. We utilize some open-source NLP toolkits, such as OpenNLP and Stanford CoreNLP, and modify them with some additional rules to suit the patent corpus. In our system, CRF++ is adopted as the actual implementation for the sequence labeling problem. However, the results are not as good as we expected. As Table 4 shows, we get too many FP results and no FN results. Entities that are annotated in one patent but not in another may influence the experimental results. We will define some suitable rules to improve the recognition system in the future.

Table 4 Performance results of our system on the gold standard patent corpus (see footnote 2)

                 Run 1    Run 2    Run 3    Run 4    Run 5
best cost        2^1      2^1      2^0      2^1      2^1
TP               28981    29655    29473    29502    29451
TN               10517    15262    15626    15568    15668
FP               16607    11131    10790    11027    10875
FN               0        0        0        0        0
Precision (%)    63.57    72.71    73.20    72.79    73.03
Recall (%)       73.37    66.02    65.35    65.46    65.27
F1 score (%)     68.12    69.20    69.05    68.93    68.94

Footnote 1: The experiment data using the official dataset is available at http://www.sciteminer.org/XuShuo/Demo/CEM .
Footnote 2: Run 1 is the experiment without Brown clusters. The other four runs use 500, 1,000, 1,500 and 2,000 Brown clusters, respectively.

6. ACKNOWLEDGMENTS
This work was supported by the Natural Science Foundation of China: Research on Technology Opportunity Detection based on
Paper and Patent Information Resources under grant number 71403255, and the Key Work Project of Institute of Scientific and Technical Information of China (ISTIC): Research and Development on Knowledge Organization System and Intelligent Analysis Service Demonstration Platform for Science and Technology Literature in New Material Domain under grant number ZD2014-7-7.

7. REFERENCES
[1] Muresan S, Petrov P, Southan C, Kjellberg MJ, Kogej T, et al. (2011) Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data. Drug Discov Today 16: 1019–1030.
[2] Southan C, Boppana K, Jagarlapudi SA, Muresan S (2011) Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: Ranking 1654 human protein targets by assayed compounds and molecular scaffolds. J Cheminform 3: 14.
[3] De Ridder, D. et al. 2013. Pattern recognition in bioinformatics. Briefings in Bioinformatics. 14, 5 (Sep. 2013), 633–647.
[4] Grego, T. et al. 2009. Identification of Chemical Entities in Patent Documents. Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living, Pt II, Proceedings. S. Omatu et al., eds. Springer-Verlag Berlin. 942–949.
[5] Campos, D. et al. 2013. Gimli: open source and high-performance biomedical name recognition. BMC Bioinformatics. 14, (Feb. 2013), 54.
[6] Han, H. et al. 2014. Mining technical topic networks from Chinese patents. 1st International Workshop on Patent Mining and Its Applications, IPaMin 2014, Co-located with Konvens 2014, October 6, 2014 - October 7, 2014 (2014).
[7] Roman Klinger, Corinna Kolarik, Juliane Fluck, Martin Hofmann-Apitius, and Christoph M. Friedrich, 2008. Detection of IUPAC and IUPAC-Like Chemical Names. Bioinformatics, Vol. 24, No. 13, pp. i268-i276.
[8] Akhondi, S.A. et al. 2014. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining. PLoS ONE. 9, 9 (2014), e107477. DOI: 10.1371/journal.pone.0107477
[9] Zimmermann, M. et al. 2005. Information Extraction in the Life Sciences: Perspectives for Medicinal Chemistry, Pharmacology and Toxicology. Current Topics in Medicinal Chemistry. 5, 8 (Aug. 2005), 785–796.
[10] Hobbs J R. The generic information extraction system. In: MUC. 1993: 87-91.
[11] Xu S, An X, Zhu L, et al. A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature. Journal of Cheminformatics, 2015 (Suppl 1): S11.
[12] Wei, C.H., Harris, B.R., Kao, H.Y., Lu, Z.: tmVar: A text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 29(11) (2013) 1433–1439
[13] Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: ICML'01. (2001) 282–289
[14] Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 384-394). Association for Computational Linguistics.