Chemical and Biological Entity Recognition System from Patent Documents

Hongchang Lai (laihc2013@istic.ac.cn), Shuo Xu (xush@istic.ac.cn), Lijun Zhu (zhulj@istic.ac.cn)
Information Technology Supporting Center, Institute of Scientific and Technical Information of China
No. 15 Fuxing Rd., Haidian District, Beijing 100038, P.R. China
+86 10 5888 2447

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org. Proceedings of the Second International Workshop on Patent Mining and its Applications (IPAMIN). May 27–28, 2015, Beijing, China.

ABSTRACT
It is crucial to explore the chemical and biological space covered by patent documents. To recognize chemical and biological entities, we develop a recognition system on the basis of open-source machine learning and natural language processing (NLP) toolkits. The processing pipeline consists of three major components: pre-processing (sentence detection and tokenization), recognition (a conditional random field (CRF) based approach), and post-processing (a rule-based approach). The paper introduces each part in detail. Finally, extensive experiments on the annotated chemical patent corpus are conducted, and the balanced F-measure is 69.20% with 10-fold cross validation. The results indicate that the performance on patent documents is slightly lower than that of counterpart systems on paper and news corpora.

Keywords
Conditional Random Field (CRF); Chemical and Biological Entity Recognition; Patent Mining; Cross Validation

1. INTRODUCTION
It is crucial to explore the chemical and biological space covered by patent documents. For example, it can help speed up early-stage medicinal chemistry activities [1] [2]. Although patent documents contain many valuable chemical and biological entities, such as chemical compounds, genes, proteins and drugs, automatic recognition systems for patent documents are still very limited.

For paper and news documents, however, many identification approaches have been proposed and the resulting systems have been developed. In our opinion, the reasons are two-fold: (a) annotated patent corpora are not available to the public; (b) patents are complex legal documents which are very difficult to understand. But the situation will improve continually, since there is an increasing interest in patent mining, as seen in BIOINFORMATICS [3], BioCreative [4], JNLPBA [5] and iPaMin [6].

An annotated chemical patent corpus [8] was published by Akhondi et al., which enables the development of a chemical and biological entity recognition system. Even so, it is still a rather challenging task to automatically recognize chemical and biological entities in unstructured documents, especially patents [7], since patents are complex legal documents that can run to hundreds of pages.

In this paper we explore a chemical and biological entity recognition system for patent documents using approaches similar to those in [11]. Thus, one can see whether it is feasible to simply borrow these methods. The rest of the article is organized as follows. Section 2 gives an overview of the annotated patent corpus. Section 3 introduces the recognition system and the methods we used. Section 4 describes the annotated corpus we used and some details of our experiments.

2. DATASETS OVERVIEW
Akhondi et al. [8] have produced a gold standard chemical patent corpus in which 47 patents have been annotated by at least three annotators. The full-text patents and annotated entities are publicly accessible at www.biosemantics.org.

We analyzed the training and harmonized datasets and found some nested chemical and biological entities in the harmonized set. In our system, CRF++ is adopted as the actual implementation for the sequence labeling problem. Since CRF++ cannot identify nested entities, we simply omit the entities with the shorter span. There are 1239 entities of the types "OCRERRORSPELL" and "OCRERRORLINE" in the original annotation corpus, and some of them are nested. Finally, we reduced the number of entities from 37,776 to 37,288 by removing 488 nested entities. The harmonized set was produced from the 47 common patents and includes a total of fourteen classes, 9857 unique terms and 37,288 annotated terms (see Table 1).

The results indicate that IUPAC (International Union of Pure and Applied Chemistry) entities and generic names have been annotated far more often than any other chemical type. On the other hand, InChI (International Chemical Identifier), CAS (Chemical Abstracts Service) registry numbers and SMILES (Simplified molecular input line entry specification) appear rarely in the chemical patents. Since we removed one of the label tags of entities which have two or more tags, the counts differ slightly from those in [8].

Table 1 Number of annotated terms and unique terms in the harmonized set of the gold standard corpus after removing nested entities

Type            Description           Annotated Terms   Unique Terms
M               IUPAC                 13943             4592
I               SMILES                20                20
Y               InChi                 0                 0
D               Trademark             2355              897
B               Abbreviation          2087              146
C               CAS number            6                 5
F               Formula               1115              160
R               Registry Number       140               95
G               Generic               8381              811
T               Target                3221              654
Disease         Disease               3765              1205
MOA             MOA                   1016              197
OCRERRORSPELL   Spelling error        1189              1029
OCRERRORLINE    Spurious line break   50                46
Total                                 37288             9857

3. SYSTEMS DESCRIPTION AND METHODS
Based on a summary of the principal methods used in the MUC-4 (Fourth Message Understanding Conference) systems, Hobbs proposed a generic information extraction system [10] which consists of ten modules. Grounded in a large amount of practice, it is the theoretical basis for our system. We also refer to the chemical entity recognition system published in [11].

As shown in Figure 1, our system is a serial pipeline consisting of three major components. First, sentence boundaries are detected in the annotated chemical patents. The sentences are split by tabs ("\t"). Each detected sentence is then tokenized. Second, chemical and biological entities are extracted from the corpus with a CRF-based approach. A 10-fold cross validation method is adopted to evaluate the effect of our recognition system. Finally, a rule-based approach is applied as post-processing. Each step is outlined in detail in the following subsections.

[Figure 1 diagram: Annotated Chemical Patent Corpus, Sentence Detection, Tokenization, CRF-based Approach, Rule-Based Approach, Results]

Figure 1 The simplified system processing modules. The pipeline includes three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach) and post-processing (rule-based approach)

3.1 Pre-process: sentence detection, and tokenization
There are two kinds of documents for each patent in the annotated chemical patent corpus: the original patent text and the entity annotations for each part. In the corpus, each patent is divided into several partitions, and each partition contains different parts of the patent document. In general, the lines of each subdocument are irregular: a line may or may not be a single sentence. For example, in the document named US4659716_0001 of the training_set, some lines are metadata features of the patent such as the title, abstract and inventors; some lines contain two or more sentences.

In the system, the OpenNLP sentence detector is utilized. Detecting sentence boundaries is challenging because punctuation marks are ambiguous. To improve the performance of sentence boundary detection, we gathered the abbreviations in the corpus in advance, such as "var.", "e.g." and "sp.". The annotated documents in particular contain entities that include full stops, for example "EC 3.4.24.11" or "MgCl2.6H2O". We then generated several rules; for instance, if the current sentence ends with one of these abbreviations or a comma, the current and subsequent sentences are merged into a new one. For the metadata features mentioned above, each line is regarded as one sentence, because the metadata features are shorter than other sentences and carry less information about entities.

In the end, all the sentences are combined into one large document. Each line of the document is a subpart of a patent. The line format is as follows:

    fileID sentence. sentence.

Each line begins with the file ID of the source of the sentences, followed by one tab. Sentences are separated by a space (" ").

The tokenizer in the system is based on the OpenNLP toolkit. It divides the sentences above into the tokens we need. However, the original tokenizer gives poor results and cannot be applied directly to the sequence labeling problem. We therefore adopted some improvements and obtained much better fine-grained tokens. An example is the entity "(S)-(-)-α,α-diphenyl-2-pyrrolidinemethanol" in US5650521_0003, whose entity type is "M", which means "IUPAC".

Table 2 Examples of Chemical component entity labels

token   …   (   S   )   -   (   -
label   O   B   I   I   I   I   I

token   )   -   α   ,   α   -   diphenyl
label   I   I   I   I   I   I   I

token   -   2   -   pyrrolidinemethanol   .   …
label   I   I   I   E                     O

As shown in Table 2, punctuation marks (brackets, dashes, etc.), Greek symbols and numbers are regarded as isolated tokens.

In the annotation documents, the types "OCRERRORSPELL" and "OCRERRORLINE" are marked at the end of each document. Entities of these two types may also carry their actual entity type, as in US20050222261_0003:

    T109 D 4726 4738 siruvastatin
    T343 OCRERRORSPELL 4726 4738 siruvastatin

However, some OCRERRORSPELL entities have only one type. That is, some of them are nested with another entity type, while others have a unique type label. For consistency, a single type label is given to each entity to get rid of nested types. There are 1239 entities of the types "OCRERRORSPELL" and "OCRERRORLINE" in the original annotation corpus, and some of them are nested. Finally, we reduce the number of entities from 37,776 to 37,288 by removing 488 nested entities.

3.2 Recognition: CRF-based approach
As mentioned above, the chemical and biological entity recognition problem is treated as a sequence labeling problem (Table 2). Conditional random fields, a framework for building probabilistic models to segment and label sequence data [13], avoid a fundamental limitation of MEMMs (maximum entropy Markov models) and other discriminative Markov models based on directed graphical models, and offer several advantages over hidden Markov models and stochastic grammars. A CRF can take context into account; e.g., the linear-chain CRF in natural language processing predicts sequences of labels for sequences of input samples. Given observations X and random variables Y, the random variables Y are conditioned on X, and the conditional distribution p(Y | X) is then modeled. For parameter estimation, CRF++ adopts the quasi-Newton L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) method to perform the unconstrained optimization, and uses a line search to compute the step size.

Each annotated entity in the patent corpus can be classified into one of the fourteen classes. A 4-tag method is used to label the chemical entities with B, I, E and O, which mean "beginning of the entity", "word in the entity", "end of the entity" and "the other words". The nested annotated entities mentioned in Section 3.1 are unified to a single type, because CRF++ cannot process nested entities. The harmonized set, merged from the annotations of the 47 patents annotated by at least three groups, is used as the training set with different entity types (chemicals and their sub-entities, diseases, MOAs, and targets).

3.3 The features for CRF
Our system exploits four different types of features:

General linguistic features. Our system includes the original tokens, as well as stemmed tokens, as features, using the Porter stemmer from Stanford CoreNLP.

Characteristic features. Since many entities contain numbers, Greek letters, Roman numbers, amino acids, chemical elements, and special characters, our system calculates several statistics as features for each token, including its number of digits, number of upper- and lower-case letters, number of all characters, and the presence or absence of specific characters, Greek letters, Roman numbers, amino acids, or chemical elements.

Case pattern features. Similar to [12], each upper-case alphabetic character, each lower-case one and each digit (0-9) is replaced by 'A', 'a' and '0', respectively. Moreover, our system also merges consecutive letters and numbers, generating additional features that consist of a single letter 'a' and a single number '0'.

Contextual features. For each token, our system includes the linguistic features of two neighboring tokens on each side.

An example of the features of an entity is given in Table 3.

Table 3 An example for entity features

Stemmer      Amino Acid        Element                     Symbol
Lymphocyte   true              true                        false

Roman        Num Of Digitals   Num Of Upper Case Letters   Num Of Lower Case Letters
False        0                 0                           11

Length       case Pattern      brown                       label tag

3.4 Post-processing: rule-based approach
On closer examination, we find that the results of the CRF approach include some false positive chemical and biological entities, so we developed several additional rules to remove them. In addition, our post-processing step also helps adjust the text spans of entities, for example by adding a missing closing parenthesis.

But we found some problematic cases in our results. In the file EP1481667_0004, the entity "dopamine receptor" occurs twice but is annotated only once. In our opinion, this violates the first rule of the annotation guideline in [8]: when an entity is nested or overlaps with another entity, the entity should be annotated as the more specific and informative one. In US20050222261_0004, "ACE inhibitors" was annotated as two entities, but in WO2004000294_0004 it was regarded as a single one. Some entities such as "AMcAMP", "IcAMP" (Abbreviation), "amino acids", "agonist" and "methane sulfonic acid" were not annotated in some documents. "BMS-204352" and "methyl testosterone" were not annotated in EP1481667_0004, but our system recognizes them as entities. These cases influence the results to some degree.

4. EXPERIMENTS
The patent corpus is available in 3 different sets: 1-Harmonized_set; 2-Full_set; 3-Training_set. We analyzed the training and harmonized datasets and found some nested entities in the harmonized set, as discussed in Section 3.1. Since CRF++ cannot identify nested entities, we simply omit the entities with the shorter span. Then, we insert the original text of the patents and the annotated entities into a MySQL database to run the experiments. Each document is saved as a record in the database, with its sentences separated by a space (" "). Each term is stored in another table with its class, offsets, fileID and so on.

The dataset is split for 10-fold cross validation. The training set in each round contains about 12,000 sentences and 500,000 features.

In CRF++, there are 4 major parameters ("-a", "-c", "-f" and "-p") that control the training condition. The option "-f" sets the cut-off threshold for features: CRF++ uses only the features that occur no less than NUM times in the given training data. "-p" is the number of threads. In our submitted predictions, the parameters "-a", "-f" and "-p" are set to the default (CRF-L2), 2 and 4, respectively. The option "-c" trades off over-fitting against under-fitting. The predicted results are significantly influenced by this parameter, so it is better to find an optimal value by cross validation. We only set the "-c" option to values in {2^-2, 2^-1, 2^0, 2^1, 2^2} due to the constraints of experimental time. Our 5 submitted runs correspond to different values of the "-c" option.

We also use Brown clustering [14] to improve recognition. Brown clustering is an agglomerative, bottom-up form of clustering that groups words into a binary tree of classes, using a merging criterion based on the log-probability of a text under a class-based language model. Our system uses the cluster memberships of words resulting from Brown clustering as features of each entity. In the end, we run the experiment 5 times in different ways: without Brown clustering, and with 500, 1,000, 1,500 and 2,000 clusters. The experiments with Brown clusters have one more feature, "brown tokens", in the CRF++ template file than the experiment without Brown clusters.

However, our results are not as good as we expected (Table 4). In the analogous experiment, the entity subtask of the BioCreative IV CHEMDNER competition, the official scores are higher than ours. The average precision, recall and F1 score are about 89.21%, 66.41% and 76.11%, respectively¹.

In addition to reasons within our own system, some other factors may affect the results. Research on paper corpora often runs experiments on only the title, abstract and keywords of a paper, which contain less noise. We, however, use the full text of patents. Patents focus on the protection of intellectual property rights, whereas papers focus on the dissemination and sharing of knowledge. To protect intellectual property rights and innovation, patent documents are written in a special way. By contrast, the author of a paper can choose the way of writing that is easiest for readers to understand.

Table 4 Performance results in our system for the gold standard patent corpus²

                 Run 1   Run 2   Run 3   Run 4   Run 5
best cost        2^1     2^1     2^0     2^1     2^1
TP               28981   29655   29473   29502   29451
TN               10517   15262   15626   15568   15668
FP               16607   11131   10790   11027   10875
FN               0       0       0       0       0
Precision (%)    63.57   72.71   73.20   72.79   73.03
Recall (%)       73.37   66.02   65.35   65.46   65.27
F1 score (%)     68.12   69.20   69.05   68.93   68.94

¹ The experiment data using the official dataset is available at the website: http://www.sciteminer.org/XuShuo/Demo/CEM .
² Run 1 is the experiment without Brown clusters. The other four runs use Brown clustering with 500, 1,000, 1,500 and 2,000 clusters, respectively.

5. CONCLUSIONS
We develop a chemical and biological entity recognition system and use the annotated chemical patent corpus to run experiments with it. In our recognition system, we treat the task as a sequence labeling problem instead of extracting the whole entity at once. We utilize some open-source NLP toolkits, such as OpenNLP and Stanford CoreNLP, and modify them with some additional rules to suit the patent corpus. In our system, CRF++ is adopted as the actual implementation for the sequence labeling problem. However, the results are not as good as we expected. As Table 4 shows, we get too many FP results and nothing in FN. Entities that are annotated in one patent but not in another may influence the experimental results. We will define some suitable rules to improve the recognition system in the future.

6. ACKNOWLEDGMENTS
This work was supported by the Natural Science Foundation of China: Research on Technology Opportunity Detection based on Paper and Patent Information Resources under grant number 71403255, and Key Work Project of Institute of Scientific and Technical Information of China (ISTIC): Research and Development on Knowledge Organization System and Intelligent Analysis Service Demonstration Platform for Science and Technology Literature in New Material Domain under grant number ZD2014-7-7.

7. REFERENCES
[1] Muresan S, Petrov P, Southan C, Kjellberg MJ, Kogej T, et al. (2011) Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data. Drug Discov Today 16: 1019–1030.
[2] Southan C, Boppana K, Jagarlapudi SA, Muresan S (2011) Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: Ranking 1654 human protein targets by assayed compounds and molecular scaffolds. J Cheminform 3: 14.
[3] De Ridder, D. et al. 2013. Pattern recognition in bioinformatics. Briefings in Bioinformatics. 14, 5 (Sep. 2013), 633–647.
[4] Grego, T. et al. 2009. Identification of Chemical Entities in Patent Documents. Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living, Pt Ii, Proceedings. S. Omatu et al., eds. Springer-Verlag Berlin. 942–949.
[5] Campos, D. et al. 2013. Gimli: open source and high-performance biomedical name recognition. Bmc Bioinformatics. 14, (Feb. 2013), 54.
[6] Han, H. et al. 2014. Mining technical topic networks from Chinese patents. 1st International Workshop on Patent Mining and Its Applications, IPaMin 2014, Co-located with Konvens 2014, October 6, 2014 - October 7, 2014 (2014).
[7] Roman Klinger, Corinna Kolarik, Juliane Fluck, Martin Hofmann-Apitius, and Christoph M. Friedrich, 2008. Detection of IUPAC and IUPAC-Like Chemical Names. Bioinformatics, Vol. 24, No. 13, pp. i268-i276.
[8] Akhondi, S.A. et al. 2014. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining. PLoS ONE. 9, 9 (2014), e107477. DOI: 10.1371/journal.pone.0107477
[9] Zimmermann, M. et al. 2005. Information Extraction in the Life Sciences: Perspectives for Medicinal Chemistry, Pharmacology and Toxicology. Current Topics in Medicinal Chemistry. 5, 8 (Aug. 2005), 785–796.
[10] Hobbs J R. The generic information extraction system[C]//MUC. 1993: 87-91.
[11] Xu S, An X, Zhu L, et al. A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature[J]. Journal of Cheminformatics, 2015 (Suppl 1): S11.
[12] Wei, C.H., Harris, B.R., Kao, H.Y., Lu, Z.: tmVar: A text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 29(11) (2013) 1433–1439
[13] Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: ICML'01. (2001) 282–289
[14] Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 384-394). Association for Computational Linguistics.
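
The fine-grained tokenization of Section 3.1, the B I E O labeling of Section 3.2 and the case pattern features of Section 3.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation, which builds on OpenNLP, Stanford CoreNLP and CRF++. The regular expression, the function names and the choice to tag a single-token entity as B are assumptions made for illustration only, and the merged pattern follows one possible reading of the description in Section 3.3.

```python
import re

# Assumed tokenizer: letter runs and digit runs stay whole; every other
# non-space character (bracket, dash, comma, Greek letter) is its own token.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\S")


def tokenize(text):
    """Split text into fine-grained tokens."""
    return TOKEN_RE.findall(text)


def bieo_labels(n_tokens, start, end):
    """Label tokens start..end-1 as one entity (B, I, E); all others are O."""
    labels = ["O"] * n_tokens
    if end <= start:
        return labels  # empty span: nothing to label
    for i in range(start, end):
        labels[i] = "I"
    if end - start > 1:
        labels[end - 1] = "E"
    labels[start] = "B"  # assumption: a single-token entity is tagged B
    return labels


def case_pattern(token):
    """Replace upper-case letters by 'A', lower-case letters by 'a', digits by '0'."""
    token = re.sub(r"[A-Z]", "A", token)
    token = re.sub(r"[a-z]", "a", token)
    return re.sub(r"[0-9]", "0", token)


def merged_pattern(token):
    """Collapse each run of letters to a single 'a' and each run of digits to a single '0'."""
    token = re.sub(r"[A-Za-z]+", "a", token)
    return re.sub(r"[0-9]+", "0", token)


# The entity from Table 2, surrounded by two non-entity tokens.
sentence = "the (S)-(-)-α,α-diphenyl-2-pyrrolidinemethanol ."
tokens = tokenize(sentence)
labels = bieo_labels(len(tokens), 1, len(tokens) - 1)

# One row per token, with the label in the last column.
rows = ["\t".join([tok, case_pattern(tok), merged_pattern(tok), lab])
        for tok, lab in zip(tokens, labels)]
print(" ".join(labels))
```

Each entry of `rows` holds a token, its two patterns and its label separated by tabs, with the label in the last column. This mirrors the column-per-feature layout that CRF++ reads for training data, although the real system uses many more feature columns than this sketch.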