VinAI at ChEMU 2020: An accurate system for named entity recognition in chemical reactions from patents

Mai Hoang Dao 1,2 and Dat Quoc Nguyen 1
1 VinAI Research, Vietnam
  {v.maidh3, v.datnq9}@vinai.io
2 Posts and Telecommunications Institute of Technology, Vietnam

Abstract. This paper describes our VinAI system for the ChEMU task 1 of named entity recognition (NER) in chemical reactions. Our system employs a BiLSTM-CNN-CRF architecture [6] with additional contextualized word embeddings. It achieves very high performance, officially ranking second in terms of both exact- and relaxed-match F1 scores, at 94.33% and 96.84%, respectively. In a post-evaluation phase, fixing a bug in the mapping script that converts the column-based format into the brat standoff format helps our system obtain higher results. In particular, we obtain an exact-match F1 score of 95.21% and especially a relaxed-match F1 score of 97.26%, the highest relaxed-match F1 among all participating systems. We believe our system can serve as a strong baseline for future research and downstream applications of chemical NER over chemical reactions from patents.

Keywords: Named entity recognition; Chemical reactions; Patents; Neural network.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

The discovery of new chemical compounds plays an essential role in the chemical industry. To disclose newly discovered chemical compounds, patent documents are often selected as the initial venues; only a small fraction of these compounds are later published in journals, and this usually takes up to 3 years after the patent disclosure [14]. Patents therefore contain critical and timely information about new chemical compounds and serve as starting points for chemical research in both academia and industry [1]. Due to the huge volume of new chemical patent applications [9], it is becoming increasingly important to develop automatic information extraction approaches for large-scale mining of chemical information from these patent documents [1].

Chemical named entity recognition (NER) is a fundamental step for information extraction from chemical patents, supporting many downstream tasks such as chemical reaction prediction [12,17], chemical synthesis planning [13] and the like. The ChEMU (Cheminformatics Elsevier Melbourne University) task 1 provides participants with the opportunity to develop automatic chemical NER systems for chemical reactions in chemical patents. The task is to identify crucial elements of a chemical reaction, including compounds, conditions and yields, as well as their specific roles in the reaction. Details of this task can be found in the overview paper of the ChEMU lab [3].

In this paper, we present our VinAI team's system for the ChEMU task 1. Our system is based on the well-known BiLSTM-CNN-CRF architecture [6] with additional contextualized word embeddings. It officially obtains the second best performance in terms of both exact- and relaxed-match F1 scores, at 94.33% and 96.84%, respectively. In a post-evaluation phase, fixing a column-to-brat conversion bug helps our system obtain even better results: 95.21% exact-match F1 and especially 97.26% relaxed-match F1. We thus obtain the highest relaxed-match F1 score in comparison to all other participating systems.
We also provide an ablation study to investigate the contributions of the different types of input word representations in the full system, reconfirming the effectiveness of contextualized word embeddings for chemical NER [18].

2 Task description

The ChEMU task 1 of "Named entity recognition" involves identifying chemical compounds and their specific types. In particular, the task assigns a label to each chemical compound according to the role it plays within a chemical reaction. In addition to identifying chemical compounds, the task also requires identification of the label of the chemical reaction, the temperatures and reaction times at which the reaction is carried out, as well as the yields obtained for the final chemical product. The task defines 10 different entity type labels, as listed in Table 1, and involves both entity boundary prediction and entity label classification. See [3,10] for more details.

Table 1. Definitions of entity types. "Abbr." denotes abbreviation.

Label             | Definition                                                                                               | Abbr.
STARTING_MATERIAL | A substance that is consumed in the course of a chemical reaction, providing atoms to products.          | ST
REAGENT_CATALYST  | A reagent is a compound added to a system to cause or help with a chemical reaction.                     | RC
SOLVENT           | A solvent is a chemical entity that dissolves a solute, resulting in a solution.                         | SO
REACTION_PRODUCT  | A product is a substance that is formed during a chemical reaction.                                      | RP
OTHER_COMPOUND    | Other chemical compounds that are not the products, starting materials, reagents, catalysts or solvents. | OT
TIME              | The reaction time of the reaction.                                                                       | TI
TEMPERATURE       | The temperature at which the reaction was carried out.                                                   | TE
YIELD_PERCENT     | Yield given in percent values.                                                                           | YP
YIELD_OTHER       | Yields provided in units other than %.                                                                   | YO
EXAMPLE_LABEL     | A label associated with a reaction specification.                                                        | EX

3 Our system

In this section, we present our VinAI system for the ChEMU task 1. We formulate the task as a sequence labeling problem with the BIO tagging scheme. Following [18], our system employs the well-known BiLSTM-CNN-CRF model [6] with additional contextualized word embeddings. Figure 1 illustrates the architecture of our participating system.

Fig. 1. Illustration of our participating system's architecture, drawn based on [18]: for each input token (e.g. "added", "sulfuric", "acid"), the CNN-based character-level embedding, the pre-trained word embedding and the ELMo embedding are concatenated into v_i, encoded by a BiLSTM into r_i, linearly transformed into h_i, and decoded by a CRF layer into BIO tags (e.g. O, B-REAGENT_CATALYST, I-REAGENT_CATALYST).

In particular, our system represents each word token w_i in an input sequence w_1, w_2, ..., w_n by a vector v_i, which results from concatenating the pre-trained word embedding, the CNN-based character-level word embedding [6] and the contextualized word embedding of the token w_i. Here, we utilize the pre-trained word embeddings released by [18], which are trained on a corpus of 84K chemical patents (1B word tokens) using the Word2Vec skip-gram model [7]. In addition, we utilize the contextualized word embeddings generated by a pre-trained ELMo language model [11], which is trained on the same corpus of 84K chemical patents [18] (available at https://github.com/zenanz/ChemPatentEmbeddings). The vector representations v_i are then fed into a BiLSTM encoder to extract latent feature vectors r_i for the input words w_i. Each latent feature vector r_i is linearly transformed into h_i before being fed into a linear-chain CRF layer for NER label prediction [5]. A cross-entropy loss is computed during training, while the Viterbi algorithm is used for decoding.
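To make the architecture concrete, below is a minimal PyTorch sketch of this tagger. It is an illustrative reimplementation rather than our exact configuration: the dimension sizes and the use of AllenNLP's ConditionalRandomField module are assumptions, and the ELMo embeddings are assumed to be precomputed and passed in as a tensor.

import torch
import torch.nn as nn
from allennlp.modules import ConditionalRandomField

class BiLSTMCNNCRFTagger(nn.Module):
    """Sketch of the BiLSTM-CNN-CRF tagger of Figure 1 (dimensions illustrative)."""

    def __init__(self, vocab_size, char_vocab_size, num_tags,
                 word_dim=200, char_dim=25, char_filters=30,
                 elmo_dim=1024, hidden_dim=200):
        super().__init__()
        # Pre-trained word embeddings; kept frozen during training in our setup.
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.word_emb.weight.requires_grad_(False)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # Character-level CNN [6]: convolve over each word's characters, then max-pool.
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters + elmo_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.projection = nn.Linear(2 * hidden_dim, num_tags)  # produces h_i
        self.crf = ConditionalRandomField(num_tags)            # linear-chain CRF [5]

    def forward(self, words, chars, elmo, tags=None, mask=None):
        # words: (batch, seq); chars: (batch, seq, max_chars); elmo: (batch, seq, elmo_dim)
        b, s, c = chars.shape
        ch = self.char_emb(chars).view(b * s, c, -1).transpose(1, 2)
        ch = torch.relu(self.char_cnn(ch)).max(dim=2).values.view(b, s, -1)
        v = torch.cat([self.word_emb(words), ch, elmo], dim=-1)  # v_i = concatenation
        r, _ = self.bilstm(v)                                    # r_i = BiLSTM features
        logits = self.projection(r)                              # h_i = per-tag scores
        if mask is None:
            mask = torch.ones(b, s, dtype=torch.bool, device=words.device)
        if tags is not None:
            return -self.crf(logits, tags, mask)    # training loss: negative CRF log-likelihood
        return self.crf.viterbi_tags(logits, mask)  # decoding: Viterbi best tag sequences

In the actual system, the word and ELMo lookup tables are the chemical-patent embeddings of [18], and all remaining parameters are trained as described in Section 4.1.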
4 Experiments

4.1 Experimental setup

Dataset: For system development, the ChEMU task 1 provides a corpus of 1125 chemical reaction snippets with gold standard NER annotations in the brat standoff format [15]. Although this corpus is pre-split into a training set of 900 snippets and a validation set of 225 snippets, participants are free to use it in any manner they find useful when training and tuning their systems, e.g. using a different split or performing cross-validation. We thus employ only the first 100 snippets of the provided validation set (sorted by file names: 0050–0690) for validation, and merge the remaining 125 snippets into the provided training set, resulting in a new training set of 1025 snippets in total. Following [18], we employ the OpenNLP toolkit [8] for sentence segmentation and the OSCAR4 tokenizer [4] to tokenize training and validation sentences, then convert these sentences into the CoNLL column-based format with the BIO tagging scheme.

Implementation: Our system is implemented based on the AllenNLP framework [2]. For training, we use exactly the same hyper-parameters as in [18], except for a batch size of 24. The pre-trained word embeddings and the pre-trained ELMo model are kept fixed, while all other model parameters are updated during training. We train our system for 50 epochs, compute the standard exact-match F1 score on the validation set after each training epoch, and select the model with the highest validation exact-match F1 score.

Evaluation phase: For the final evaluation phase, the ChEMU task 1 provides a raw test set consisting of 375 patent snippets. Each test snippet is sentence-segmented and tokenized using OpenNLP and OSCAR4, respectively. We then convert the tokenized test sentences into the column-based format and apply our selected model to predict NER labels. Finally, we use our own mapping script to convert the predicted BIO-based NER outputs into the brat standoff format, and submit the brat-formatted test outputs for evaluation.

Evaluation metrics: The ChEMU task 1 uses three evaluation metrics, namely precision, recall and F1 score, under both "exact" and "relaxed" span matching conditions [16].
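As a concrete illustration of the mapping step described in the "Evaluation phase" paragraph above, the sketch below converts one snippet's token-level BIO predictions back into brat standoff lines. This is a minimal illustrative version, not our actual script: the names are hypothetical, and it assumes each token carries its character offsets in the original snippet text. Misaligned entity offsets in precisely this step are the kind of bug discussed in Section 4.2.

def bio_to_brat(token_spans, tags, text):
    """token_spans: list of (start, end) character offsets; tags: BIO labels."""
    entities, current = [], None
    for (start, end), tag in zip(token_spans, tags):
        if tag.startswith("B-"):                 # a new entity begins here
            if current:
                entities.append(current)
            current = [tag[2:], start, end]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = end                     # extend the current entity's span
        else:                                    # "O" tag or an inconsistent "I-" tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    # One standoff line per entity: "T<id>\t<LABEL> <start> <end>\t<surface text>"
    return ["T{}\t{} {} {}\t{}".format(i + 1, label, start, end, text[start:end])
            for i, (label, start, end) in enumerate(entities)]

For example, with text "added sulfuric acid", token_spans [(0, 5), (6, 14), (15, 19)] and tags ["O", "B-REAGENT_CATALYST", "I-REAGENT_CATALYST"], the sketch yields the single standoff line "T1\tREAGENT_CATALYST 6 19\tsulfuric acid".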
Table 2. Our official evaluation results (in %) on the test set, i.e. the predicted test outputs submitted during the evaluation phase. The subscripts denote our ranking among participating systems.

                  |        Exact-match        |       Relaxed-match
Entity label      |    P       R       F1     |    P       R       F1
STARTING_MATERIAL |  93.24   91.14   92.18    |  96.40   94.10   95.24
REAGENT_CATALYST  |  88.54   90.48   89.50    |  91.84   94.04   92.93
SOLVENT           |  93.64   96.26   94.93    |  94.55   96.97   95.74
REACTION_PRODUCT  |  89.12   90.99   90.05    |  94.85   97.18   96.00
OTHER_COMPOUND    |  97.10   95.29   96.18    |  98.84   97.10   97.96
TIME              |  98.89   98.67   98.78    |  100.0   99.56   99.78
TEMPERATURE       |  95.54   94.44   94.99    |  99.01   98.68   98.84
YIELD_PERCENT     |  99.74   99.74   99.74    |  99.74   99.74   99.74
YIELD_OTHER       |  97.68   95.68   96.67    |  97.91   95.91   96.90
EXAMPLE_LABEL     |  91.10   87.97   89.50    |  94.07   90.83   92.42
Overall           | 94.62_2 94.05_3 94.33_2   | 97.07_1 96.61_3 96.84_2

Table 3. Our post-evaluation results (in %) on the test set, i.e. the predicted test outputs resulting from fixing the column-to-brat conversion bug, submitted right after the evaluation phase. The subscripts denote our ranking among participating systems.

                  |        Exact-match        |       Relaxed-match
Entity label      |    P       R       F1     |    P       R       F1
STARTING_MATERIAL |  93.56   91.98   92.77    |  96.71   94.94   95.82
REAGENT_CATALYST  |  90.47   92.26   91.36    |  92.22   94.05   93.12
SOLVENT           |  94.09   96.73   95.39    |  94.55   97.20   95.85
REACTION_PRODUCT  |  90.68   92.16   91.42    |  95.51   96.96   96.23
OTHER_COMPOUND    |  97.05   95.44   96.24    |  98.68   97.30   97.99
TIME              |  99.12   99.12   99.12    |  100.0   99.78   99.89
TEMPERATURE       |  95.70   94.44   95.00    |  99.01   99.01   99.01
YIELD_PERCENT     |  99.49   100.0   99.74    |  99.49   100.0   99.74
YIELD_OTHER       |  97.94   97.05   97.49    |  98.17   97.27   97.72
EXAMPLE_LABEL     |  97.38   95.70   96.53    |  97.67   95.99   96.82
Overall           | 95.38_2 95.04_2 95.21_2   | 97.37_1 97.16_1 97.26_1

4.2 Main results

Table 2 shows the official results of our system's outputs on the test set, submitted during the evaluation phase. By employing a standard neural architecture, our system obtains high performance and is officially ranked second among 11 participating systems in terms of both exact- and relaxed-match F1 scores.

Note that during the evaluation phase we were unfortunately unaware of a bug in our mapping script, which converts the predicted test outputs in the column-based format into the brat standoff format. Right after the evaluation phase, we fixed the bug, reran our column-to-brat conversion script to produce a new submission, and asked the ChEMU organizers to help evaluate this new submission. Table 3 details our post-evaluation results. Fixing the mapping bug improves our exact-match F1 by 0.9% and our relaxed-match F1 by 0.4% absolute, leading to the highest relaxed-match F1 score compared to all other participating systems.

4.3 Ablation study

Table 4. Ablation "exact-match" results (in %) on the development set.

Model                                              |    P       R       F1
Our system (full)                                  |  97.41   97.07   97.24
(a) w/o Word2Vec-based pre-trained word embeddings |  96.32   96.66   96.49
(b) w/o CNN-based character-level word embeddings  |  97.24   97.01   97.13
(c) w/o ELMo-based contextualized word embeddings  |  96.25   96.19   96.22

Table 4 presents ablation tests over 3 factors of our system on the development set: (a) removing the Word2Vec-based pre-trained word embeddings, (b) removing the CNN-based character-level word embeddings, and (c) removing the ELMo-based contextualized word embeddings. Factor (a) degrades the exact-match F1 score by 0.8% absolute, while factors (b) and (c) degrade it by 0.1% and 1.0%, respectively. The contribution of the CNN-based character-level word embeddings is not substantial, probably because the pre-trained ELMo language model we employ also builds on character embeddings.

4.4 Error analysis

To understand the source of errors, we perform an error analysis on the development set. Among 56 error cases in total, 34 cases are predicted with correct entity boundaries (i.e. exact span) but incorrect labels (see the corresponding confusion matrix in Figure 2), while 17 cases have correct entity labels but overlapping, inexact spans. Figures 3 and 4 show examples of these two types of errors. In particular, Figure 3 shows an example of an exact span with an incorrect label, where the REAGENT_CATALYST entity "HCL" is predicted as OTHER_COMPOUND. The reason is probably that "HCL" and other common chemical compounds such as "water", "citric acid" and the like play different or multiple roles across chemical reactions. Note that there is no error case with both an incorrect label and an overlapping, inexact span.
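The three-way categorization used in this error analysis (exact span, overlapping inexact span, no overlap) can be expressed as a small diagnostic helper; the sketch below is illustrative, under our reading of the span matching conditions [16], and is not the official evaluation script.

def categorize(pred, golds):
    """pred and each gold entity: (label, start, end). Returns the error category."""
    p_label, p_start, p_end = pred
    # First look for a gold entity whose character span matches exactly.
    for g_label, g_start, g_end in golds:
        if (g_start, g_end) == (p_start, p_end):
            return ("exact span, correct label" if g_label == p_label
                    else "exact span, incorrect label")
    # Otherwise look for any gold entity whose span merely overlaps.
    for g_label, g_start, g_end in golds:
        if g_start < p_end and p_start < g_end:
            return ("inexact span, correct label" if g_label == p_label
                    else "inexact span, incorrect label")
    return "no overlap with any gold entity"

Applying such a helper over the development set yields exactly the counts reported in this section.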
The remaining 5 errors are predicted entities whose spans do not overlap with the span of any gold standard entity, i.e. non-chemical "O"-labeled words that are predicted as REACTION_PRODUCT (RP) entities, as shown in the column O in Figure 2.

Fig. 2. Confusion matrix of our system on the development set w.r.t. correct entity boundary prediction. Label abbreviations are presented in Table 1.

Fig. 3. An example of an incorrect NER type prediction.

Fig. 4. An example of an incorrect NER span prediction.

5 Conclusion

In this paper, we have presented our VinAI system for the ChEMU task 1 of named entity recognition in chemical reactions from patents. We use a BiLSTM-CNN-CRF architecture with additional ELMo-based contextualized word embeddings to handle the task. Our system is officially ranked second best in terms of both exact- and relaxed-match F1 scores. In addition, fixing the column-to-brat conversion bug helps our system obtain the highest relaxed-match F1 score in a post-evaluation phase. We believe our system can serve as a strong baseline for future work on chemical NER in chemical reactions from patents.

References

1. Akhondi, S.A., Rey, H., Schwörer, M., Maier, M., Toomey, J.P., Nau, H., Ilchmann, G., Sheehan, M., Irmer, M., Bobach, C., Doornenbal, M.A., Gregory, M., Kors, J.A.: Automatic identification of relevant chemical compounds from patents. Database 2019, baz001 (2019)
2. Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N.F., Peters, M., Schmitz, M., Zettlemoyer, L.: AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv preprint arXiv:1803.07640 (2018)
3. He, J., Nguyen, D.Q., Akhondi, S.A., Druckenbrodt, C., Thorne, C., Hoessel, R., Afzal, Z., Zhai, Z., Fang, B., Yoshikawa, H., Albahem, A., Cavedon, L., Cohn, T., Baldwin, T., Verspoor, K.: Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020) (2020)
4. Jessop, D.M., Adams, S.E., Willighagen, E.L., Hawizy, L., Murray-Rust, P.: OSCAR4: a flexible architecture for chemical text-mining. Journal of Cheminformatics 3(1), 1–12 (2011)
5. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 282–289 (2001)
6. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1064–1074 (2016)
7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
8. Morton, T., Kottmann, J., Baldridge, J., Bierner, G.: OpenNLP: A Java-based NLP toolkit. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (2005)
9. Muresan, S., Petrov, P., Southan, C., Kjellberg, M.J., Kogej, T., Tyrchan, C., Varkonyi, P., Xie, P.H.: Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data. Drug Discovery Today 16(23-24), 1019–1030 (2011)
10. Nguyen, D.Q., Zhai, Z., Yoshikawa, H., Fang, B., Druckenbrodt, C., Thorne, C., Hoessel, R., Akhondi, S.A., Cohn, T., Baldwin, T., Verspoor, K.: ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Proceedings of the 42nd European Conference on Information Retrieval. pp. 572–579 (2020)
11. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237 (2018)
12. Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C., Laino, T.: "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science 9(28), 6091–6098 (2018)
13. Segler, M.H., Preuss, M., Waller, M.P.: Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555(7698), 604–610 (2018)
14. Senger, S., Bartek, L., Papadatos, G., Gaulton, A.: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. Journal of Cheminformatics 7(1), 49 (2015)
15. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. pp. 102–107 (2012)
16. Verspoor, K., Jimeno Yepes, A., Cavedon, L., McIntosh, T., Herten-Crabb, A., Thomas, Z., Plazzer, J.P.: Annotating the biomedical literature for the human variome. Database 2013, bat019 (2013)
17. Yoshikawa, H., Nguyen, D.Q., Zhai, Z., Druckenbrodt, C., Thorne, C., Akhondi, S.A., Baldwin, T., Verspoor, K.: Detecting Chemical Reactions in Patents. In: Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association. pp. 100–110 (2019)
18. Zhai, Z., Nguyen, D.Q., Akhondi, S., Thorne, C., Druckenbrodt, C., Cohn, T., Gregory, M., Verspoor, K.: Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. In: Proceedings of the 18th BioNLP Workshop. pp. 328–338 (2019)