<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BOUN-REX at CLEF-2020 ChEMU Task 2: Evaluating Pretrained Transformers for Event Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hilal Donmez</string-name>
          <email>hilal.donmez@boun.edu.tr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdullatif Koksal</string-name>
          <email>abdullatif.koksal@boun.edu.tr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elif Ozkirimli</string-name>
          <email>elif.ozkirimli@boun.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arzucan Özgür</string-name>
          <email>arzucan.ozgur@boun.edu.tr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Chemical Engineering, Bogazici University</institution>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Engineering, Bogazici University</institution>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pharma International Informatics Data and Analytics Chapter</institution>
          ,
          <addr-line>F. Hoffmann-La Roche AG</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our models and results for CLEF-2020 ChEMU Task 2 [4], event extraction from chemical patent documents. We make use of recent advances in pretrained transformer architectures such as BERT and BioBERT, and we compare several transformers with different settings in order to improve performance. Our best performing model, with the BioBERT transformer architecture and the AdamW optimizer, achieves a 0.7234 exact F1 score on the test dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Chemical information in patents is an essential resource for researchers working
on chemical exploration and reactions. As the number of patents grows rapidly,
Natural Language Processing (NLP) approaches are widely used to extract
chemical information from patents so as to reduce the time and effort spent. Most
previous studies on chemical information extraction focus on chemical named
entity recognition (NER) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] thanks to publicly available annotated corpora. On
the other hand, there is a limited number of studies on chemical event extraction
from patents.
      </p>
      <p>Event extraction from patents involves detecting the event trigger word, the event
trigger type, and the event type. Figure 1 illustrates an example sentence for the event
extraction task in the dataset released by Cheminformatics Elsevier Melbourne
University (ChEMU). In this example, room temperature and 30 minutes are
given as entities with their corresponding types: TEMPERATURE and TIME.
After stirred is detected as a trigger word for both entities, two event types (both
of type ARGM) are determined separately according to the relevant entity type.</p>
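      <p>The two events in the example above can be written, purely illustratively, as (trigger, role, argument, argument type) tuples; this is a sketch of the task output, not the dataset's actual annotation format.</p>

```python
# The Figure 1 example as (trigger, role, argument, argument_type) tuples --
# an illustrative representation only, not the corpus annotation format.
events = [
    ("stirred", "ARGM", "room temperature", "TEMPERATURE"),
    ("stirred", "ARGM", "30 minutes", "TIME"),
]

# Both events share the trigger word but differ in the argument entity.
for trigger, role, argument, arg_type in events:
    print(trigger, role, argument, arg_type)
```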
      <p>
        In this work, we investigated the impact of various transformer architectures
with different parameters on event extraction from patents by conducting
several experiments. We also explored the effects of the pretraining corpus of
transformers by comparing BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and BioBERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In addition, we investigated the
significance of different optimizers, namely Adam, AdamW, and SGD, for the
finetuning of transformers for this task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Determining the semantic relation between entities is an important scientific
problem in various domains such as biomedical text, digital text, and
governmental documents. Recently, deep neural networks have been widely used to
identify the relations between entities. Previous research studies that use deep
learning for relation extraction make use of CNN [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and RNN [20] models by
taking sentence representations with word vectors such as Word2Vec [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
GloVe [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] in order to extract features automatically instead of hand-crafted
features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Recent studies on relation extraction have been based on the
transformer architecture [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] trained on large amounts of unlabeled data to improve
the state-of-the-art on several natural language processing tasks. In [19], a
pretrained transformer model is utilized to extract efficient relation representations
from text.
      </p>
      <p>
        In event extraction, earlier neural network models enhanced CNN [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
RNN [18] with different kinds of word representations to determine the locations
and types of trigger words. In addition, structured information benefiting from
dependency trees [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and knowledge bases [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is exploited by neural networks
to improve event extraction performance. Lately, pretrained transformer based
models have gained popularity for event extraction. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], trigger and
argument extractor models obtain feature representations using BERT, a pretrained
transformer model.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology and Data</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>We use the dataset released for the ChEMU tasks on information extraction
from chemical patents. The dataset contains chemical patent documents with
annotation files for training, development, and test sets. Entities with their types
and relations between these entities are included in the annotation files. There
are 10 different types of entity annotations for the Event Extraction Task in
ChEMU. Table 1 shows the annotated types of entities in the dataset.</p>
        <p>The event extraction problem focuses on event trigger word detection,
trigger type detection, and event type prediction. Event trigger words whose types
are REACTION_STEP or WORKUP are identified, and the chemical entity
arguments of the events are determined. The relation between an argument and
a trigger word is labeled with a semantic argument role label, which is Arg1 or
ArgM. The relation between a trigger word and a temperature, time, or yield
entity is labeled as ArgM, whereas the relation between a trigger word and an
entity having one of the other entity types is labeled as Arg1. Table 2 contains
the statistics of the ChEMU dataset. (We were able to use 713 out of the 900
documents in the train set due to a problem during the downloading process.)</p>
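        <p>The role-labeling rule described above reduces to a small lookup. A minimal sketch follows; the exact yield entity type names (YIELD_PERCENT, YIELD_OTHER) are assumptions about the dataset's annotation labels.</p>

```python
# Sketch of the argument role rule: relations between a trigger word and a
# temperature, time, or yield entity are ArgM; all other entity types give
# Arg1. The YIELD_* label names are assumptions, not taken from this paper.
ARGM_ENTITY_TYPES = {"TEMPERATURE", "TIME", "YIELD_PERCENT", "YIELD_OTHER"}

def argument_role(entity_type: str) -> str:
    """Return the semantic argument role label for a trigger-entity relation."""
    return "ArgM" if entity_type in ARGM_ENTITY_TYPES else "Arg1"

print(argument_role("TEMPERATURE"))  # ArgM
print(argument_role("SOLVENT"))      # Arg1
```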
      </sec>
      <sec id="sec-3-2">
        <title>Preprocessing</title>
        <p>
          Our preprocessing steps involve sentence splitting and adding entity markers.
For simplicity, we consider the relations that are present in single sentences, and
we split the documents into sentences via the GENIA Sentence Splitter [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. For
each entity in a sentence, we construct sentence-entity pairs and predict events
and trigger words from these pairs. However, 121 entities in our training set
have relations with more than one trigger word. We ignore such entities for
event trigger word detection.
        </p>
        <p>
          We need to explicitly identify an entity to find the corresponding relation
and trigger word in a sentence. Therefore, we add specific markers, &lt;E&gt;
and &lt;/E&gt;, before and after the entities so that the model can identify them,
following the discussion in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Moreover, we create a different representation
for each entity by applying the marker method to sentences containing more
than one entity. Hence, the sentence representation is distinct for each entity
in the same sentence. The following examples show the two different
representations for hexanes and silica, which are located in the same sentence.
        </p>
        <p>- The solvent was removed in vacuo, and the crude product was
purified by flash chromatography (silica, 100% &lt;E&gt; hexanes &lt;/E&gt; to
9:1 hexanes/EtOAc) to give a pale-yellow viscous oil (3.83 g, 86%).
- The solvent was removed in vacuo, and the crude product was
purified by flash chromatography (&lt;E&gt; silica &lt;/E&gt;, 100% hexanes to
9:1 hexanes/EtOAc) to give a pale-yellow viscous oil (3.83 g, 86%).</p>
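        <p>The marker insertion step can be sketched as a small string operation; this is a hypothetical helper, and in practice the character offsets of each entity come from the annotation files.</p>

```python
def add_entity_markers(sentence: str, start: int, end: int) -> str:
    """Wrap the character span [start, end) of the target entity with the
    <E> ... </E> markers, yielding one sentence representation per entity."""
    return sentence[:start] + "<E> " + sentence[start:end] + " </E>" + sentence[end:]

s = "the crude product was purified by flash chromatography (silica, 100% hexanes)"
start = s.index("silica")
print(add_entity_markers(s, start, start + len("silica")))
```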
      </sec>
      <sec id="sec-3-3">
        <title>Model</title>
        <p>Problem Definition: For a given sentence S with an entity et of type t, the
objectives are to find the trigger word in S, its type (including None), and the
relation between the trigger word and et from a set of predefined event types. As
event types are determined according to entity types, we do not build a separate
model for event type detection. Instead, we focus on trigger type and trigger word
detection. If there is a trigger word for an entity in a given sentence, the event type
is found by simple rules.</p>
        <p>Two objectives are defined to address this problem. Our base model is a
transformer-based pretrained architecture, which extracts a fixed-length
sentence representation and token representations from an input sentence with
entity markers indicating the entity's location. The fixed-length sentence
representation is used to detect the type of the trigger word in the sentence with
a given annotated entity. If there is a trigger word in the sentence, the event
type is determined by the type of the given entity via a simple lookup table
shown in Table 3.</p>
        <p>
          We propose an approach similar to question answering methods [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to find
the span of the trigger word. Our trigger word span model predicts probabilities
of start and end tags with the token representations which are produced by the
transformer-based pretrained architecture. The trigger word span is the sequence
between the tokens with the highest start and end probabilities.
        </p>
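        <p>The span selection step can be sketched as follows. This is a framework-free sketch with toy probabilities standing in for the model's start and end outputs; it is not our training code.</p>

```python
def extract_trigger_span(tokens, start_probs, end_probs):
    """Select the trigger word span as the token sequence between the
    positions with the highest start and end probabilities."""
    start = max(range(len(tokens)), key=lambda i: start_probs[i])
    end = max(range(len(tokens)), key=lambda i: end_probs[i])
    return tokens[start:end + 1]

tokens = ["The", "mixture", "was", "stirred", "at", "room", "temperature"]
start_probs = [0.01, 0.02, 0.05, 0.85, 0.03, 0.02, 0.02]  # peak at "stirred"
end_probs   = [0.01, 0.02, 0.04, 0.86, 0.03, 0.02, 0.02]
print(extract_trigger_span(tokens, start_probs, end_probs))  # ['stirred']
```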
        <p>Our proposed architecture is jointly trained, as shown in Figure 2. Different
pretrained transformers with several optimizers, learning rates, and weight
decays are evaluated on the development set by exact F1 score. The considered
settings are summarized below. The configuration of our best model is shown
in bold.</p>
        <p>- Transformer Architectures: <bold>BioBERT</bold> (https://huggingface.co/monologg/biobert_v1.1_pubmed), BERTLarge (https://github.com/google-research/bert), BioBERTLarge (https://huggingface.co/trisongz/biobert_large_cased)
- Optimizer: <bold>AdamW</bold>, Adam, SGD
- Learning Rate: 1e-5, 1e-6, 1e-4, 1e-3
- Weight Decay: 0, 0.1, 0.01</p>
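        <p>The property that separates AdamW from Adam in this comparison is decoupled weight decay [10]. A minimal scalar sketch of one AdamW update follows; the hyperparameter defaults are illustrative assumptions, not the values used in our experiments.</p>

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter w. The weight-decay term
    lr * weight_decay * w is applied directly to the weight, decoupled
    from the adaptive gradient step (unlike Adam's L2 regularization)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(w)  # slightly below 1.0
```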
        <p>[Figure 2: joint model architecture. The input representation ([CLS], w1 ... &lt;E&gt; wi &lt;/E&gt; ... wn, [SEP]) feeds the pretrained transformer; the sentence representation drives trigger type detection (e.g., Workup) and the token representations produce the start and end distributions for the trigger word span.]</p>
        <p>
          As shown in Table 4, the best performing pretrained transformer model is
BioBERT with AdamW optimizer, even though the complexities of BERTLarge
and BioBERTLarge are higher than BioBERT. BERTLarge and BioBERTLarge
have 24 layers, 16 heads, and 340 million parameters while BioBERT has 12
layers, 12 heads, and 110 million parameters. Besides, while BERTLarge is
pretrained on English Wikipedia and Book Corpus, BioBERT and BioBERTLarge
are pretrained on additional resources, i.e., PubMed abstracts and PMC full-text
articles. Table 4 shows that BioBERT and BioBERTLarge perform better than
BERTLarge. Our results suggest that the domain similarity between chemical
patent documents and the pretraining corpus of BioBERT and BioBERTLarge
leads to better performance. In [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], it is shown that the generalization
capability of the AdamW optimizer is better than that of the Adam and SGD
optimizers, and our results support this claim.
        </p>
        <p>There are two different objectives, namely trigger type detection and trigger
word span detection, in our final architecture. Table 5 reports the results of
the two objectives separately on the development set. The trigger type detection
model achieves a 0.9848 F1 score, whereas the accuracy of our trigger word span
model is 0.9524.</p>
        <p>Our final model's performance is summarized in Table 6 for all objectives:
trigger word, trigger type, and event type detection. It achieves 0.7407 and
0.7234 in the main metric (exact F1) on the development and test sets,
respectively.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we introduce a transformer based approach for event extraction in
chemical patent documents. We compare several pretrained transformer models
with different settings and show that BioBERT's performance with the AdamW
optimizer is better than both BERTLarge and BioBERTLarge for this task. Finally,
we report our best model's performance separately on the trigger type and trigger
word span detection tasks. Our best model, BioBERT, achieves 0.7234 exact F1
score on the test set.</p>
      <p>As future work, we plan to extend our study to enable the detection of
multiple trigger words in a sentence by using a sequence labeling setup with
the BIO encoding. Thus, we will be able to consider entities having relations
with more than one trigger word. In addition, we will design a two-stage model
that first detects the trigger word span and then classifies the trigger type, as
an alternative to our jointly trained model.</p>
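      <p>The BIO encoding mentioned above could look like the following sketch, which is illustrative rather than an implemented model; the example spans and their labels are assumptions for demonstration.</p>

```python
def spans_to_bio(tokens, trigger_spans):
    """Encode trigger word spans as BIO tags, which allows several
    triggers (and therefore several related entities) per sentence."""
    tags = ["O"] * len(tokens)
    for start, end, label in trigger_spans:  # inclusive token indices
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags

tokens = ["The", "mixture", "was", "stirred", "and", "then", "filtered"]
spans = [(3, 3, "REACTION_STEP"), (6, 6, "WORKUP")]  # illustrative labels
print(spans_to_bio(tokens, spans))
# ['O', 'O', 'O', 'B-REACTION_STEP', 'O', 'O', 'B-WORKUP']
```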
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>GEBIP Award of the Turkish Academy of Sciences (to A.O.) is gratefully
acknowledged.</p>
      <p>18. Zhang, W., Ding, X., Liu, T.: Learning target-dependent sentence representations
for Chinese event detection. In: China Conference on Information Retrieval. pp.
251–262. Springer (2018)
19. Zhao, Y., Wan, H., Gao, J., Lin, Y.: Improving relation classification by entity pair
graph. In: Asian Conference on Machine Learning. pp. 1156–1171 (2019)
20. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based
bidirectional long short-term memory networks for relation classification. In: Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers). pp. 207–212 (2016)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Baldini</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>FitzGerald</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwiatkowski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Matching the blanks: Distributional similarity for relation learning</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <volume>2895</volume>
          –
          <fpage>2905</fpage>
          . Association for Computational Linguistics, Florence,
          <source>Italy (Jul</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>P19</fpage>
          -1279, https://www.aclweb.org/ anthology/P19-1279
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <volume>4171</volume>
          –
          <fpage>4186</fpage>
          . Association for Computational Linguistics, Minneapolis,
          <source>Minnesota (Jun</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N19</fpage>
          -1423, https://www. aclweb.org/anthology/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dodge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilharco</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajishirzi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Finetuning pretrained language models: Weight initializations, data orders, and early stopping</article-title>
          . arXiv preprint arXiv:2002.06305 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhondi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Druckenbrodt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoessel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Afzal</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshikawa</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albahem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cavedon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verspoor</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of ChEMU 2020: named entity recognition and event extraction of chemical reactions from patents</article-title>
          . In: Arampatzis,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Tsikrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Neveol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Eleventh International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), vol.
          <volume>12260</volume>
          . Lecture Notes in Computer Science (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kambhatla</surname>
          </string-name>
          , N.:
          <article-title>Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction</article-title>
          .
          <source>In: Proceedings of the ACL Interactive Poster and Demonstration Sessions</source>
          . pp.
          <volume>178</volume>
          –
          <issue>181</issue>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leitner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vazquez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salgado</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leaman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          , et al.:
          <article-title>The CHEMDNER corpus of chemicals and drugs and its annotation principles</article-title>
          .
          <source>Journal of cheminformatics 7(1)</source>
          ,
          <volume>1</volume>
          –
          <fpage>17</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
          </string-name>
          , J.:
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <volume>1234</volume>
          –
          <fpage>1240</fpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            , H., Han,
            <given-names>J</given-names>
          </string-name>
          .:
          <article-title>Biomedical event extraction based on knowledge-driven tree-LSTM</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <volume>1421</volume>
          –
          <issue>1430</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Leveraging framenet to improve automatic event detection</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <volume>2134</volume>
          –
          <issue>2143</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Loshchilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Decoupled weight decay regularization</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Event detection and domain adaptation with convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>
          . pp.
          <fpage>365</fpage>
          –
          <lpage>371</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          ), http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sætre</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshida</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yakushiji</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miyao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsubayashi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohta</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Akane system: protein-protein interaction pairs in biocreative2 challenge, ppi-ips subtask</article-title>
          .
          <source>In: Proceedings of the second biocreative challenge workshop</source>
          . vol.
          <volume>209</volume>
          , p.
          <fpage>212</fpage>
          .
          Madrid
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>arXiv preprint arXiv:1706.03762</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploring pre-trained language models for event extraction and generation</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>5284</fpage>
          –
          <lpage>5294</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Relation classification via convolutional deep neural network</article-title>
          .
          <source>In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</source>
          . pp.
          <fpage>2335</fpage>
          –
          <lpage>2344</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>