<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLPatVCU CLEF 2020 ChEMU Shared Task System Description</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Darshini Mahendran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabrielle Gurdin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nastassja Lewinski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christina Tang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bridget T. McInnes</string-name>
          <email>btmcinnesg@vcu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Virginia Commonwealth University</institution>
          ,
          <addr-line>Richmond VA 23220</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our team's participation in Tracks 1 and 2 of the Conference and Labs of the Evaluation Forum (CLEF 2020) ChEMU challenge, organized by Cheminformatics Elsevier Melbourne University, for extracting information about chemical reactions from patents. We discuss our systems: MedaCy, a Python-based supervised multi-class entity recognition system, and RelEx, a Python-based relation extraction system that includes rule-based and supervised learning pipelines. Our best model for Task 1 obtained an overall relaxed precision of 0.95 and exact precision of 0.87; relaxed recall of 0.99 and exact recall of 0.86; and relaxed F1 score of 0.97 and exact F1 score of 0.87. Our best model for Task 2 obtained an overall precision of 0.80; recall of 0.54; and F1 score of 0.65.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition (NER)</kwd>
        <kwd>Relation Extraction (RE)</kwd>
        <kwd>Event Extraction (EE)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Chemical patents are a primary source of information about novel chemicals
and chemical reactions. With the increasing volume of such patents, the
dissemination of information about these chemicals and chemical reactions has become
even more labor- and time-intensive. This information can be used to discover
new chemicals and synthetic pathways [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Therefore, informatics tools for
automatically extracting information from these documents are more important
than ever.
      </p>
      <p>
        The process of extracting relevant information from chemical patents has
been referred to as chemical reaction detection [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and two of the main steps in
this process are identifying the different parts of a chemical reaction within these
documents and then identifying the relationships between them. This can be
accomplished with Named Entity Recognition (NER), the automatic labeling of
spans within text with specific labels, and Event Extraction (EE), the automatic
classification and linking of entities based on their relationships to each other.
      </p>
      <p>
        The CLEF 2020 ChEMU [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] Task 1 aims to create systems to perform NER
over chemical patents as the first step in chemical reaction detection. Specifically,
the goal of this task is to automatically identify chemical compounds based on
the role they play in a reaction, as well as other relevant information such as
yield and temperature. The CLEF 2020 ChEMU Task 2 aims to create systems
to perform EE over the entities to identify the individual steps in the reaction.
      </p>
      <p>In this paper, we describe our participation in the CLEF 2020 ChEMU Task 1
and Task 2 Challenge. For this challenge, we used our Python framework MedaCy
1 to automatically identify the experimental parameters associated with the
reaction, including the trigger words used to link the parameters; and RelEx 2 to
automatically link the trigger words with the experimental parameters to
provide the sequence of steps within the reaction. MedaCy contains a number of
supervised multi-label sequence classification algorithms for NER. RelEx
contains rule-based and supervised learning-based algorithms to identify relations
between entities. Our best models for Task 1 obtained an overall relaxed
precision of 0.95 and exact precision of 0.87; relaxed recall of 0.99 and exact recall of
0.86; and relaxed F1 score of 0.97 and exact F1 score of 0.87. Our best model for
Task 2 obtained an overall precision of 0.80; recall of 0.54; and F1 score of 0.65.
1 https://github.com/NLPatVCU/medaCy/
2 https://github.com/NLPatVCU/RelEx/tree/CLEF 2020</p>
      <p>The dataset includes 10 different entity labels, as shown in Table 1. The ARG1
event label corresponds to relations between a trigger word (REACTION STEP,
WORKUP) and chemical compound entities. The ARGM event label corresponds to
relations between a trigger word and temperature, time, or yield entities. Table 2
shows the event statistics of the training dataset.</p>
      <p>
To identify the experimental parameters and triggers from the data, we use
MedaCy's bidirectional Long Short-Term Memory (LSTM) network with a
Conditional Random Field (CRF) output layer implemented in PyTorch [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. LSTMs
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are a type of recurrent neural network. They take the current input
example as well as what they have seen in the past as their input. Hence, they have
two sources of input: their current state and their past states. This allows them
to connect previous observations, such as words in a sentence, and learn
dependencies between these words over arbitrarily long distances. They incorporate
gating functionality to identify what information should be passed to the next
component and what should not, allowing only relevant information
to be passed on. For bi-directional LSTMs (biLSTMs), data are processed in
both directions with two separate hidden layers, which are then fed forward into
the same output layer. This allows the system to exploit context in both
directions. A linear-chain CRF is used to assign the final class probability. CRFs are
sequence learning models that incorporate the interdependence between
labels into model induction and prediction. Therefore, using a CRF output layer
allows the model to use the preceding label predictions to inform which labels are
most likely to follow or to occur close together.
      </p>
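      <p>To illustrate the CRF decoding step described above, the following pure-Python Viterbi sketch shows how per-token emission scores from the biLSTM are combined with label-transition scores to pick the highest-scoring label sequence. The scores, labels, and function name are toy examples of our own, not MedaCy's actual implementation.</p>

```python
def viterbi_decode(emissions, transitions, labels):
    """Find the highest-scoring label sequence for one sentence.

    emissions[t][j]   = score of label j at token t (from the biLSTM)
    transitions[i][j] = score of moving from label i to label j
    """
    n_labels = len(labels)
    # score[j] = best score of any path ending in label j at the current token
    score = list(emissions[0])
    back = []  # back[t][j] = best previous label for label j at token t
    for t in range(1, len(emissions)):
        prev = score
        score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: prev[i] + transitions[i][j])
            score.append(prev[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        back.append(pointers)
    # Trace the best path backwards from the final token.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]

# Toy example: transitions discourage switching between TEMPERATURE and TIME.
labels = ["O", "TEMPERATURE", "TIME"]
transitions = [[0.0, 0.0, 0.0],
               [0.0, 1.0, -2.0],
               [0.0, -2.0, 1.0]]
emissions = [[0.1, 2.0, 1.9],   # token "25"
             [0.1, 2.0, 1.9],   # token "C"
             [3.0, 0.0, 0.0]]   # token ","
print(viterbi_decode(emissions, transitions, labels))
```

      <p>Because the transition scores penalize label switches, the second token keeps the TEMPERATURE label even though its TIME emission score is nearly as high; this is exactly the "labels inform each other" behavior the CRF layer adds.</p>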
      <p>
        The input to our biLSTM+CRF model in this work is pre-trained word
embeddings [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in combination with character embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These embeddings
are concatenated and then passed through the network.
      </p>
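      <p>A minimal sketch of this input construction follows. The dimensions, vectors, and the stand-in for the character biLSTM are illustrative inventions of ours, not MedaCy's code; the point is only that each token's vector is a word embedding concatenated with a character-derived embedding, so out-of-vocabulary tokens still receive a useful input.</p>

```python
# Each token's input vector is its pre-trained word embedding
# concatenated with a character-derived embedding (illustrative sizes).
word_emb = {"mice": [0.2, -0.1, 0.7]}          # e.g. word2vec, dim 3

def char_emb(token):
    # Stand-in for the character biLSTM: a fixed-size vector derived
    # from the characters, so even OOV tokens get a non-zero signal.
    return [len(token) / 10.0, ord(token[0]) / 1000.0]

def token_vector(token):
    wv = word_emb.get(token, [0.0, 0.0, 0.0])  # zeros for OOV words
    return wv + char_emb(token)                # concatenation, dim 5

print(token_vector("mice"))  # [0.2, -0.1, 0.7, 0.4, 0.109]
```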
      <p>
        The word2vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] embeddings are derived from a neural network that learns
a representation of a word-word co-occurrence matrix. The character
embeddings are learned using a biLSTM and concatenated onto the word2vec
embeddings. Fig 1 shows a simple example for the term mice. This network is
especially valuable for providing input in the case of out-of-vocabulary words.
In the case of chemical patents, many tokens are long chemical names that do
not appear in the dataset used to train the word embeddings, such as the
reaction product
3-Isobutyl-5-methyl-1-(oxetan-2-ylmethyl)-6-[(2-oxoimidazolidin-1yl)methyl]thieno[2,3-d]pyrimidine-2,4(1H,3H)-dione.
      </p>
      <p>
To identify the trigger words, we use our NER system MedaCy as described
above. To identify the chemical arguments between the trigger words and the
entities, we use RelEx, a Python-based relation extraction framework developed
to identify relations between two entities. The framework contains two main
components: 1) a rule-based method and 2) a Convolutional Neural Network
(CNN)-based method. In this section, we provide a brief overview of each component.
      </p>
      <p>
Rule-based Method. RelEx's rule-based method uses the co-location
information of the trigger words to determine, with respect to an entity, whether
a word refers to the trigger word or not. We use a breadth-first search
algorithm to find the closest occurrence of the trigger word on either side of the
entity and all the closest occurrences of the trigger words within a sentence. For
each entity in the dataset, we traverse both sides until the closest occurrence
of the trigger word is found, using the provided span values of the entities. We
apply different traversal techniques and determine the best among them:
traverse left-only, traverse right-only, traverse left-first-then-right, and vice versa.
In this work, we use left-only traversal, where we traverse to the left side of the
entity mention to find the closest occurrence of the trigger words.
      </p>
      <p>
CNN-based Method. RelEx's CNN-based method automatically extracts and
classifies the events. CNNs are a form of deep neural network and mostly consist
of four main layers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: embedding, convolution, pooling, and feed-forward layers.
CNNs allow word embeddings to be trained on the input text itself or to use
pre-trained word vectors obtained from an external resource. The convolution
layer, which acts as a filter, learns using the backpropagation algorithm and
extracts features from the input. The max-pooling layer then uses position
information to extract the most significant features from the output of the
convolution filter. Finally, the feed-forward layer uses a softmax classifier that
performs the classification.
      </p>
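      <p>The left-only traversal of the rule-based method can be sketched as follows, assuming entities and trigger words are given as character-offset spans. The function name and toy offsets are our own illustration, not RelEx's actual code.</p>

```python
def closest_trigger_left(entity_span, triggers):
    """Left-only traversal: return the trigger whose span ends closest
    to the left of the entity, or None if no trigger precedes it.

    entity_span: (start, end) character offsets of the entity mention
    triggers:    list of (start, end, label) spans for trigger words
    """
    entity_start = entity_span[0]
    left = [t for t in triggers if t[1] <= entity_start]
    if not left:
        return None
    # The closest occurrence is the one ending nearest the entity.
    return max(left, key=lambda t: t[1])

# Toy spans: a REACTION_STEP trigger at the sentence start, a WORKUP
# trigger just before the entity of interest.
triggers = [(0, 7, "REACTION_STEP"), (24, 36, "WORKUP")]
entity = (40, 47)  # e.g. an OTHER_COMPOUND mention
print(closest_trigger_left(entity, triggers))  # (24, 36, 'WORKUP')
```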
      <p>In this work, for each trigger word-entity pair we perform a binary
classification to identify whether there is a relation between the trigger word and
the entity. First, we identify and extract the sentence where a trigger
word-entity pair lies and, based on where the text spans are located in the
sentence, we divide the sentence into segments as follows:
- preceding: tokenized words before the first concept
- concept 1: tokenized words in the first concept
- middle: tokenized words between the two concepts
- concept 2: tokenized words in the second concept
- succeeding: tokenized words after the second concept</p>
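      <p>The five-segment split can be sketched as follows, assuming character-offset concept spans and whitespace tokenization; the sentence and offsets are a toy example of ours, not taken from the dataset.</p>

```python
def split_into_segments(sentence, span1, span2):
    """Split a sentence into the five segments used as CNN input.

    span1, span2: (start, end) character offsets of the two concepts,
    with span1 occurring before span2.
    """
    return {
        "preceding":  sentence[:span1[0]].split(),
        "concept1":   sentence[span1[0]:span1[1]].split(),
        "middle":     sentence[span1[1]:span2[0]].split(),
        "concept2":   sentence[span2[0]:span2[1]].split(),
        "succeeding": sentence[span2[1]:].split(),
    }

sentence = "The mixture was stirred at 25 C for 2 hours"
trigger = (16, 23)   # "stirred" (a REACTION_STEP trigger)
entity = (27, 31)    # "25 C" (a TEMPERATURE entity)
segments = split_into_segments(sentence, trigger, entity)
print(segments["middle"])  # ['at']
```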
      <p>Figure 2 shows an abstract view of the construction of the CNN-based model.
A segment is represented by a matrix of k x N, where k is the dimension of the
word embeddings and N is the number of words in the segment. In this work,
we use ChemPatent pre-trained word embeddings. We construct separate
convolution units for each segment and concatenate their outputs before the
fixed-length vector is fed to the dense layer that performs the classification. Each
convolution unit applies a sliding window that processes the segment and feeds the
output to the max-pooling layer to extract important features independent of their
location. The output features of the max-pooling layer of each segment are then
flattened and concatenated into a vector before being fed into the fully connected
feed-forward layer. The vector is finally fed into a softmax layer to perform the
binary classification of whether the relationship exists or not.</p>
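      <p>The per-segment convolution and max-pooling can be sketched in miniature as follows, where one number stands in for each word's k-dimensional embedding and the filter values are arbitrary. This illustrates the wiring (one convolution unit per segment, max-pooled, then concatenated), not the actual Keras model.</p>

```python
def conv1d(seq, filt):
    """Valid 1-D convolution (cross-correlation) over a list of numbers."""
    w = len(filt)
    return [sum(f * x for f, x in zip(filt, seq[i:i + w]))
            for i in range(len(seq) - w + 1)]

def segment_feature(seq, filt):
    """One convolution unit: slide the filter over the segment, then max-pool."""
    return max(conv1d(seq, filt))

# One number per word stands in for that word's embedding vector.
segments = {"preceding": [1, 3], "middle": [9, 2, 4], "succeeding": [0, 5]}
filt = [2, -1]
# Separate convolution unit per segment; pooled features are concatenated
# into the fixed-length vector fed to the dense layer.
features = [segment_feature(seq, filt) for seq in segments.values()]
print(features)  # [-1, 16, -5]
```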
    </sec>
    <sec id="sec-2">
      <title>Experimental Details</title>
      <p>
        Word Embeddings. We explore two pre-trained word embeddings: 1) ChemPatent
embeddings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] trained over a collection of 84,076 full patent documents (1B
tokens); and 2) WikiPubmed embeddings [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] in our methods.
      </p>
      <p>
        MedaCy. We used PyTorch [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for the implementation of the BiLSTM+CRF
model. Models were trained for 40 epochs and optimized using stochastic
gradient descent. A window size of 0 generated the best results. Tokenization was
conducted using the SpaCy tokenizer. The labels are strictly the entity types.
RelEx. We used Keras [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the implementation of the CNN architecture.
We experimented with different sliding window sizes, filter sizes, and loss functions
for fine-tuning; in this work, small filter sizes generated the best results. We
applied the dropout technique on the output of the convolution layer to
regularize the model. We used the Adam and RMSprop optimizers to
minimize our loss function. We trained the models for 5-10 epochs to avoid
over-fitting.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>For Tasks 1 and 2, we report the precision, recall, and F1 scores. Precision is
the ratio between correctly predicted mentions over the total set of predicted
mentions for a specific entity; recall is the ratio of correctly predicted mentions
over the actual number of mentions, and F1 is the harmonic mean between
precision and recall. For Task 1, we report both the exact and relaxed results for
each entity category. In exact evaluation, two annotations are equal only if they
have the same tag with exactly matching spans. With the relaxed evaluation,
two annotations are equal if they share the same tag and their spans overlap
with each other.</p>
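        <p>The two matching criteria can be sketched as follows, with annotations represented as (start, end, tag) character spans. This is a simplified illustration of the matching rules, not the official evaluation script.</p>

```python
def exact_match(gold, pred):
    """Exact: same tag and identical character spans."""
    return gold == pred

def relaxed_match(gold, pred):
    """Relaxed: same tag and overlapping spans."""
    (gs, ge, gtag), (ps, pe, ptag) = gold, pred
    return gtag == ptag and gs < pe and ps < ge

gold = (10, 15, "TEMPERATURE")   # e.g. "25 C", including the number
pred = (13, 15, "TEMPERATURE")   # model labeled only the "C"
print(exact_match(gold, pred))    # False
print(relaxed_match(gold, pred))  # True
```

        <p>A prediction covering only part of an entity span, as in this example, therefore counts as correct under relaxed evaluation but as an error under exact evaluation.</p>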
      <sec id="sec-3-1">
        <title>Results and Discussion</title>
        <p>In this section, we discuss the results for Tasks 1 and 2.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Task 1: Named Entity Recognition</title>
      <p>
        Results. Tables 3 - 5 show the exact and relaxed precision, recall, and F1
scores obtained over the testing set for identifying the named entities in each of
our three runs. The run 1 model was trained over the training data using the
biLSTM+CRF with the ChemPatent embeddings; the run 2 model was trained over
the training data using the biLSTM+CRF with the WikiPubmed embeddings;
and run 3 model was trained over the training and development data combined
with the biLSTM+CRF using the WikiPubmed embeddings. Table 6 shows the
baseline results using the CRF-based NER system BANNER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provided by the
organizers and the overall results of each of our runs.
Overall, the biLSTM+CRF model trained using patent embeddings returned
the best results, obtaining a 96.78% system-wide relaxed F1 score. This model
performed better than baseline for all entity labels except EXAMPLE LABEL,
for which it performed almost identically. This model's performance is likely
due to the domain-relevant information contained within the embeddings. The
best performance for exact evaluation resulted from the model trained over a
combination of the training and development sets. However, this model's overall
performance was worse than the baseline model. Still, we believe this model's
better performance compared to the other models may be due to the increase of
volume of data used to train by the addition of the development set.
      </p>
      <p>Although the exact results for the models performed slightly worse than
baseline, each of the models performed better on the relaxed results, with the
model trained over patent embeddings performing best. This discrepancy may be
due to the way that MedaCy handles entity classification. Within MedaCy, each
individual token is given its own label ('O' for unlabelled entities), so for entities
with spans longer than one token, the entity may have been only partially labelled.
For instance, in many cases of the TEMPERATURE label, MedaCy labeled 'C'
or ' C', excluding the number preceding the temperature symbol. This may also
account for why each model performed poorly for the TEMPERATURE label
when evaluating in exact mode, but performed well when evaluating in relaxed
mode.</p>
      <p>Error Analysis. Confusion matrices for the three runs over the testing dataset
are shown in Figures 3 - 5. Rows in the matrix represent annotated entities
and columns represent predicted entities. For instance, in Figure 3, YIELD OTHER
(Y.O.) was misidentified as YIELD PERCENT (Y.P.) 28 times. Table 7 shows
the acronym of each of the labels used in the confusion matrices. The colors
in the matrix indicate the density of the entities and the system annotations.
The bottom right corner of each matrix is darker because of the large number
of OTHER COMPOUND (O.C.) entities in the dataset.</p>
      <p>The majority of mislabeling occurred when more specific entity labels, such
as STARTING MATERIAL (S.M.), REAGENT CATALYST (R.C.), or
REACTION PRODUCT (R.P.), were predicted to be OTHER COMPOUND (O.C.).
This may be because the models were able to predict that certain spans
contained chemical names, but were too general and unable to predict the specific
label. Additionally, spans annotated as OTHER COMPOUND (O.C.) were
consistently predicted to be more specific types of compounds. It seems that while
the models are able to predict which spans contain chemical compounds, they
are less able to distinguish between the types of compounds.</p>
    </sec>
    <sec id="sec-5">
      <title>Task 2: Event Extraction</title>
      <p>Results. Tables 8 - 10 show the exact match precision, recall, and F1 scores
obtained over the testing set for each of our three runs. Run 1 used RelEx's
CNN-based system trained over the ChemPatent embeddings, with the trigger
words identified using medaCy's biLSTM+CRF trained over the ChemPatent
embeddings. Run 2 used RelEx's rule-based system with the trigger words
identified using medaCy's biLSTM+CRF trained with ChemPatent embeddings.
Run 3 used our rule-based system with the trigger words identified using medaCy's
biLSTM+CRF trained with WikiPubmed embeddings. Table 11 shows the
comparison with the co-occurrence baseline provided by the organizers of the ChEMU
challenge and the overall results from each of our runs.</p>
      <p>The overall results show that all three runs obtain a higher precision and
F1 score than the baseline, but not a higher recall. The system results show that the
CNN-based (Run 1) model obtains a higher overall F1 score than both the
rule-based (Run 2 &amp; 3) models. When training with the CNN, the overall precision of
the predictions is high but the recall is low; this shows that the CNN failed to
classify all instances but classified most of the predicted instances
correctly. Also, we can see the performance of each event class (trigger
word-entity pair) in Run 1 is proportional to the number of instances in the
training set. For example, the event classes REACTION STEP-REAGENT CATALYST
and REACTION STEP-STARTING MATERIAL have more training instances
and obtain a high F1 score, whereas the event classes WORKUP-SOLVENT
and WORKUP-STARTING MATERIAL have very few instances and obtain
an F1 score of zero.</p>
      <p>The rule-based models (Run 2 &amp; 3) obtain comparatively high recall and
low precision. The rule-based method predicts all the closest occurrences of
the trigger words of the entity compounds in the traversal area; however, many
predictions are false positives. Since the number of instances in the training set
does not affect the rule-based methods, the event classes that
have few instances perform better. For example, the event classes
WORKUP-TIME and REACTION STEP-OTHER COMPOUND obtained an F1 score of zero
with the CNN-based model but performed better with the rule-based models,
obtaining F1 scores of 0.43 and 0.88, respectively.</p>
      <p>Table 12 shows the arithmetic mean and weighted arithmetic mean of the
precision, recall, and F1 score for both trigger word classes for each run. Bold
terms indicate the best performance for each trigger word. We can see the
CNN-based method (Run 1) performs well with the REACTION STEP classes and
poorly with the WORKUP classes. This is because most of the REACTION STEP
classes have more instances for the CNN to train on, but most of the WORKUP
classes have few instances. This is the same reason the rule-based methods (Run
2 &amp; 3) perform better with those classes. The weighted arithmetic mean results
contradict the arithmetic mean results, as we can see a notable difference in
the F1 score when comparing the classes of REACTION STEP and WORKUP.
The WORKUP event class obtains a better performance due to the significant
imbalance between the individual event classes. The weighted arithmetic mean
allocates more weight to the classes that have more instances and vice versa, and
we see an improvement in the performance of both classes.</p>
      <p>Error Analysis. Tables 13 and 14 show a detailed error analysis of the
CNN-based (Run 1) and the rule-based (Run 2) methods, respectively, where the trigger
words are trained with ChemPatent embeddings. Here we report the number of
true positives (tp), false positives (fp), and false negatives (fn), as well as "fpm"
and "fnm", two metrics that represent the number of false positives and false
negatives for which the corresponding entities are missing.</p>
      <p>The results are consistent with the previous observations from Tables
8, 9, and 10. We can see the REACTION STEP classes performed better than the
WORKUP classes. It is safe to say that class imbalance plays a significant role
in the mis-annotation of the instances. The results also show that the
rule-based model significantly over-annotates, given the number of false positives. For
example, the rule-based model (Run 2) identified 379 instances of the
WORKUP-REACTION PRODUCT event class with only four being true positives.</p>
      <sec id="sec-5-1">
        <title>Conclusion</title>
        <p>We trained three biLSTM+CRF models over different pre-trained word
embeddings, as well as differently sized datasets. Results show that while these models
did not outperform the baseline model when evaluating exact span matches, the
models outperformed the baseline when evaluating in relaxed mode. The model
trained using word embeddings trained over chemical patents performed best
when evaluating in relaxed mode, while the model trained using biomedical word
embeddings and a combination of the training and development datasets
performed best when evaluated on exact span matches. Errors primarily occurred
because of issues with the models distinguishing between different entity labels,
such as mislabeling entities annotated as OTHER COMPOUND with
more specific labels, like REACTION PRODUCT or STARTING MATERIAL.
Additionally, the way that MedaCy predicts entity labels may have contributed
to errors with labeling entity spans fully. Future work will focus on better
distinguishing between different types of chemical compounds, as well as exploring
approaches based on language models.</p>
        <p>We used one CNN-based model and two rule-based models to extract events,
and according to the results, all three models outperformed the baseline model.
Results show that the CNN-based method outperforms the rule-based
methods, especially with the REACTION STEP classes, as those classes have more
instances to train on. Meanwhile, as the rule-based methods do not require
training instances, they perform better with the WORKUP classes. In the future,
we plan to explore building a hybrid model combining the CNN and rule-based
methods to increase performance.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bort</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baskin</surname>
            ,
            <given-names>I.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horvath</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madzhidov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varnek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimadiev</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nugmanov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Discovery of novel chemical reactions by deep generative recurrent neural network (</article-title>
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Charles</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Project title</article-title>
          . https://github.com/charlespwd/project-title (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gridach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Character-level neural network for biomedical named entity recognition</article-title>
          .
          <source>Journal of biomedical informatics 70</source>
          ,
          <volume>85</volume>
          -
          <fpage>91</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bidirectional lstm-crf models for sequence tagging</article-title>
          .
          <source>arXiv preprint arXiv:1508.01991</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Leaman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
          </string-name>
          , G.:
          <article-title>Banner: an executable survey of advances in biomedical named entity recognition</article-title>
          .
          <source>In: Biocomputing</source>
          <year>2008</year>
          , pp.
          <volume>652</volume>
          -
          <fpage>663</fpage>
          . World Scientific (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>3111</volume>
          -
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshikawa</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Druckenbrodt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoessel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhondi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , et al.:
          <article-title>ChEMU: Named entity recognition and event extraction of chemical reactions from patents</article-title>
          .
          <source>In: European Conference on Information Retrieval</source>
          . pp.
          <fpage>572</fpage>
          –
          <lpage>579</lpage>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Relation extraction: Perspective from convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing</source>
          . pp.
          <fpage>39</fpage>
          –
          <lpage>48</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Paszke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradbury</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Killeen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimelshein</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antiga</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Desmaison</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopf</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeVito</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raison</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tejani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chilamkurthy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steiner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          . In:
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Alché-Buc</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          , pp.
          <fpage>8024</fpage>
          –
          <lpage>8035</lpage>
          . Curran Associates, Inc. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakoski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Distributional semantics resources for biomedical text processing</article-title>
          .
          <source>In: The 5th International Symposium on Languages in Biology and Medicine</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Construction of a generic reaction knowledge base by reaction data mining</article-title>
          .
          <source>Journal of Molecular Graphics and Modelling</source>
          <volume>19</volume>
          (
          <issue>5</issue>
          ),
          <fpage>427</fpage>
          –
          <lpage>433</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yoshikawa</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Druckenbrodt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhondi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verspoor</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Detecting chemical reactions in patents</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>