NLPatVCU CLEF 2020 ChEMU Shared Task System Description

Darshini Mahendran, Gabrielle Gurdin, Nastassja Lewinski, Christina Tang, and Bridget T. McInnes

Virginia Commonwealth University, Richmond VA 23220, USA
{mahendrand,gurding,nalewinski,ctang2,btmcinnes}@vcu.edu

Abstract. This paper describes our team's participation in Tracks 1 & 2 of the Conference and Labs of the Evaluation Forum (CLEF 2020) challenge organized by Cheminformatics Elsevier Melbourne University (ChEMU) for extracting information about chemical reactions from patents. We discuss our systems: MedaCy, a Python-based supervised multi-class entity recognition system, and RelEx, a Python-based relation extraction system that includes rule-based and supervised learning pipelines. Our best model for Task 1 obtained an overall relaxed precision of 0.95 and exact precision of 0.87; relaxed recall of 0.99 and exact recall of 0.86; and relaxed F1 score of 0.97 and exact F1 score of 0.87. Our best model for Task 2 obtained an overall precision of 0.80, recall of 0.54, and F1 score of 0.65.

Keywords: Named Entity Recognition (NER) · Relation Extraction (RE) · Event Extraction (EE)

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Chemical patents are a primary source of information about novel chemicals and chemical reactions. With the increasing volume of such patents, disseminating information about these chemicals and chemical reactions has become even more labor and time intensive. This information can be used to discover new chemicals and synthetic pathways [1][11]. Therefore, informatics tools for automatically extracting information from these documents are more important than ever.

The process of extracting relevant information from chemical patents has been referred to as chemical reaction detection [12]. Two of its main steps are identifying the different parts of a chemical reaction within these documents and then identifying the relationships between them. This can be accomplished with Named Entity Recognition (NER), the automatic labeling of spans within text that correspond to specific labels, and Event Extraction (EE), the automatic classification and linking of entities based on their relationships to each other.

The CLEF 2020 ChEMU [7] Task 1 aims to create systems that perform NER over chemical patents as the first step in chemical reaction detection. Specifically, the goal of this task is to automatically identify chemical compounds based on the role they play in a reaction, as well as other relevant information such as yield and temperature. The CLEF 2020 ChEMU Task 2 aims to create systems that perform EE over the entities to identify the individual steps in the reaction.

In this paper, we describe our participation in the CLEF 2020 ChEMU Task 1 and Task 2 challenge. For this challenge, we used our Python framework MedaCy (https://github.com/NLPatVCU/medaCy/) to automatically identify the experimental parameters associated with the reaction, including the trigger words used to link the parameters, and RelEx (https://github.com/NLPatVCU/RelEx/tree/CLEF_2020) to automatically link the trigger words with the experimental parameters to provide the sequence of steps within the reaction. MedaCy contains a number of supervised multi-label sequence classification algorithms for NER. RelEx contains rule-based and supervised learning-based algorithms to identify relations between entities.
Our best models for Task 1 obtained an overall relaxed precision of 0.95 and exact precision of 0.87; relaxed recall of 0.99 and exact recall of 0.86; and relaxed F1 score of 0.97 and exact F1 score of 0.87. Our best model for Task 2 obtained an overall precision of 0.80, recall of 0.54, and F1 score of 0.65.

2 Data

The CLEF 2020 data corpus [7] includes chemical entities and events that explain the sequence of steps leading a chemical reaction to an end product. It includes 10 different entity labels, described in Table 1.

Table 1: Entity type statistics of the dataset

Entity Type | Definition
REACTION PRODUCT (R.P.) | A product is a substance that is formed during a chemical reaction.
STARTING MATERIAL (S.M.) | A substance that is consumed in the course of a chemical reaction, providing atoms to products, is considered a starting material.
REAGENT CATALYST (R.C.) | A reagent is a compound added to a system to cause or help with a chemical reaction. Compounds such as catalysts, bases to remove protons, or acids to add protons are also annotated with this tag.
SOLVENT (S) | A solvent is a chemical entity that dissolves a solute, resulting in a solution.
OTHER COMPOUND (O.C.) | Other chemical compounds that are not products, starting materials, reagents, catalysts, or solvents.
TIME | The reaction time of the reaction.
TEMPERATURE (Temp) | The temperature of the reaction.
YIELD PERCENT (Y.P.) | Yields given in percent values.
YIELD OTHER (Y.O.) | Yields provided in units other than %.

The ARG1 event label corresponds to relations between a trigger word (REACTION STEP, WORKUP) and chemical compound entities. The ARGM event label corresponds to relations between a trigger word and temperature, time, or yield entities. Table 2 shows the event statistics of the training dataset.

Table 2: Number of entity types and trigger words in the training data and their event relations

Events | Entities | Instances | REACTION STEP | WORKUP
ARG1 | EXAMPLE LABEL | 886 | - | -
ARG1 | REACTION PRODUCT | 2052 | 1101 | 11
ARG1 | STARTING MATERIAL | 1754 | 1747 | 4
ARG1 | REAGENT CATALYST | 1281 | 1272 | -
ARG1 | SOLVENT | 1140 | 1134 | 4
ARG1 | OTHER COMPOUND | 4640 | 161 | 4097
ARGM | YIELD PERCENT | 955 | 937 | 1
ARGM | YIELD OTHER | 1061 | 1043 | 2
ARGM | TIME | 1059 | 839 | 81
ARGM | TEMPERATURE | 1515 | 813 | 242
Triggers | REACTION STEP | 3815 | |
Triggers | WORKUP | 3053 | |

3 Methods

This section describes the underlying methodology of our system.

3.1 Named Entity Recognition and Trigger Detection

To identify the experimental parameters and triggers from the data, we use MedaCy's bidirectional Long Short-Term Memory (LSTM) network with a Conditional Random Field (CRF) output layer, implemented in PyTorch [9]. LSTMs [4] are a type of recurrent neural network. They take the current input example as well as what they have seen in the past as their input; hence, they have two sources of input, their current state and their past states. This allows them to connect previous observations, such as words in a sentence, and learn dependencies between these words over arbitrarily long distances. They incorporate the functionality to identify what information should be passed to the next component and what should not, allowing only relevant information to be passed on. In bidirectional LSTMs (biLSTMs), data are processed in both directions by two separate hidden layers, which are then fed forward into the same output layer. This allows the system to exploit context in both directions.

A linear-chain CRF is used to assign the final class probability. CRFs are a sequence learning algorithm that incorporates the interdependence between labels into model induction and prediction. Using a CRF output therefore allows the model to use the preceding label predictions to inform which labels are most likely to follow or to occur close together.

The input to our biLSTM+CRF model is pre-trained word embeddings [6] in combination with character embeddings [3]. These embeddings are concatenated and then passed through the network. The word2vec [6] embeddings are derived from a neural network that learns a representation of a word-word co-occurrence matrix. The character embeddings are learned using a biLSTM and concatenated onto the word2vec embeddings. Fig. 1 shows a simple example for the term mice. This network is especially valuable for providing input in the case of out-of-vocabulary words. In chemical patents, many tokens are long chemical names that do not appear in the dataset used to train the word embeddings, such as the reaction product 3-Isobutyl-5-methyl-1-(oxetan-2-ylmethyl)-6-[(2-oxoimidazolidin-1-yl)methyl]thieno[2,3-d]pyrimidine-2,4(1H,3H)-dione.

Fig. 1: An illustration of how character embedding works
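To make the input representation concrete, the sketch below shows one way to concatenate pre-trained word vectors with a character-level biLSTM summary of each token, following the description above. It is a minimal illustration rather than medaCy's actual implementation: the dimensions, vocabulary handling, and tag count are placeholder assumptions, and the CRF output layer is omitted (the final linear layer only produces the emission scores that a CRF would decode).

```python
import torch
import torch.nn as nn

class WordCharEncoder(nn.Module):
    """Sketch of the input layer described above: pre-trained word vectors
    concatenated with a character-level biLSTM summary of each token."""

    def __init__(self, word_vectors, n_chars, char_dim=25, char_hidden=25,
                 tag_hidden=100, n_tags=11):  # e.g., 10 entity types plus 'O' (assumption)
        super().__init__()
        # Pre-trained word embeddings (e.g., ChemPatent word2vec), frozen here.
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character biLSTM: its final hidden states form a per-token character vector.
        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True,
                                 bidirectional=True)
        word_dim = word_vectors.size(1)
        # Token-level biLSTM over [word vector ; character vector].
        self.tag_lstm = nn.LSTM(word_dim + 2 * char_hidden, tag_hidden,
                                batch_first=True, bidirectional=True)
        # Per-tag emission scores; a linear-chain CRF would decode over these.
        self.emissions = nn.Linear(2 * tag_hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_chars)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c))
        _, (h, _) = self.char_lstm(chars)            # h: (2, batch*seq, char_hidden)
        char_vec = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_vec], dim=-1)
        out, _ = self.tag_lstm(tokens)
        return self.emissions(out)                   # (batch, seq_len, n_tags)
```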
3.2 Task 2: Event Extraction

To identify the trigger words, we use our NER system medaCy as described above. To identify the arguments between the trigger words and the entities, we use RelEx, a Python-based relation extraction framework developed to identify relations between two entities. The framework contains two main components: 1) a rule-based method and 2) a Convolutional Neural Network (CNN)-based method. In this section, we provide a brief overview of each component.

Rule-based Method. RelEx's rule-based method uses the co-location information of the trigger words to determine whether, with respect to a given entity, a nearby trigger word refers to that entity. We use a breadth-first search to find the closest occurrence of a trigger word on either side of the entity, and all the closest occurrences of trigger words within a sentence. For each entity in the dataset, we traverse both sides until the closest occurrence of a trigger word is found, using the provided span values of the entities. We experimented with different traversal techniques to determine the best one: traverse left-only, traverse right-only, traverse left-first-then-right, and vice versa. In this work, we use left-only traversal, where we traverse to the left side of the entity mention to find the closest occurrence of a trigger word; an illustrative sketch follows.
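The snippet below is a simplified illustration of the left-only strategy, not RelEx's actual code: each entity is paired with the nearest trigger whose span ends before the entity starts. The character-offset span representation and the example sentence are assumptions made only for the demonstration.

```python
def link_left_only(entities, triggers):
    """Pair each entity with the closest trigger to its left (left-only traversal).

    entities, triggers: lists of (start, end, label) character spans taken from
    the same sentence. Entities with no trigger on their left are left unlinked.
    """
    pairs = []
    for ent in entities:
        ent_start = ent[0]
        # Candidate triggers whose span ends before the entity begins.
        left = [t for t in triggers if t[1] <= ent_start]
        if left:
            # Closest occurrence = smallest gap between trigger end and entity start.
            closest = min(left, key=lambda t: ent_start - t[1])
            pairs.append((closest, ent))
    return pairs

# Hypothetical sentence: "The mixture was stirred at 80 C for 2 h."
triggers = [(16, 23, "REACTION_STEP")]              # "stirred"
entities = [(27, 31, "TEMPERATURE"), (36, 39, "TIME")]
print(link_left_only(entities, triggers))
```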
CNN-based Method. RelEx's CNN-based method automatically extracts and classifies the events. CNNs are a form of deep neural network and consist of four main layers [8]: embedding, convolution, pooling, and feed-forward layers. CNNs allow word embeddings to be trained on the input text itself or taken from pre-trained word vectors obtained from an external resource. The convolution layer, a set of filters learned with the backpropagation algorithm, extracts features from the input. The max-pooling layer then uses the position information to extract the most significant feature from the output of each convolution filter. Finally, the feed-forward layer uses a softmax classifier to perform the classification.

In this work, for each trigger word-entity pair we perform a binary classification to identify whether or not there is a relation between the trigger word and the entity. First, we identify and extract the sentence in which a trigger word-entity pair lies and, based on where the text spans are located in the sentence, divide the sentence into the following segments:

- preceding - tokenized words before the first concept
- concept 1 - tokenized words in the first concept
- middle - tokenized words between the two concepts
- concept 2 - tokenized words in the second concept
- succeeding - tokenized words after the second concept

Fig. 2: An illustration of our model for the CNN-based method

Figure 2 shows an abstract view of the construction of the CNN-based model. A segment is represented by a k × N matrix, where k is the dimension of the word embeddings and N is the number of words in the segment. In this work, we use ChemPatent pre-trained word embeddings. We construct a separate convolution unit for each segment and concatenate their outputs before the fixed-length vector is fed to the dense layer that performs the classification. Each convolution unit applies a sliding window that processes the segment and feeds the output to a max-pooling layer to extract important features independent of their location. The output features of the max-pooling layer of each segment are then flattened and concatenated into a vector before being fed into the fully connected feed-forward layer. The vector is finally fed into a softmax layer that performs the binary classification of whether the relationship exists or not.
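A rough sketch of such a segment-wise CNN is shown below using Keras (the toolkit noted in Section 3.3). The vocabulary size, embedding dimension, filter settings, and segment length are placeholder values rather than RelEx's tuned configuration, and the embedding layers would normally be initialized with the ChemPatent vectors.

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB, EMB_DIM = 30, 20000, 200   # placeholder sizes (assumptions)
SEGMENTS = ["preceding", "concept1", "middle", "concept2", "succeeding"]

def segment_branch(name):
    """One convolution unit: embedding -> Conv1D -> global max pooling."""
    inp = layers.Input(shape=(MAX_LEN,), name=name)
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)   # could be seeded with ChemPatent vectors
    x = layers.Conv1D(filters=100, kernel_size=3, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return inp, x

# Build one branch per segment, then concatenate the pooled features.
inputs, features = zip(*[segment_branch(s) for s in SEGMENTS])
merged = layers.Concatenate()(list(features))
merged = layers.Dropout(0.5)(merged)            # regularize the pooled conv outputs
hidden = layers.Dense(128, activation="relu")(merged)
output = layers.Dense(2, activation="softmax")(hidden)  # relation vs. no relation

model = keras.Model(inputs=list(inputs), outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```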
3.3 Experimental Details

Word Embeddings. We explore two sets of pre-trained word embeddings in our methods: 1) ChemPatent embeddings [7], trained over a collection of 84,076 full patent documents (1B tokens); and 2) WikiPubmed embeddings [10].

MedaCy. We used PyTorch [9] for the implementation of the biLSTM+CRF model. Models were trained for 40 epochs and optimized using stochastic gradient descent. A window size of 0 generated the best results. Tokenization was conducted using the spaCy tokenizer. The labels are strictly the entity types.

RelEx. We used Keras [2] for the implementation of the CNN architecture. We experimented with different sliding window sizes, filter sizes, and loss functions for fine-tuning; in this work, small filter sizes generated the best results. We applied dropout on the output of the convolution layer to regularize the model. We used the Adam and RMSprop optimizers to minimize our loss function. We trained the models for 5-10 epochs to avoid over-fitting.

3.4 Evaluation

For Tasks 1 and 2, we report precision, recall, and F1 scores. Precision is the ratio of correctly predicted mentions to the total set of predicted mentions for a specific entity; recall is the ratio of correctly predicted mentions to the actual number of mentions; and F1 is the harmonic mean of precision and recall. For Task 1, we report both exact and relaxed results for each entity category. In exact evaluation, two annotations are equal only if they have the same tag with exactly matching spans. In relaxed evaluation, two annotations are equal if they share the same tag and their spans overlap.

4 Results and Discussion

In this section, we discuss the results for Tasks 1 and 2.

4.1 Task 1: Named Entity Recognition

Results. Tables 3-5 show the exact and relaxed precision, recall, and F1 scores obtained over the testing set for identifying the named entities in each of our three runs. The Run 1 model was trained over the training data using the biLSTM+CRF with the ChEMU patent embeddings; the Run 2 model was trained over the training data using the biLSTM+CRF with the WikiPubmed embeddings; and the Run 3 model was trained over the combined training and development data using the biLSTM+CRF with the WikiPubmed embeddings. Table 6 shows the baseline results using the CRF-based NER system BANNER [5] provided by the organizers, together with the overall results of each of our runs.

Table 3: Run 1: Precision (P), Recall (R), and F1 results using biLSTM+CRF trained over training data with ChEMU patent embeddings

Entity | Exact P | Exact R | Exact F1 | Relaxed P | Relaxed R | Relaxed F1
EXAMPLE LABEL | 0.94 | 0.95 | 0.94 | 0.94 | 0.98 | 0.96
OTHER COMPOUND | 0.9 | 0.82 | 0.86 | 0.97 | 0.99 | 0.98
REACTION PRODUCT | 0.84 | 0.83 | 0.83 | 0.9 | 0.97 | 0.94
REAGENT CATALYST | 0.85 | 0.9 | 0.87 | 0.88 | 0.99 | 0.93
SOLVENT | 0.91 | 0.94 | 0.93 | 0.92 | 1 | 0.96
STARTING MATERIAL | 0.85 | 0.84 | 0.85 | 0.91 | 1 | 0.95
TEMPERATURE | 0.63 | 0.63 | 0.63 | 0.99 | 0.99 | 0.99
TIME | 0.88 | 0.88 | 0.88 | 1 | 1 | 1
YIELD OTHER | 0.95 | 0.98 | 0.97 | 0.96 | 1 | 0.98
YIELD PERCENT | 0.99 | 0.99 | 0.99 | 1 | 1 | 1
System | 0.87 | 0.85 | 0.86 | 0.95 | 0.99 | 0.97

Table 4: Run 2: Precision (P), Recall (R), and F1 results using biLSTM+CRF trained over training data with WikiPubmed embeddings

Entity | Exact P | Exact R | Exact F1 | Relaxed P | Relaxed R | Relaxed F1
EXAMPLE LABEL | 0.98 | 0.93 | 0.95 | 0.98 | 0.98 | 0.96
OTHER COMPOUND | 0.89 | 0.84 | 0.87 | 0.95 | 0.98 | 0.96
REACTION PRODUCT | 0.83 | 0.82 | 0.82 | 0.9 | 0.97 | 0.94
REAGENT CATALYST | 0.86 | 0.89 | 0.87 | 0.89 | 1 | 0.43
SOLVENT | 0.94 | 0.91 | 0.93 | 0.95 | 0.99 | 0.97
STARTING MATERIAL | 0.85 | 0.83 | 0.84 | 0.91 | 0.99 | 0.95
TEMPERATURE | 0.63 | 0.63 | 0.63 | 0.99 | 0.99 | 0.99
TIME | 0.88 | 0.87 | 0.87 | 1 | 0.99 | 1
YIELD OTHER | 0.97 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98
YIELD PERCENT | 1 | 0.99 | 0.99 | 1 | 0.99 | 0.99
System | 0.87 | 0.85 | 0.86 | 0.95 | 0.98 | 0.96

Table 5: Run 3: Precision (P), Recall (R), and F1 results using biLSTM+CRF trained over training and development data with WikiPubmed embeddings

Entity | Exact P | Exact R | Exact F1 | Relaxed P | Relaxed R | Relaxed F1
EXAMPLE LABEL | 0.96 | 0.94 | 0.95 | 0.95 | 0.96 | 0.95
OTHER COMPOUND | 0.9 | 0.84 | 0.87 | 0.96 | 0.98 | 0.97
REACTION PRODUCT | 0.8 | 0.82 | 0.81 | 0.88 | 0.98 | 0.93
REAGENT CATALYST | 0.9 | 0.88 | 0.89 | 0.93 | 0.99 | 0.96
SOLVENT | 0.94 | 0.93 | 0.94 | 0.94 | 0.99 | 0.96
STARTING MATERIAL | 0.88 | 0.86 | 0.87 | 0.92 | 0.99 | 0.95
TEMPERATURE | 0.63 | 0.63 | 0.63 | 0.99 | 0.99 | 0.99
TIME | 0.88 | 0.88 | 0.88 | 1 | 1 | 1
YIELD OTHER | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98
YIELD PERCENT | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99
System | 0.87 | 0.86 | 0.87 | 0.95 | 0.98 | 0.97

Overall, the biLSTM+CRF model trained using patent embeddings returned the best results, obtaining a 96.78% system-wide relaxed F1 score. This model performed better than the baseline for all entity labels except EXAMPLE LABEL, for which it performed almost identically. This model's performance is likely due to the domain-relevant information contained within the embeddings.

The best performance for exact evaluation resulted from the model trained over a combination of the training and development sets. However, this model's overall performance was worse than the baseline model. Still, we believe this model's better performance compared to our other models may be due to the increased volume of training data provided by the addition of the development set.

Although the exact results for our models were slightly worse than the baseline, each of the models performed better on the relaxed results, with the model trained over patent embeddings performing best. This discrepancy may be due to the way MedaCy handles entity classification. Within MedaCy, each individual token is given its own label ('O' for unlabeled tokens), so entities whose spans are longer than one token may have been only partially labeled. For instance, in many cases of the TEMPERATURE label, MedaCy labeled 'C' or '°C', excluding the number preceding the temperature symbol. This may also account for why each model performed poorly for the TEMPERATURE label when evaluated in exact mode but performed well when evaluated in relaxed mode.
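The difference between the two evaluation modes, and why a partial span such as a lone '°C' still counts in relaxed mode, can be illustrated with a small span-matching check. The snippet below is an illustrative reading of the definitions in Section 3.4, not the official ChEMU scorer, and the offsets are made up.

```python
def exact_match(gold, pred):
    """Same tag and identical character span."""
    return gold == pred

def relaxed_match(gold, pred):
    """Same tag and overlapping character spans."""
    (gs, ge, gt), (ps, pe, pt) = gold, pred
    return gt == pt and gs < pe and ps < ge

# Gold annotation "50 C" vs. a prediction that only covers the trailing "C".
gold = (120, 124, "TEMPERATURE")
pred = (123, 124, "TEMPERATURE")
print(exact_match(gold, pred))    # False: counted as an error in exact mode
print(relaxed_match(gold, pred))  # True: the spans overlap, so relaxed mode accepts it
```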
Error Analysis. Confusion matrices for the three runs over the testing dataset are shown in Figures 3-5. Rows in each matrix represent annotated entities and columns represent predicted entities. For instance, in Figure 3, YIELD OTHER (Y.O.) was misidentified as YIELD PERCENT (Y.P.) 28 times. Table 7 shows the acronym for each of the labels used in the confusion matrices. The colors in the matrices indicate the density of the entities and the system annotations; the bottom right corner of each matrix is darker because of the large number of OTHER COMPOUND (O.C.) entities in the dataset.

Fig. 3: Run 1 confusion matrix using biLSTM+CRF trained over training data with ChEMU patent embeddings

Fig. 4: Run 2 confusion matrix using biLSTM+CRF trained over training data with WikiPubmed embeddings

Fig. 5: Run 3 confusion matrix using biLSTM+CRF trained over training + development data with WikiPubmed embeddings

Table 6: Task 1 baseline results

Run | Exact P | Exact R | Exact F1 | Relaxed P | Relaxed R | Relaxed F1
Run 1 | 0.87 | 0.85 | 0.86 | 0.95 | 0.99 | 0.97
Run 2 | 0.87 | 0.85 | 0.86 | 0.95 | 0.98 | 0.96
Run 3 | 0.87 | 0.85 | 0.87 | 0.95 | 0.98 | 0.97
Baseline | 0.91 | 0.87 | 0.89 | 0.92 | 0.95 | 0.94

Table 7: Key for the confusion matrix figures

Label | Acronym
EXAMPLE LABEL | E.L.
REACTION PRODUCT | R.P.
STARTING MATERIAL | S.M.
REAGENT CATALYST | R.C.
SOLVENT | S
OTHER COMPOUND | O.C.
YIELD PERCENT | Y.P.
YIELD OTHER | Y.O.
TIME | Time
TEMPERATURE | Temp

The majority of mislabeling occurred when more specific entity labels, such as STARTING MATERIAL (S.M.), REAGENT CATALYST (R.C.), or REACTION PRODUCT (R.P.), were predicted to be OTHER COMPOUND (O.C.). This may be because the models were able to predict that certain spans contained chemical names but were too general to predict the specific label. Additionally, spans annotated as OTHER COMPOUND (O.C.) were consistently predicted to be more specific types of compounds. It seems that while the models are able to predict which spans contain chemical compounds, they are less able to distinguish between the types of compounds.

4.2 Task 2: Event Extraction

Results. Tables 8-10 show the exact match precision, recall, and F1 scores obtained over the testing set for each of our three runs. Run 1 used RelEx's CNN-based system trained over the ChemPatent embeddings, with the trigger words identified using medaCy's biLSTM+CRF trained over the ChemPatent embeddings. Run 2 used RelEx's rule-based system with the trigger words identified using medaCy's biLSTM+CRF trained with the ChemPatent embeddings. Run 3 used the rule-based system with the trigger words identified using medaCy's biLSTM+CRF trained with the WikiPubmed embeddings. Table 11 shows the comparison with the co-occurrence baseline provided by the organizers of the ChEMU challenge, together with the overall results of each of our runs. All three runs obtain higher precision and F1 scores than the baseline, but lower recall.
The system results show that the CNN-based model (Run 1) obtains a higher overall F1 score than both rule-based models (Runs 2 & 3). With the CNN, the overall precision of the predictions is high but the recall is low, showing that the CNN failed to identify all instances but classified most of the instances it did predict correctly. We can also see that the performance of each event class (trigger word-entity pair) in Run 1 is proportional to the number of instances in the training set. For example, event classes with more training instances, such as REACTION STEP-REAGENT CATALYST and REACTION STEP-STARTING MATERIAL, obtain a high F1 score, whereas event classes with very few instances, such as WORKUP-SOLVENT and WORKUP-STARTING MATERIAL, obtain an F1 score of zero. The rule-based models (Runs 2 & 3) obtain comparatively high recall and low precision.

Table 8: Run 1: Precision (P), Recall (R), and F1 results using the CNN-based system with trigger words identified using medaCy trained with ChEMU patent embeddings

Argument | Trigger | Entity | # Train | P | R | F1
ARG1 | REACTION STEP | OTHER COMPOUND | 161 | 0.00 | 0.00 | 0.00
ARG1 | REACTION STEP | REACTION PRODUCT | 1101 | 0.92 | 0.96 | 0.94
ARG1 | REACTION STEP | REAGENT CATALYST | 1272 | 0.78 | 0.69 | 0.74
ARG1 | REACTION STEP | SOLVENT | 1134 | 0.64 | 0.74 | 0.69
ARG1 | REACTION STEP | STARTING MATERIAL | 1747 | 0.82 | 0.43 | 0.56
ARG1 | WORKUP | OTHER COMPOUND | 4097 | 0.73 | 0.29 | 0.42
ARG1 | WORKUP | REACTION PRODUCT | 11 | 0.00 | 0.00 | 0.00
ARG1 | WORKUP | SOLVENT | 4 | 0.00 | 0.00 | 0.00
ARG1 | WORKUP | STARTING MATERIAL | 4 | 0.00 | 0.00 | 0.00
ARGM | REACTION STEP | TEMPERATURE | 813 | 0.83 | 0.30 | 0.44
ARGM | REACTION STEP | TIME | 839 | 0.78 | 0.73 | 0.75
ARGM | REACTION STEP | YIELD OTHER | 1043 | 0.93 | 0.96 | 0.95
ARGM | REACTION STEP | YIELD PERCENT | 937 | 0.91 | 0.94 | 0.92
ARGM | WORKUP | TEMPERATURE | 242 | 0.56 | 0.08 | 0.14
ARGM | WORKUP | TIME | 81 | 0.00 | 0.00 | 0.00
System | | | | 0.81 | 0.54 | 0.65

Table 9: Run 2: Precision (P), Recall (R), and F1 results using the rule-based system with trigger words identified using medaCy trained with ChEMU patent embeddings

Argument | Trigger | Entity | # Train | P | R | F1
ARG1 | REACTION STEP | OTHER COMPOUND | 161 | 0.02 | 0.63 | 0.04
ARG1 | REACTION STEP | REACTION PRODUCT | 1101 | 0.82 | 0.78 | 0.80
ARG1 | REACTION STEP | REAGENT CATALYST | 1272 | 0.52 | 0.35 | 0.42
ARG1 | REACTION STEP | SOLVENT | 1134 | 0.81 | 0.55 | 0.65
ARG1 | REACTION STEP | STARTING MATERIAL | 1747 | 0.63 | 0.31 | 0.41
ARG1 | WORKUP | OTHER COMPOUND | 4097 | 0.90 | 0.86 | 0.88
ARG1 | WORKUP | REACTION PRODUCT | 11 | 0.01 | 1.00 | 0.02
ARG1 | WORKUP | REAGENT CATALYST | - | 0.00 | 0.00 | 0.00
ARG1 | WORKUP | SOLVENT | 4 | 0.07 | 1.00 | 0.14
ARG1 | WORKUP | STARTING MATERIAL | 4 | 0.04 | 1.00 | 0.08
ARGM | REACTION STEP | TEMPERATURE | 813 | 0.77 | 0.89 | 0.83
ARGM | REACTION STEP | TIME | 839 | 0.85 | 0.93 | 0.89
ARGM | REACTION STEP | YIELD OTHER | 1043 | 0.83 | 0.80 | 0.81
ARGM | REACTION STEP | YIELD PERCENT | 937 | 0.86 | 0.85 | 0.85
ARGM | WORKUP | TEMPERATURE | 242 | 0.66 | 0.81 | 0.73
ARGM | WORKUP | TIME | 81 | 0.36 | 0.53 | 0.43
ARGM | WORKUP | YIELD OTHER | 2 | 0.00 | 0.00 | 0.00
ARGM | WORKUP | YIELD PERCENT | 1 | 0.00 | 0.00 | 0.00
System | | | | 0.51 | 0.72 | 0.60
The rule-based method predicts all the closest occurrences of the trigger words for the entity compounds in the traversal area; however, many of these predictions are false positives. Since the number of instances in the training set does not affect the rule-based methods, event classes with few instances perform better. For example, the event classes WORKUP-TIME and REACTION STEP-OTHER COMPOUND obtained an F1 score of zero with the CNN-based model but performed better with the rule-based models, obtaining F1 scores of 0.43 and 0.88, respectively.

Table 10: Run 3: Precision (P), Recall (R), and F1 results using the rule-based system with trigger words identified using medaCy trained with WikiPubmed embeddings

Argument | Trigger | Entity | # Train | P | R | F1
ARG1 | REACTION STEP | OTHER COMPOUND | 161 | 0.02 | 0.63 | 0.04
ARG1 | REACTION STEP | REACTION PRODUCT | 1101 | 0.82 | 0.78 | 0.80
ARG1 | REACTION STEP | REAGENT CATALYST | 1272 | 0.52 | 0.35 | 0.42
ARG1 | REACTION STEP | SOLVENT | 1134 | 0.81 | 0.54 | 0.65
ARG1 | REACTION STEP | STARTING MATERIAL | 1747 | 0.62 | 0.30 | 0.40
ARG1 | WORKUP | OTHER COMPOUND | 4097 | 0.90 | 0.86 | 0.88
ARG1 | WORKUP | REACTION PRODUCT | 11 | 0.01 | 1.00 | 0.02
ARG1 | WORKUP | REAGENT CATALYST | - | 0.00 | 0.00 | 0.00
ARG1 | WORKUP | SOLVENT | 4 | 0.07 | 1.00 | 0.13
ARG1 | WORKUP | STARTING MATERIAL | 4 | 0.03 | 1.00 | 0.07
ARGM | REACTION STEP | TEMPERATURE | 813 | 0.85 | 0.89 | 0.82
ARGM | REACTION STEP | TIME | 839 | 0.78 | 0.93 | 0.89
ARGM | REACTION STEP | YIELD OTHER | 1043 | 0.82 | 0.80 | 0.81
ARGM | REACTION STEP | YIELD PERCENT | 937 | 0.86 | 0.85 | 0.85
ARGM | WORKUP | TEMPERATURE | 242 | 0.61 | 0.85 | 0.71
ARGM | WORKUP | TIME | 81 | 0.36 | 0.60 | 0.45
ARGM | WORKUP | YIELD OTHER | 2 | 0.00 | 0.00 | 0.00
ARGM | WORKUP | YIELD PERCENT | 1 | 0.00 | 0.00 | 0.00
System | | | | 0.51 | 0.71 | 0.59

Table 11: Task 2 baseline evaluation

Run | P | R | F1
Run 1 | 0.81 | 0.54 | 0.65
Run 2 | 0.51 | 0.72 | 0.60
Run 3 | 0.51 | 0.71 | 0.59
Baseline | 0.38 | 0.89 | 0.38

Table 12 shows the arithmetic mean and the weighted arithmetic mean of the precision, recall, and F1 score for both trigger word classes for each run.

Table 12: Arithmetic and weighted arithmetic mean of the performance of the trigger words for each run

Trigger | Run | Arithmetic P | Arithmetic R | Arithmetic F1 | Weighted P | Weighted R | Weighted F1
REACTION STEP | Run 1 | 0.73 | 0.64 | 0.67 | 0.81 | 0.69 | 0.73
REACTION STEP | Run 2 | 0.68 | 0.68 | 0.63 | 0.73 | 0.63 | 0.66
REACTION STEP | Run 3 | 0.68 | 0.67 | 0.63 | 0.73 | 0.63 | 0.65
WORKUP | Run 1 | 0.14 | 0.04 | 0.06 | 0.70 | 0.28 | 0.40
WORKUP | Run 2 | 0.23 | 0.58 | 0.25 | 0.87 | 0.85 | 0.86
WORKUP | Run 3 | 0.22 | 0.59 | 0.25 | 0.87 | 0.85 | 0.86

We can see that the CNN-based method (Run 1) performs well for the REACTION STEP classes and poorly for the WORKUP classes. This is because most of the REACTION STEP classes have many instances for the CNN to train on, while most of the WORKUP classes have few instances; for the same reason, the rule-based methods (Runs 2 & 3) perform better on those classes. The weighted arithmetic mean results differ from the arithmetic mean results, as we can see a notable difference in the F1 score when comparing the REACTION STEP and WORKUP classes. The WORKUP event class obtains better performance due to the significant imbalance between the individual event classes: the weighted arithmetic mean allocates more weight to classes that have more instances and less to those that have fewer, and we see an improvement in the performance of both classes.
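To make the difference between the two aggregates in Table 12 concrete, the snippet below computes both over a pair of per-class F1 scores. The scores and supports are illustrative values loosely modeled on the WORKUP rows of Table 9, not the exact test-set figures.

```python
def arithmetic_mean(scores):
    """Unweighted mean over classes."""
    return sum(scores.values()) / len(scores)

def weighted_mean(scores, support):
    """Mean weighted by the number of instances in each class."""
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total

# Hypothetical per-class F1 for one trigger type: a frequent class scoring
# well and a rare class scoring poorly.
f1 = {"OTHER COMPOUND": 0.88, "SOLVENT": 0.14}
support = {"OTHER COMPOUND": 4097, "SOLVENT": 4}
print(round(arithmetic_mean(f1), 2))         # 0.51 - the rare class drags the mean down
print(round(weighted_mean(f1, support), 2))  # 0.88 - dominated by the frequent class
```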
Error Analysis. Tables 13 and 14 show a detailed error analysis of the CNN-based method (Run 1) and the rule-based method (Run 2), respectively, where the trigger words are identified with the model trained on the ChemPatent embeddings. We report the number of true positives (tp), false positives (fp), and false negatives (fn), as well as "fpm" and "fnm", two metrics that count the false positives and false negatives whose corresponding entities are missing.

Table 13: Error analysis for the CNN model trained with ChemPatent embeddings

Argument | Trigger | Entity | tp | fp | fn | fpm | fnm
ARG1 | REACTION STEP | OTHER COMPOUND | 0 | 0 | 63 | 0 | 11
ARG1 | REACTION STEP | REACTION PRODUCT | 436 | 36 | 16 | 11 | 3
ARG1 | REACTION STEP | REAGENT CATALYST | 350 | 97 | 155 | 17 | 8
ARG1 | REACTION STEP | SOLVENT | 316 | 179 | 111 | 16 | 7
ARG1 | REACTION STEP | STARTING MATERIAL | 305 | 68 | 406 | 12 | 9
ARG1 | WORKUP | OTHER COMPOUND | 516 | 192 | 1234 | 23 | 73
ARG1 | WORKUP | REACTION PRODUCT | 0 | 0 | 4 | 0 | 0
ARG1 | WORKUP | REAGENT CATALYST | - | - | - | - | -
ARG1 | WORKUP | SOLVENT | 0 | 0 | 2 | 0 | 0
ARG1 | WORKUP | STARTING MATERIAL | 0 | 0 | 1 | 0 | 0
ARGM | REACTION STEP | TEMPERATURE | 151 | 30 | 352 | 15 | 15
ARGM | REACTION STEP | TIME | 300 | 87 | 113 | 16 | 10
ARGM | REACTION STEP | YIELD OTHER | 418 | 31 | 17 | 11 | 3
ARGM | REACTION STEP | YIELD PERCENT | 361 | 36 | 23 | 13 | 3
ARGM | WORKUP | TEMPERATURE | 9 | 7 | 101 | 0 | 20
ARGM | WORKUP | TIME | 0 | 0 | 43 | 0 | 13
ARGM | WORKUP | YIELD OTHER | - | - | - | - | -
ARGM | WORKUP | YIELD PERCENT | - | - | - | - | -
System | | | 3162 | 763 | 2641 | 134 | 175

The results are consistent with the observations from Tables 8, 9, and 10: the REACTION STEP classes performed better than the WORKUP classes. It is safe to say that class imbalance plays a significant role in the misannotation of instances. The results also show that the rule-based model significantly over-annotates, given the number of false positives. For example, the rule-based model (Run 2) identified 379 instances of the WORKUP-REACTION PRODUCT event class, of which only four were true positives.

Table 14: Error analysis for the rule-based model where trigger words are trained with ChemPatent embeddings

Argument | Trigger | Entity | tp | fp | fn | fpm | fnm
ARG1 | REACTION STEP | OTHER COMPOUND | 40 | 1798 | 23 | 18 | 11
ARG1 | REACTION STEP | REACTION PRODUCT | 351 | 75 | 101 | 10 | 3
ARG1 | REACTION STEP | REAGENT CATALYST | 177 | 162 | 328 | 8 | 8
ARG1 | REACTION STEP | SOLVENT | 234 | 54 | 193 | 4 | 7
ARG1 | REACTION STEP | STARTING MATERIAL | 217 | 128 | 494 | 15 | 9
ARG1 | WORKUP | OTHER COMPOUND | 1501 | 171 | 249 | 54 | 73
ARG1 | WORKUP | REACTION PRODUCT | 4 | 375 | 0 | 9 | 0
ARG1 | WORKUP | REAGENT CATALYST | 0 | 40 | 0 | 9 | 0
ARG1 | WORKUP | SOLVENT | 2 | 25 | 0 | 5 | 0
ARG1 | WORKUP | STARTING MATERIAL | 1 | 24 | 0 | 2 | 0
ARGM | REACTION STEP | TEMPERATURE | 450 | 131 | 53 | 29 | 15
ARGM | REACTION STEP | TIME | 386 | 66 | 27 | 21 | 10
ARGM | REACTION STEP | YIELD OTHER | 350 | 74 | 85 | 11 | 3
ARGM | REACTION STEP | YIELD PERCENT | 326 | 55 | 58 | 11 | 3
ARGM | WORKUP | TEMPERATURE | 89 | 45 | 21 | 13 | 20
ARGM | WORKUP | TIME | 23 | 41 | 20 | 16 | 13
ARGM | WORKUP | YIELD OTHER | 0 | 367 | 0 | 10 | 0
ARGM | WORKUP | YIELD PERCENT | 0 | 325 | 0 | 8 | 0
System | | | 4151 | 3957 | 1652 | 421 | 175

5 Conclusion

We trained three biLSTM+CRF models over different pre-trained word embeddings as well as differently sized datasets. Results show that while these models did not outperform the baseline model when evaluating exact span matches, they outperformed the baseline when evaluating in relaxed mode. A model trained using word embeddings trained over chemical patents performed best when evaluated in relaxed mode, while a model trained using biomedical word embeddings and a combination of the training and development datasets performed best when evaluated on exact span matches. Errors primarily occurred because of difficulties distinguishing between different entity labels, such as the models mislabeling entities annotated as OTHER COMPOUND with more specific labels like REACTION PRODUCT or STARTING MATERIAL. Additionally, the way MedaCy predicts entity labels may have contributed to errors in labeling entity spans fully. Future work will focus on better distinguishing between different types of chemical compounds, as well as looking into approaches based on language models.

We used one CNN-based model and two rule-based models to extract events, and according to the results, all three models outperformed the baseline model. Results show that the CNN-based method outperforms the rule-based methods, especially on the REACTION STEP classes, as those classes have more instances to train on. Meanwhile, since the rule-based methods do not require training instances, they perform better on the WORKUP classes.
In the future, we plan to explore building a hybrid model that combines the CNN-based and rule-based methods to improve performance.

References

1. Bort, W., Baskin, I.I., Sidorov, P., Marcou, G., Horvath, D., Madzhidov, T., Varnek, A., Gimadiev, T., Nugmanov, R., Mukanov, A.: Discovery of novel chemical reactions by deep generative recurrent neural network (2020)
2. Charles, P.: Project title. https://github.com/charlespwd/project-title (2013)
3. Gridach, M.: Character-level neural network for biomedical named entity recognition. Journal of Biomedical Informatics 70, 85-91 (2017)
4. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
5. Leaman, R., Gonzalez, G.: BANNER: an executable survey of advances in biomedical named entity recognition. In: Biocomputing 2008, pp. 652-663. World Scientific (2008)
6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111-3119 (2013)
7. Nguyen, D.Q., Zhai, Z., Yoshikawa, H., Fang, B., Druckenbrodt, C., Thorne, C., Hoessel, R., Akhondi, S.A., Cohn, T., Baldwin, T., et al.: ChEMU: Named entity recognition and event extraction of chemical reactions from patents. In: European Conference on Information Retrieval. pp. 572-579. Springer (2020)
8. Nguyen, T.H., Grishman, R.: Relation extraction: Perspective from convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. pp. 39-48 (2015)
9. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc. (2019)
10. Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., Ananiadou, S.: Distributional semantics resources for biomedical text processing. In: The 5th International Symposium on Languages in Biology and Medicine (2013)
11. Wang, K., Wang, L., Yuan, Q., Luo, S., Yao, J., Yuan, S., Zheng, C., Brandt, J.: Construction of a generic reaction knowledge base by reaction data mining. Journal of Molecular Graphics and Modelling 19(5), 427-433 (2001)
12. Yoshikawa, H., Nguyen, D.Q., Zhai, Z., Druckenbrodt, C., Thorne, C., Akhondi, S.A., Baldwin, T., Verspoor, K.: Detecting chemical reactions in patents (2019)