Melaxtech: A report for CLEF 2020 – ChEMU Task of Chemical Reaction Extraction from Patent

Jingqi Wang1, Yuankai Ren2, Zhi Zhang2, and Yaoyun Zhang1
1 Melax Technologies, Inc, Houston, TX, USA
2 Nantong University School of Medicine, Nantong, Jiangsu, China
Yaoyun.Zhang@melaxtech.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. This work describes the participation of the Melaxtech team in the CLEF 2020 – ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition, to identify compounds and the different semantic roles they play in a chemical reaction; and (2) event extraction, to identify event triggers of chemical reactions and their relations with the semantic roles recognized in subtask 1. We developed hybrid approaches combining both deep learning models and pattern-based rules for this task. Our approaches achieved state-of-the-art results in both subtasks, with a best F1 of 0.957 for entity recognition and a best F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.

Keywords: named entity recognition, event extraction, chemical reaction.

1 Introduction

New compound discovery plays a vital role in the chemical and pharmaceutical industry.[1] Characteristics of compounds, such as their reactions and experimental conditions, are fundamental information for chemical research and applications.[2] The latest information on chemical reactions is usually found in patents and is embedded in free text.[3] The rapidly accumulating chemical patents call for automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction.[4]

Fortunately, the CLEF 2020 – ChEMU Task takes the initiative to promote chemical reaction extraction from patents by providing benchmark annotated datasets. Two important subtasks are set up in this challenge: chemical named entity recognition (NER) and chemical reaction event extraction. In particular, the annotation scheme of this benchmark data extends previous challenges of chemical information extraction[5][6] to recognize multiple semantic roles of chemical substances in the reaction. Moreover, event trigger keywords and their relations with each semantic role are also annotated and provided for this task. The CLEF 2020 – ChEMU Task will greatly facilitate the development of automatic NLP tools for chemical reaction extraction from patents through community efforts.[7][8]

This work describes our participation in both subtasks of CLEF 2020 – ChEMU. We developed hybrid approaches combining deep learning models and pattern-based rules for the information extraction systems. Our approaches achieved the top rank in both subtasks, with a best F1 of 0.957 for entity recognition and a best F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.

2 Methods

Dataset: The dataset provided by the CLEF 2020 – ChEMU Task was split into training data, development data and test data for open evaluation. For subtask 1, it was annotated with 10 entity type labels describing different semantic roles in chemical reactions, including EXAMPLE_LABEL, STARTING_MATERIAL, REAGENT_CATALYST, REACTION_PRODUCT, SOLVENT, TIME, TEMPERATURE, YIELD_PERCENT, YIELD_OTHER, and OTHER_COMPOUND.
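Assuming the annotations are distributed in a BRAT-style standoff format (one .ann file per document, with tab-separated entity lines such as "T1<TAB>STARTING_MATERIAL 23 45<TAB>methyl acetate"), a minimal sketch for loading the subtask 1 entity annotations is given below. The file layout, function name, and parsing details are illustrative assumptions, not a description of the exact release format.

```python
# Minimal sketch: load subtask 1 entity annotations, assuming a BRAT-style
# standoff format. The exact layout of the released files may differ.
from typing import List, Tuple

ENTITY_LABELS = {
    "EXAMPLE_LABEL", "STARTING_MATERIAL", "REAGENT_CATALYST",
    "REACTION_PRODUCT", "SOLVENT", "TIME", "TEMPERATURE",
    "YIELD_PERCENT", "YIELD_OTHER", "OTHER_COMPOUND",
}

def load_entities(ann_path: str) -> List[Tuple[str, int, int, str]]:
    """Return (label, start, end, text) tuples for subtask 1 entities."""
    entities = []
    with open(ann_path, encoding="utf-8") as f:
        for line in f:
            if not line.startswith("T"):          # keep entity lines only
                continue
            tid, meta, text = line.rstrip("\n").split("\t")
            parts = meta.split()
            label, start, end = parts[0], int(parts[1]), int(parts[-1])
            if label in ENTITY_LABELS:
                entities.append((label, start, end, text))
    return entities
```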
For subtask 2, the event trigger words (such as "addition" and "stirring") were annotated and further divided into the labels REACTION_STEP and WORKUP. Their relations with the different semantic roles were also annotated. Following the semantic proposition definition, the Arg1 type was used to mark the relation between an event trigger word and a compound, while ArgM represented the auxiliary role of the event and was used to mark the relation between the trigger word and a temperature, time or yield entity.

Information extraction

Pre-processing: In this step, patent text is segmented into sentences by sentence boundary detection. Tokens are also identified and separated by a tokenization tool based on lexicons and regular expressions. The sentence segmentation and tokenization modules of the CLAMP software[9] were applied in this study.

Pre-training a language model on patents: Diverse expressions of chemical reaction information in free text make it very sparse to represent and model.[10] Semantic distributed representations (i.e., multi-dimensional vectors of float values) of text generated by deep neural networks, or deep learning methods, alleviate the sparseness challenge by dramatically reducing the dimensions of language representation vectors in a non-linear space.[10] Specifically, language models pre-trained on large-scale unlabeled datasets embed linguistic and domain knowledge that can be transferred to downstream tasks, such as NER and relation extraction.[11] In this study, BioBERT,[12] a pre-trained biomedical language model (a bidirectional encoder representation for biomedical text), was used as the basis for training a language model of patents. Building on BERT,[13] a language model pre-trained on large-scale open text, BioBERT was further refined using the biomedical literature in PubMed and PMC. Consequently, BioBERT outperforms BERT on a series of benchmark tasks for biomedical NER and relation extraction.[12] For this study, BioBERT was retrained using the text files provided by CLEF 2020 – ChEMU to tailor the language model to patent data. For convenience, this pre-trained language model is referred to as Patent_BioBERT.

Subtask 1 - Named entity recognition: Semantic roles in chemical reactions are recognized using a hybrid method. First, Patent_BioBERT was fine-tuned using the Bi-LSTM-CRF (bidirectional Long Short-Term Memory Conditional Random Field) algorithm. Next, several pattern-based rules were designed based on manual observation of the training and development datasets and applied in a post-processing step. For example, rules were defined to differentiate STARTING_MATERIAL and OTHER_COMPOUND based on the relative positions and total number of EXAMPLE_LABEL mentions. Specifically, the essential logic for deciding whether the chemical mentions at the beginning of a text are STARTING_MATERIAL or OTHER_COMPOUND is to determine whether the narrative has a hierarchical structure describing multiple steps of chemical reactions. If multiple EXAMPLE_LABEL mentions are present in the text, chemical mentions at the beginning of the entire text usually denote the final chemical to be produced and should be labeled as OTHER_COMPOUND, while chemical mentions at the beginning of each later example label are STARTING_MATERIAL for that sub-step of the chemical reaction.
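The first half of this rule (relabeling document-initial compound mentions when multiple example labels are present) can be summarized in the simplified sketch below; the input representation and function name are illustrative rather than the exact implementation used in our system.

```python
# Hedged sketch of the post-processing rule described above: when a document
# contains multiple EXAMPLE_LABEL mentions, compounds mentioned before the
# first example label describe the final product of the multi-step reaction
# and are relabeled OTHER_COMPOUND.
from typing import List, Tuple

def relabel_leading_compounds(entities: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """entities: (label, start_offset) pairs sorted by start_offset."""
    example_positions = [pos for label, pos in entities if label == "EXAMPLE_LABEL"]
    if len(example_positions) <= 1:
        return entities  # single-step narrative: keep the model predictions as-is
    first_example = example_positions[0]
    relabeled = []
    for label, pos in entities:
        if label in {"STARTING_MATERIAL", "OTHER_COMPOUND"} and pos < first_example:
            # Compounds before the first example label head the whole document.
            label = "OTHER_COMPOUND"
        relabeled.append((label, pos))
    return relabeled
```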
Subtask 2 - Event extraction: This subtask contains two steps. The event trigger detection step is also a NER task and was addressed with a similar approach as in subtask 1. The relation extraction step, given the entities annotated in sentences, can be transformed into a classification problem: a classifier is built to determine the categories of all possible candidate relation pairs (e1, e2), where entities e1 and e2 come from the same sentence. We generated candidate pairs by pairing each event trigger with each semantic role. To represent a candidate event trigger and semantic role pair in an input sentence, we used the semantic type of an entity to replace the entity itself; the entity mentions are thus generalized by their semantic types in the sentences. A linear classification layer was added on top of the Patent_BioBERT model to predict the label of a candidate pair in its sentential context. As mentioned above, Patent_BioBERT was built on the basis of BERT, which adds a classification token [CLS] at the beginning of a sentence input whose output vector is used for classification. As is typical with BERT, we used the [CLS] vector as input to the linear layer for classification, and a softmax layer was then added to output labels for the sentence. Furthermore, some event triggers and their linked semantic roles occurred in different sentences, or in different clauses of long complex sentences, and their relations were not identified by the deep learning-based model. Therefore, post-processing rules were designed based on patterns observed in the training data and applied to recover some of these false negative relations.
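To illustrate how a candidate (trigger, semantic role) pair is turned into a classification input, the sketch below replaces both entity mentions with their semantic types before the masked sentence is passed to Patent_BioBERT; the @TYPE$ markup is an assumed convention for illustration, not necessarily the exact markers used in our system.

```python
# Hedged sketch: build the classifier input for one candidate (trigger, role)
# pair by replacing each entity mention with its semantic type, as described
# above. The "@TYPE$" markers are an illustrative convention only.
from typing import Tuple

def build_pair_input(sentence: str,
                     trigger: Tuple[int, int, str],
                     role: Tuple[int, int, str]) -> str:
    """trigger/role: (start, end, semantic_type) with offsets into `sentence`."""
    # Replace the later span first so the earlier offsets stay valid.
    spans = sorted([trigger, role], key=lambda s: s[0], reverse=True)
    masked = sentence
    for start, end, sem_type in spans:
        masked = masked[:start] + f"@{sem_type}$" + masked[end:]
    return masked  # fed to Patent_BioBERT; the [CLS] vector goes to a linear + softmax layer

# Example (illustrative offsets):
# sentence = "The mixture was stirred in dichloromethane for 2 h."
# trigger  = (16, 23, "REACTION_STEP")   -> "stirred"
# role     = (27, 42, "SOLVENT")         -> "dichloromethane"
# build_pair_input(sentence, trigger, role)
# == "The mixture was @REACTION_STEP$ in @SOLVENT$ for 2 h."
```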
Subtask 2 - End-to-end: Overall, a typical cascade (pipeline) model was built for the end-to-end system, in which semantic roles and event triggers were first recognized together by a NER model, and their relations were then classified by a relation extraction model.

Evaluation: Precision, recall and F1 were used for performance evaluation, as defined in Equations 1-3. Both exact and relaxed matching results are reported. The primary evaluation metric was the F1 score of exact matching. We used 10-fold cross-validation on the merged training and development datasets to optimize parameters for the models.

precision = true positives / (true positives + false positives)    (1)
recall = true positives / (true positives + false negatives)    (2)
F1 = 2 × precision × recall / (precision + recall)    (3)

10-fold cross-validation was conducted during the training process. The final set of hyperparameters and values used in the study was: dropout_rate 0.2, max_seq_length 310, hidden_dim 128, learning rate 5e-5, batch size 24. Based on this, we implemented three approaches for comparison:

1. Fine-tuning Patent_BioBERT: among the 10 models generated in the 10-fold cross-validation, the model with the highest performance on its held-out fold was selected and used for submission.
2. Ensemble: the outputs of the 10 models generated by the cross-validation were combined by majority voting (a minimal sketch of the voting step is given after this list). Due to time limitations, no complex methods or optimization were applied; a simple majority vote over the outputs of the 10 models was used for the ensemble submission.
3. Merge-data: fine-tuning Patent_BioBERT using the merged training and development datasets.
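For reference, the voting step of the ensemble approach can be sketched as follows, assuming each model outputs a BIO tag sequence over the same tokenization; this is a simplified illustration rather than our exact implementation.

```python
# Hedged sketch of token-level majority voting over the 10 cross-validation
# models: for each token, the label predicted most often across models wins.
from collections import Counter
from typing import List

def majority_vote(model_outputs: List[List[str]]) -> List[str]:
    """model_outputs: one BIO tag sequence per model, all of equal length."""
    n_tokens = len(model_outputs[0])
    voted = []
    for i in range(n_tokens):
        tags_at_i = [tags[i] for tags in model_outputs]
        voted.append(Counter(tags_at_i).most_common(1)[0][0])
    return voted

# Example: two of three models agree on the second token.
# majority_vote([["B-SOLVENT", "O"], ["B-SOLVENT", "B-TIME"], ["B-SOLVENT", "O"]])
# -> ["B-SOLVENT", "O"]
```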
3 Results

Performances on the test dataset for each task, namely NER, event extraction and end-to-end information extraction, are shown in Tables 1-3, respectively. Promising results were obtained with the current approaches, with highest F1 scores of 0.957 for NER, 0.9536 for event extraction and 0.9174 for the end-to-end system, respectively. Moreover, the detailed performances of the fine-tuning method for each entity type and relation type, together with the overall performances on the development set, are reported in Tables 4 and 5. The overall F1 is 0.942 for NER and 0.953 for relation extraction.

However, the ensemble systems and the systems built on the merged training and development data yielded lower performances than the system fine-tuned on a 90%-10% split of the gold standard data. In particular, the Merge-data systems obtained the lowest performances among the three approaches. One potential reason is that the hyperparameters used in the fine-tuning model were not optimal for the merged data. Due to time limitations, only majority voting was applied in the ensemble systems; more investigation in this direction is needed in our future work.

Unfortunately, sharp drops were observed in the end-to-end performances of two of the proposed methods, Ensemble and Merge-data (Table 3). After checking the workflow, we found that unexpected errors had occurred in these two runs in the post-processing stage: the WORKUP lexicon was mistakenly used to boost the recall of WORKUP by recovering missing mentions. Since WORKUP and REACTION_STEP share many words, a large part of the REACTION_STEP mentions were also replaced by WORKUP. Among the three runs, the pipelines of Ensemble and Merge-data mislabeled WORKUP and obtained poor performances (F-measures dropped from above 90% for the basic fine-tuning method to around 20%). Luckily, this mistake was detected and fixed in our later submissions for relation extraction.

Table 1. Performances of named entity extraction for chemical reaction. Both exact and relaxed matching results are reported.

Method        Exact P   Exact R   Exact F1   Relax P   Relax R   Relax F1
Fine-tuning   0.9571    0.957     0.957      0.969     0.9687    0.9688
Ensemble      0.9587    0.9529    0.9558     0.9697    0.9637    0.9667
Merge-data    0.9572    0.951     0.9541     0.9688    0.9624    0.9656

Interestingly, the performances under the exact and relaxed matching criteria did not differ sharply, indicating that few boundary errors occurred in the NER step. This validates that the preprocessing modules in CLAMP could efficiently segment sentences and split tokens.

4 Discussion

Novel compound discovery is vital in the chemical and pharmaceutical industry, and chemical reaction information is essential to a rigorous understanding of compounds for further research and applications.

Table 2. Performances of event extraction for chemical reaction. Both exact and relaxed matching results are reported.

Method        Exact P   Exact R   Exact F1   Relax P   Relax R   Relax F1
Fine-tuning   0.9568    0.9504    0.9536     0.958     0.9516    0.9548
Ensemble      0.9619    0.9402    0.9509     0.9632    0.9414    0.9522
Merge-data    0.9522    0.9437    0.9479     0.9534    0.9449    0.9491

Table 3. Performances of end-to-end systems for chemical reaction extraction. Both exact and relaxed matching results are reported.

Method        Exact P   Exact R   Exact F1   Relax P   Relax R   Relax F1
Fine-tuning   0.9201    0.9147    0.9174     0.9319    0.9261    0.9290
Ensemble      0.2394    0.2647    0.2514     0.2429    0.2687    0.2552
Merge-data    0.2383    0.2642    0.2506     0.2421    0.2684    0.2545
Our participation in the CLEF 2020 – ChEMU task answers the urgent call for high-quality information extraction tools for chemical reaction information in patents. Evaluation on the open test dataset demonstrated that the proposed hybrid approaches are promising, achieving top ranks in the two subtasks. Valuable lessons were also learned in the process.

A detailed error analysis was conducted to guide future system improvement. One major type of error was confusion between REAGENT_CATALYST and STARTING_MATERIAL, or between REAGENT_CATALYST and SOLVENT; the information structures in sentences and their context were not sufficient to differentiate these semantic types. Another major error was related to event trigger recognition: many false positive event triggers were recognized, and REACTION_STEP and WORKUP were often confused with each other, especially for words that frequently occur in different contexts (e.g., "added", "stirring"). Failing to recognize named entities correctly also affected the subsequent relation extraction step. As for relation extraction, the majority of errors were caused by long-distance relations within or across sentences. Although rules were applied to fix such errors, they also introduced false positive instances. The precision and recall of each rule were examined carefully and balanced; only rules that improved the performance with high confidence were kept in the system.

Table 4. Performances of named entity extraction for chemical reaction on the development set. Performances of the fine-tuning method on each entity type, together with the overall performance, are reported (exact matching).

Entity type          Precision   Recall   F1
EXAMPLE_LABEL        0.979       0.986    0.982
REACTION_PRODUCT     0.899       0.904    0.902
REACTION_STEP        0.952       0.944    0.948
STARTING_MATERIAL    0.896       0.926    0.911
YIELD_OTHER          0.99        0.965    0.977
YIELD_PERCENT        0.972       1        0.986
REAGENT_CATALYST     0.938       0.905    0.921
SOLVENT              0.963       0.93     0.946
TEMPERATURE          0.935       0.96     0.947
WORKUP               0.931       0.93     0.931
OTHER_COMPOUND       0.947       0.939    0.943
TIME                 0.983       0.991    0.987
Overall              0.943       0.941    0.942

The motivation behind the three implemented methods was to examine whether there is room for performance improvement when majority voting or a larger training dataset is used. The three methods shared the same set of hyperparameters, which, based on our current interpretation, hurt the final performances: the majority-voting and Merge-data methods did not yield better performances as originally expected. More investigation is needed for these two methods, with an additional validation set for fine-tuning. Still, the sensitivity of deep learning models to hyperparameters is a long-standing problem that needs even more effort to be alleviated.

Table 5. Performances of chemical reaction extraction on the development set. Performances of the fine-tuning method on each relation type, together with the overall performance, are reported (exact matching).

Relation type                              Precision   Recall   F1
ARG1|REACTION_STEP|OTHER_COMPOUND          0.733       0.805    0.767
ARG1|REACTION_STEP|REACTION_PRODUCT        0.985       0.948    0.966
ARG1|REACTION_STEP|REAGENT_CATALYST        0.979       0.965    0.972
ARG1|REACTION_STEP|SOLVENT                 0.975       0.9522   0.968
ARG1|REACTION_STEP|STARTING_MATERIAL       0.957       0.916    0.936
ARG1|WORKUP|OTHER_COMPOUND                 0.965       0.961    0.963
ARG1|WORKUP|REACTION_PRODUCT               0           0        0
ARG1|WORKUP|SOLVENT                        0.2         1        0.333
ARG1|WORKUP|STARTING_MATERIAL              0           0        0
ARGM|REACTION_STEP|TEMPERATURE             0.957       0.928    0.942
ARGM|REACTION_STEP|TIME                    0.978       0.926    0.952
ARGM|REACTION_STEP|YIELD_OTHER             0.984       0.942    0.962
ARGM|REACTION_STEP|YIELD_PERCENT           0.982       0.943    0.962
ARGM|WORKUP|TEMPERATURE                    0.893       0.909    0.901
ARGM|WORKUP|TIME                           0.7         1        0.824
ARGM|WORKUP|YIELD_OTHER                    0           0        0
ARGM|WORKUP|YIELD_PERCENT                  0           0        0
Overall                                    0.963       0.944    0.953

Comparisons of the performances with and without post-processing rules showed that the applied rules contributed only slight improvements to the overall performances on the development set (NER: 0.9389 vs. 0.9421; relation extraction: 0.9526 vs. 0.9534), although careful data analysis was conducted to find potential improvements from heuristics. This demonstrates the generalization power of the pre-trained language model, and also indicates that more investigation is needed into heuristics and knowledge-based improvements.

Limitations and future work: Although the proposed approaches obtained promising performances for chemical reaction extraction, there are several limitations and possible improvements for the next steps. (1) First, domain knowledge about the different semantic roles and their relations, such as lexicons of REAGENT_CATALYST and SOLVENT, was not leveraged in the current study; such knowledge may help resolve the confusion among different semantic labels. (2) Second, dependency syntactic information, such as conjunctive structures and head-dependent patterns, was not used in the current approaches. Such information has proved effective for relation extraction and will be integrated into the deep learning models to further improve performance. (3) The current rules for fixing errors in event triggers are data-driven and appear ad hoc given the limited gold standard dataset. Data augmentation approaches[14] will be applied in the next step to enrich the training data and the coverage of different context patterns, so as to make a clearer differentiation among event triggers.

5 Conclusions

This work describes the participation of the Melaxtech team in the CLEF 2020 – ChEMU Task of Chemical Reaction Extraction from Patent. We developed hybrid approaches combining both deep learning models and pattern-based rules for this task. Our approaches achieved state-of-the-art results in both subtasks, indicating that the proposed approaches are promising. Further improvements will be pursued in the near future by integrating domain knowledge and syntactic features into the current framework. Data augmentation will also be investigated for annotation enrichment in a cost-saving way, to further improve the system's generalizability.
References

1. Akhondi, S. A., Rey, H., Schwörer, M., Maier, M., Toomey, J., Nau, H., Bobach, C. (2019). Automatic identification of relevant chemical compounds from patents. Database, 2019.
2. Akhondi, S. A., Klenner, A. G., Tyrchan, C., Manchala, A. K., Boppana, K., Lowe, D., Kors, J. A. (2014). Annotated chemical patent corpus: a gold standard for text mining. PloS one, 9(9), e107477.
3. Senger, S., Bartek, L., Papadatos, G., & Gaulton, A. (2015). Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. Journal of Cheminformatics, 7(1), 1–12.
4. Muresan, S., Petrov, P., Southan, C., Kjellberg, M. J., Kogej, T., Tyrchan, C., Xie, P. H. (2011). Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data. Drug Discovery Today, 16(23–24), 1019–1030.
5. Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., & Valencia, A. (2015). CHEMDNER: The drugs and chemical names extraction challenge. Journal of Cheminformatics, 7(S1), S1.
6. Krallinger, M., Rabal, O., Lourenço, A., Perez, M. P., Rodriguez, G. P., Vazquez, M., Valencia, A. (2015). Overview of the CHEMDNER patents task. In Proceedings of the fifth BioCreative challenge evaluation workshop (pp. 63–75).
7. Nguyen, D. Q., Zhai, Z., Yoshikawa, H., Fang, B., Druckenbrodt, C., Thorne, C., Verspoor, K. (2020). ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, & F. Martins (Eds.), Advances in Information Retrieval (pp. 572–579). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-45442-5_74
8. He, J., Nguyen, D. Q., Akhondi, S. A., Druckenbrodt, C., Thorne, C., Hoessel, R., Verspoor, K. (2020). Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020) (Vol. 12260). Lecture Notes in Computer Science.
9. Soysal, E., Wang, J., Jiang, M., Wu, Y., Pakhomov, S., Liu, H., & Xu, H. (2018). CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. Journal of the American Medical Informatics Association, 25(3), 331–336. https://doi.org/10.1093/jamia/ocx132
10. Camacho-Collados, J., & Pilehvar, M. T. (2018). From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63, 743–788.
11. Liu, F., Chen, J., Jagannatha, A., & Yu, H. (2016). Learning for biomedical information extraction: Methodological review of recent advances. arXiv preprint arXiv:1606.07993.
12. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
13. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
14. Xu, Y., Jia, R., Mou, L., Li, G., Chen, Y., Lu, Y., & Jin, Z. (2016). Improved relation classification by deep recurrent neural networks with data augmentation. arXiv preprint arXiv:1601.03651.