    Named entity recognition in chemical patents
    using ensemble of contextual language models

    Jenny Copara1,2,3 , Nona Naderi1,2 , Julien Knafou1,2,3 , Patrick Ruch1,2 , and
                               Douglas Teodoro1,2
       1
           University of Applied Sciences and Arts of Western Switzerland, Geneva,
                                         Switzerland
                  2
                    Swiss Institute of Bioinformatics, Geneva, Switzerland
                        3
                          University of Geneva, Geneva, Switzerland
                             {firstname.lastname}@hesge.ch



           Abstract. Chemical patent documents describe a broad range of appli-
           cations holding key reaction and compound information, such as chemical
           structure, reaction formulas, and molecular properties. These informa-
           tional entities should be first identified in text passages to be utilized
           in downstream tasks. Text mining provides means to extract relevant
           information from chemical patents through information extraction tech-
           niques. As part of the Information Extraction task of the Cheminformat-
           ics Elsevier Melbourne University challenge, in this work we study the
           effectiveness of contextualized language models to extract reaction infor-
           mation in chemical patents. We assess transformer architectures trained
           on generic and specialised corpora to propose a new ensemble model.
           Our best model, based on a majority ensemble approach, achieves an
           exact F1 -score of 92.30% and a relaxed F1 -score of 96.24%. The results
           show that an ensemble of contextualized language models can provide an
           effective method to extract information from chemical patents.

           Keywords: Named-entity recognition, chemical patents, contextual lan-
           guage models, patent text mining, information extraction.


1      Introduction
Chemical patents represent a valuable information resource in downstream inno-
vation applications, such as drug discovery and novelty checking. However, the
discovery of chemical compounds described in patents is delayed by a few years
[12]. Among the reasons are the complexity of chemical patent information
sources [11], the recent increase in the number of chemical patents without
manual curation, and the particular wording used in the domain. Narratives in
chemical patents often contain concepts expressed in a way to protect or hide
information, as opposed to scientific literature, for example,
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
    License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020,
    Thessaloniki, Greece.
where the text tends to be as clear as possible [34]. In this landscape, information
extraction methods, such as Named Entity Recognition (NER), provide a suitable
solution to identify key information in patents.
    NER aims to identify information of interest and their respective instances
in a document [8, 24]. It has been often addressed as a sequence classification
task, where a sequence of features, usually tokens, is used to predict the class
of a text passage. One of the most successful approaches in sequence classifica-
tion is Conditional Random Fields (CRF) [18, 32]. CRF was proposed to solve
sequence classification problems by estimating the conditional probability of a
label sequence given a word sequence, considering a set of observed features in
the latter. It was established as the state-of-the-art in different NER domains
for many years [19, 29, 20, 28, 9, 11, 37]. In the chemical patent domain, CRF was
explored by Zhang et al. [39] in the CHEMDNER patent corpus [17]. Using a
set of hand-crafted and unsupervised features derived from word embeddings
and Brown clustering, their model achieved an F1-score of 87.22%. With similar
F1-score performance, Akhondi et al. [2] explored CRF combined with biomedical-domain
dictionaries in the tmChem tool [20] in order to select the best vocabulary for the
CHEMDNER patent corpus. It has been shown [11] that recognizing chemical entities in
the full patent text is a harder task than in titles and abstracts, due to the
peculiarities of the chemical patent text. Evaluation in
full patents was performed using the BioSemantics patent corpus [1] through neural
approaches based on the Bidirectional Long Short-Term Memory (BiLSTM) CRF [10] and
the BiLSTM Convolutional Neural Network (CNN) CRF [38] architectures, with F1-score
performances of 82.01% and 85.68%, respectively. It
is worth noting that for the first architecture [10], the authors used word2vec
embeddings [23] to represent features, while in the latter [38], the authors used
ELMo contextualized embeddings [26].
    Over the years, neural language models have improved their ability to encode
the semantics of words using large amounts of unlabeled text for self-supervised
training. They have initially evolved from a straightforward model [3] of one
hidden layer that predicts the next word in a sequence, aiming to learn the
distributed representation of words (i.e., the word embedding vector), to an im-
proved objective function that allows learning from larger amounts of text [4], us-
ing higher computational resources and longer training time. These developments
encouraged the search for language models able to provide high-quality word
embeddings at lower computational cost (e.g., word2vec [23] and Global Vectors
(GloVe) [25]). However, natural language still presented challenges for these
models, in particular concerning word contexts and homonyms. More recently, a
second type of word embedding has attracted attention in the literature, the
so-called contextualized embeddings, such as ELMo, ULMFiT [14],
GPT-2 [27], and BERT [7]. Particularly, the BERT architecture uses the atten-
tion mechanism to train deep bidirectional token representations, conditioning
tokens on their left and right contexts.
  In this work, we explore contextualized language models to extract infor-
mation in chemical patents as part of the Named Entity Recognition task of
the Information extraction from Chemical Patents (ChEMU) lab [12, 13]. Pretrained
contextualized language models, based on the BERT architecture, are used as base
models and fine-tuned on the examples of the ChEMU NER task to classify tokens
according to the different entities. In the chal-
lenge, the corpus was annotated with the entities: example label, other compound,
reaction product, reagent catalyst, solvent, starting material, temperature, time,
yield other, and yield percent. We investigate the combination of different archi-
tectures to improve NER performance. In the following sections, we describe the
design and results of our experiments.


2     Methods and data

2.1    NER model

Transformers with a token classification layer on top. We assess five language
models based on the transformers architecture to classify tokens according to
the named-entities classes. The first four models are variations of the BERT
model in terms of size and tokenization: bert-base-cased, bert-base-uncased, bert-
large-cased, and bert-large-uncased. These models were originally pretrained on
a large corpus of English text extracted from BookCorpus [40] and Wikipedia,
with a different number of attention heads for the base and large variants (12 and
16, respectively). The fifth pretrained language model assessed is ChemBERTa1 ,
a RoBERTa-based transformer architecture [22], trained on a corpus of 100k
Simplified Molecular Input Line Entry System (SMILES) [35] strings from the
ZINC benchmark dataset [15].
    Our models consist of BERT models specialised for NER, with a fully con-
nected layer on top of the hidden states of each token. They are fine-tuned on
the ChEMU Task 1 dataset, using the train and development sets provided. The
fine-tuning is performed with a sequence length of 256 tokens, a warmup proportion
of 0.1 (the fraction of warmup steps with respect to the total number of steps),
and a batch size of 32. The tokenization process is driven by the original
model’s tokenizer, i.e., for the BERT-based models, WordPiece [36] is applied,
while for the RoBERTa-based model, Byte-Pair-Encoding [30] is applied. The
Adam optimizer is employed to optimize the network weights [16]. The first four
language models are fine-tuned for 10 epochs with a learning rate of 3e-5. For the
ChemBERTa model, we conducted a grid search over the development set and found the
best performance at around 29 epochs of fine-tuning and a learning rate of 4e-5.
The implementations are based on the Huggingface framework.2
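As an illustration, a minimal fine-tuning loop in the style described above could be sketched with the Huggingface transformers library as follows; the train_dataset object and the label list are placeholders (the ChEMU entities are assumed to be encoded with a B/I/O scheme), and the hyper-parameter values are those reported above.

    import torch
    from torch.utils.data import DataLoader
    from transformers import (BertTokenizerFast, BertForTokenClassification,
                              get_linear_schedule_with_warmup)

    # Hypothetical B/I/O label set for the ChEMU entities (truncated for brevity).
    LABELS = ["O", "B-REACTION_PRODUCT", "I-REACTION_PRODUCT", "B-SOLVENT", "I-SOLVENT"]

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                       num_labels=len(LABELS))

    # train_dataset is assumed to yield dicts with input_ids, attention_mask and
    # token-level labels, truncated or padded to 256 tokens.
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    epochs, lr = 10, 3e-5
    total_steps = epochs * len(loader)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps)

    model.train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            outputs = model(input_ids=batch["input_ids"],
                            attention_mask=batch["attention_mask"],
                            labels=batch["labels"])
            outputs[0].backward()   # first output is the token-classification loss
            optimizer.step()
            scheduler.step()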


Ensemble model. Our ensemble method is based on a voting strategy, where
each model votes with its predictions and a simple majority of votes is necessary
to assign the predictions [5]. In other words, for a given document, our models
1
    https://github.com/seyonechithrananda/bert-loves-chemistry
2
    https://huggingface.co/transformers/
infer their predictions independently for each entity; the set of passages that
received at least one vote is then taken into consideration for casting votes. This
means that, for a given document and a given entity, we end up with multiple
passages, each associated with a number of votes; then, again for a given entity,
the ensemble method predicts as positive all the passages that receive a majority
of the votes. Note that each entity is predicted independently and that the voting
strategy allows a passage to be labeled as positive for multiple entities at once.
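A minimal sketch of this voting scheme is given below; predicted annotations are assumed to be represented as (document id, entity type, start offset, end offset) tuples, and each model contributes one set of predicted spans.

    from collections import Counter

    def majority_vote(predictions_per_model):
        # predictions_per_model: one set of (doc_id, entity_type, start, end)
        # tuples per model. A span is kept when it is predicted by a strict
        # majority of the models; since the entity type is part of the tuple,
        # each entity type is effectively decided independently.
        votes = Counter()
        for spans in predictions_per_model:
            votes.update(spans)
        threshold = len(predictions_per_model) / 2
        return {span for span, count in votes.items() if count > threshold}

With the three retained models described in the next paragraph, a span therefore needs at least two votes to be emitted by the ensemble.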
    Finally, in order to decide on the optimal composition of the ensemble model,
we used the development set and computed all possible ensemble predictions using
the above methodology. As we had 7 models in total, we tried every possible
combination of 2 to 7 models. We retained the ensemble composition with the best
overall F1-score and used it for the test set. Originally, the ensemble model
giving the best F1-score combined bert-large-uncased, bert-base-cased, CRF,
bert-base-uncased, and the CNN model (5 models). However, due to the size of the
test set (approximately 10k patent snippets), we had to discard the large models
from the ensemble due to their much higher computational cost and the time
constraints. The retained models in the ensemble were then bert-base-cased,
bert-base-uncased, and the CNN model.
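The exhaustive search over ensemble compositions can be expressed in a few lines, as in the sketch below, which reuses the majority_vote function above; score_on_dev is a placeholder for the BRAT-style F1 evaluation on the development set.

    from itertools import combinations

    def best_ensemble(model_predictions, score_on_dev):
        # model_predictions: dict mapping a model name to its set of predicted
        # spans on the development set. Every subset of 2 to 7 models is tried
        # and the composition with the highest overall F1-score is returned.
        best, best_f1 = None, -1.0
        names = list(model_predictions)
        for k in range(2, len(names) + 1):
            for subset in combinations(names, k):
                spans = majority_vote([model_predictions[n] for n in subset])
                f1 = score_on_dev(spans)
                if f1 > best_f1:
                    best, best_f1 = subset, f1
        return best, best_f1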



Baseline. We consider two models for our baseline: CRF and CNN. For the CRF
model, a set of standard features in a window of ±2 tokens is created, without
taking into account part-of-speech tags or gazetteers. The features used are the
token itself, the lower-cased word, the capitalization pattern, the type of token
(i.e., digit, symbol, word), 1-4 character prefixes/suffixes, the digit size (i.e.,
size 2 or 4), combinations of values (digit with alphanumeric, hyphen, comma,
period), and binary features for upper/lower-cased letters, alphabetic/digit
characters, and symbols.
Please refer to [6, 9] for further details on the features used. The CRF classifier
implementation relies on the CRFSuite.3
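As an illustration of this feature design, a reduced token-feature function in the dictionary format expected by common CRFSuite wrappers (e.g., sklearn-crfsuite) could look as follows; only a subset of the features listed above is shown.

    def token_features(tokens, i):
        # Features for tokens[i] with a +/-2 token window (illustrative subset).
        feats = {}
        for offset in range(-2, 3):
            j = i + offset
            if 0 <= j < len(tokens):
                tok = tokens[j]
                p = f"w[{offset}]"
                feats[f"{p}.token"] = tok
                feats[f"{p}.lower"] = tok.lower()
                feats[f"{p}.is_upper"] = tok.isupper()
                feats[f"{p}.is_title"] = tok.istitle()
                feats[f"{p}.is_digit"] = tok.isdigit()
                feats[f"{p}.prefix3"] = tok[:3]
                feats[f"{p}.suffix3"] = tok[-3:]
                feats[f"{p}.has_hyphen"] = "-" in tok
        return feats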
     The CNN model [21] for NER relies on incremental parsing with Bloom
embeddings, a compression technique for neural network models dealing with
sparse high-dimensional binary-coded instances [31]. The convolutional layers
use residual connections, layer normalization and a maxout non-linearity. The input
sequence is embedded into a vector composed of Bloom embeddings modeling the
characters, prefix, suffix, and part-of-speech of each word. One-dimensional
convolutional filters are then applied over the text to predict how the next words
will change. Our implementation relies on the spaCy NER module,4 using the
pretrained transformer bert-base-uncased for 30 epochs and a batch size of 4.
During the test phase, we fixed the maximum text size to 1.5M due to memory
limitations.
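For reference, a minimal spaCy-style NER training loop is sketched below (a spaCy 2.x-style API is assumed); the bert-base-uncased initialization via spacy-transformers used in our runs and other configuration details are omitted, and TRAIN_DATA is a placeholder list of (text, {"entities": [(start, end, label), ...]}) pairs.

    import random
    import spacy
    from spacy.util import minibatch

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for _, annots in TRAIN_DATA:
        for _, _, label in annots["entities"]:
            ner.add_label(label)

    optimizer = nlp.begin_training()
    for epoch in range(30):              # 30 epochs, batch size 4, as reported above
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=4):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)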

3
    http://www.chokkan.org/software/crfsuite/
4
    https://spacy.io
2.2    Data
The data in ChEMU Task 1 (NER) is provided as snippets sampled from 170
English patents from the European Patent Office and the United States Patent
and Trademark Office [12, 13]. Gold annotations were provided for training (900
snippets) and development (250 snippets) sets, for a total of 20,186 entities. The
annotation was done in the BRAT standoff format. Fig. 1 shows an example of
a snippet with annotations for several entities, including reaction product (two
annotations), starting material and temperature.




         Fig. 1. Data example with annotations for the ChEMU NER task.
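For readers unfamiliar with the format, a constructed BRAT standoff (.ann) fragment for the illustrative text "stirred at 80 °C for 2 h" would look as follows: each line carries an entity identifier, the entity type with character offsets, and the covered text (fields are tab-separated in the actual files; the type spellings shown here are illustrative, not taken from the corpus).

    T1    TEMPERATURE 11 16    80 °C
    T2    TIME 21 24    2 h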


    During the development phase, we used the official development set as our
test set. The official training set was split into train and development sets in
order to train the weights and tune the hyperparameters of our models, respectively.
As a result of this new setting, 800 snippets were available in the train set, 100
in the development set, and 225 in the test set. Table 1 shows the entity
distribution during the development phase. The majority of the annotations come
from other compound, reaction product and starting material, covering 52% of the
entities in the development phase. In contrast, the example label, time and yield
percent entities represent 17% of the entities in the development phase.

2.3    Evaluation metrics
The metrics used to evaluate the models are precision, recall, and F1 -score. As
can be seen in the example of Fig. 1, each entity has a span that is expected to
be identified by the NER models, as well as the correct entity type. The evaluation
for the challenge is established under exact and relaxed span matching conditions
[12, 13]. The exact matching condition requires the correct identification of both
the span and the entity type. On the other hand, the relaxed matching condition
evaluates how accurately the predicted span matches the real one. Our models are
evaluated with the ChEMU web page system for the official results5 and with the
BRAT Eval tool for the offline analyses6 .
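As a simplified illustration of the two matching conditions (the official scoring is performed with the tools cited below), exact and relaxed precision, recall and F1-score can be computed from gold and predicted (type, start, end) spans roughly as follows; the relaxed variant here only requires an offset overlap with the same entity type, which approximates the official definition.

    def overlaps(a, b):
        # Same entity type and overlapping character offsets.
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    def prf(gold, pred, relaxed=False):
        # gold, pred: lists of (type, start, end) tuples (simplified scoring).
        if relaxed:
            tp_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
            tp_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
        else:
            tp_pred = tp_gold = len(set(gold) & set(pred))
        precision = tp_pred / len(pred) if pred else 0.0
        recall = tp_gold / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1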
5
    http://chemu.eng.unimelb.edu.au/
6
    https://bitbucket.org/nicta_biomed/brateval/src/master/
Table 1. Entity distribution in the development phase based on the official training
and development sets. The test set is the official development set. The dev set is a
random subset extracted from the official training set.

                              Train     Dev       Test      All
            Entity
                              (count/%) (count/%) (count/%) (count/%)
            example label     784/5     102/5     218/6     1104/5
            other compound 4095/28 545/29         1080/28 5720/28
            reaction product 1816/13 236/12       506/13    2558/13
            reagent catalyst 1135/8     146/8     289/8     1570/8
            solvent           1001/7    139/7     250/7     1390/7
            starting material 1543/11 211/11      413/11    2167/11
            temperature       1345/9    170/9     346/9     1861/9
            time              928/6     131/7     252/7     1311/6
            yield other       940/7     121/6     261/7     1322/7
            yield percent     848/6     107/6     228/6     1183/6
            All               14435/100 1908/100 3843/100 20186/100



3     Results and discussion

In this section, we present the results of our models in the development and
official test phases. Additionally, we perform error analyses on the results of the
test set used in the development phase for some relevant models.


3.1    Model’s performance in the development phase

Table 2 shows the exact and relaxed overall F1 -scores for all the models explored
by our team in the development phase of the ChEMU NER task. As we can see,
the ensemble model outperforms all the individual models for both exact and
relaxed metrics. On the other hand, despite being trained on a specialised corpus,
ChemBERTa achieves the lowest performance. The reported results come from
the ChEMU official evaluation web page, except for the CNN, bert-large-uncased,
and the ensemble models, whose results were obtained with the BRAT Eval tool.


Table 2. Performance of the different models in the development phase in terms of
F1 -score. *models evaluated using the BRAT Eval tool.

    Metric CRF CNN*          bert-base     bert-large  Chem Ensemble*
                          cased uncased cased uncased* BERTa
    exact 0.8722 0.8182 0.9140 0.9113 0.9079 0.9052     0.6810 0.9285
    relaxed 0.9450 0.8820 0.9732 0.9719 0.9706 0.9910   0.8500 0.9876



   The results of all models with respect to the individual entities are presented
in Table 3. As for the overall results, the ensemble model outperforms the in-
dividual models for all entities apart from time, for which the bert-base-cased
presents the best performance. The highest improvement for the ensemble model
is seen for the reaction product and starting material entities, with an increase of
over 12 points in F1-score. Considering only the individual models, the bert-base
models outperform the other individual models, including the bert-large models, for
all the entities, apart from starting material, for which the CNN model has the
best performance.


 Table 3. Evaluation results on the development set for the exact F1 -score metric.

Entity           CRF CNN           bert-base     bert-large Chem Ensemble
                                cased uncased cased uncased BERTa
example label     0.9630 0.9526 0.9862 0.9817 0.9793 0.9769 0.9631 0.9885
other compound 0.8762 0.7409 0.8953 0.8938 0.8947 0.8925 0.7850    0.9052
reaction product 0.7535 0.8425 0.8586 0.8515 0.8410 0.8427 0.5957  0.8807
reagent catalyst 0.8330 0.8557 0.8595 0.8355 0.8498 0.8468 0.4673  0.8946
solvent           0.8949 0.7517 0.9447 0.9451 0.9407 0.9426 0.5945 0.9545
starting material 0.7253 0.8229 0.8072 0.8153 0.7995 0.7813 0.4405 0.8470
temperature       0.9796 0.6397 0.9842 0.9842 0.9827 0.9841 0.8105 0.9855
time              0.9900 0.8533 1.0000 0.9941 0.9941 0.9941 0.8141 0.9980
yield other       0.9046 0.9448 0.9905 0.9924 0.9811 0.9848 0.7135 0.9943
yield percent     0.9913 0.9693 0.9978 0.9978 0.9913 0.9892 0.7131 0.9978



     The ensemble model achieves the best performance for the time, yield other
and yield percent entities. We believe this is due to the patterns observed for
them in the training and test data. For example, for the yield percent entity, the
pattern is mostly a number followed by the percentage symbol (‘%’). Similarly,
for the time entity, the instances usually appear as a number followed by a
time-indicator word. On the other hand, the reaction product, reagent catalyst
and starting material entities show the lowest performance, with 88.07%, 89.46%
and 84.70% of F1 -score, respectively. These entities are of chemical types, often
molecule strings (e.g., 4-(6-Bromo-3-methoxypyridin-2-yl)-6-chloropyrimidin-2-
amine) [12, 13]. As our models did not include a post-processing step, as proposed
in [33], these entities were sometimes recognized partially as a result of the
language model sub-word tokenization process.
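To illustrate this sub-word effect, the short sketch below shows how a WordPiece tokenizer fragments a long systematic chemical name into many pieces (the exact pieces depend on the vocabulary of the pretrained model), so that a single mis-labelled piece can truncate the predicted entity span.

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    name = "4-(6-Bromo-3-methoxypyridin-2-yl)-6-chloropyrimidin-2-amine"
    pieces = tokenizer.tokenize(name)
    # The name is split into dozens of sub-word pieces rather than kept as one token.
    print(len(pieces), pieces[:10])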
     During the development phase, we also investigate the performance of Chem-
BERTa. As ChemBERTa is a language model trained on the chemical domain,
it is expected to achieve competitive results. However, for the NER downstream
task in chemical patents, the results go in a different direction. As shown in
Table 3, ChemBERTa obtains the lowest results among all the explored models
for both exact and relaxed metrics. We believe that the size of the corpus used
to train the other explored language models has led to better chemical entity
representations. Additionally, as the task aims to identify entities other than
molecules, the ChemBERTa model naturally fails, as its training set is based only
on SMILES strings.
3.2   Model’s performance in the test phase

In the official test phase, 9,999 files containing snippets from chemical patents
were available for evaluating the models. We submitted 3 official runs: run 1,
based on the baseline CRF model; run 2, based on the bert-base-cased model;
and run 3, based on the ensemble model. Table 4 shows the official performance of
our models for the exact and relaxed span matching metrics in terms of F1 -score.
The ensemble model achieves an exact F1-score of 92.30%, yielding a more than
11-point improvement over our baseline and at least a 1-point improvement over the
best individual contextualized language model (bert-base-cased). It outperforms
run 1 and run 2 for all the entities in both exact and relaxed metrics. We believe
that the performance difference between the CRF model and the ensemble model
is due mostly to the fact that language models based on attention mechanisms
are able to provide better contextual feature representations without requiring the
specific design of hand-crafted features, as in the case of CRF.


Table 4. Official performance of our models in terms of F1 -score for the exact and
relaxed metrics.

                               CRF      bert-base-cased Ensemble
        Entity
                          exact relaxed exact relaxed exact relaxed
        example label     0.9190 0.9367 0.9617 0.9730 0.9669 0.9784
        other compound 0.8310 0.9029 0.8780 0.9608 0.8920 0.9653
        reaction product 0.6462 0.7689 0.8593 0.9378 0.8766 0.9322
        reagent catalyst 0.7598 0.8035 0.8791 0.9082 0.9022 0.9176
        solvent           0.8299 0.8323 0.9444 0.9491 0.9541 0.9541
        starting material 0.4957 0.6752 0.8413 0.9343 0.8701 0.9394
        temperature       0.9499 0.9688 0.9692 0.9902 0.9729 0.9877
        time              0.9698 0.9843 0.9868 0.9967 0.9879 0.9978
        yield other       0.8984 0.8984 0.9799 0.9821 0.9842 0.9865
        yield percent     0.9705 0.9807 0.9936 0.9962 0.9974 0.9974
        ALL               0.8056 0.8683 0.9098 0.9596 0.9230 0.9624



    The top-5 best performing entities identified by our models are example label,
temperature, time, yield other, and yield percent, similar to the results found
in the development phase. For all of our submissions, the entity with the lowest
performance in the official test phase is starting material, achieving exact
F1-scores of 49.57%, 84.13% and 87.01% with the CRF, bert-base-cased and ensemble models,
respectively. As we will see further in the error analyses section, this entity is
often confused with the reagent catalyst entity in the development phase. From
the chemistry point of view, both starting materials (reactants) and catalysts
(reagents) are present at the start of the reaction, with the difference that the
latter are not consumed by the reaction. These terms are often used interchangeably
though, which could be the reason for the confusion. Despite the much larger size
of the test set (approximately 10 times the size of the training set), these
results suggest that the test set has an entity distribution similar to that of the
dataset provided in the development phase.
    Table 5 shows a summary of the top ten official results, including our runs 2
and 3 (BiTeM team, ranked 6 and 7), the best model, and the challenge baseline. If
we consider the exact F1-score metric, our ensemble model shows at least a 3-point
improvement over the ChEMU Task 1 NER baseline and falls more than 3 points behind
the top-ranked system. For the relaxed metric, our best model performs comparatively
better, showing a more than 5-point improvement over the baseline and less than
1 point below the top system.

Table 5. Official BiTeM results compared to the best model and the BANNER base-
line.

                                 Precision      Recall       F1 -score
    Rank Team
                              exact relaxed exact relaxed exact relaxed
       1    Melaxtech         0.9571 0.9690 0.9570 0.9687 0.9570 0.9688
       6    BiTeM (run 3)     0.9378 0.9692 0.9087 0.9558 0.9230 0.9624
       7    BiTeM (run 2)     0.9083 0.9510 0.9114 0.9684 0.9098 0.9596
      10    Baseline (BANNER) 0.9071 0.9219 0.8723 0.8893 0.8893 0.9053


    The performance of the ensemble model for all entities on the test set, in terms
of precision, recall and F1-score for both exact and relaxed matching, is presented
in Table 6. The best precision and recall for the exact match metric are achieved
for the yield percent entity, both reaching 99.74%. Overall,
precision is always above 93% for the relaxed metric and at least 88% for the
exact metric.

Table 6. Performance of the ensemble model for all entities on the test set in terms of
precision, recall and F1-score.

                                Precision      Recall       F1 -score
           Entity
                             exact relaxed exact relaxed exact relaxed
           example label     0.9711 0.9827 0.9628 0.9742 0.9669 0.9784
           other compound 0.9197 0.9730 0.8659 0.9578 0.8920 0.9653
           reaction product 0.8942 0.9367 0.8596 0.9277 0.8766 0.9322
           reagent catalyst 0.9268 0.9435 0.8790 0.8931 0.9023 0.9176
           solvent           0.9620 0.9620 0.9463 0.9463 0.9541 0.9541
           starting material 0.8886 0.9545 0.8523 0.9247 0.8701 0.9394
           temperature       0.9769 0.9901 0.9690 0.9852 0.9729 0.9876
           time              0.9846 0.9956 0.9912 1.0000 0.9879 0.9978
           yield other       0.9776 0.9798 0.9909 0.9932 0.9842 0.9865
           yield percent     0.9974 0.9974 0.9974 0.9974 0.9974 0.9974



    Lastly, our CRF baseline achieves an exact F1-score of 80.56%, while the
competition baseline, which is also based on CRF but customized for biomedical NER,
taking into account features such as part-of-speech tags, lemmas, Roman numerals,
and names of Greek letters, achieves 88.93% [19]. Indeed, we believe those features
give the competition baseline an advantage, as they could better characterize
chemical entities.

3.3   Error analysis
As the gold annotations for the test set are not available, we perform the error
analysis on the official development set (used as our test set in the develop-
ment phase, see Table 1). Fig. 2 shows the confusion matrix for the ensemble
predictions for the exact metric. As we can see, most confusion occurred for
the starting material entity, which is mostly confused with reagent catalyst, and
for the reaction product entity, which is mistaken for other compound. As men-
tioned previously, these entities - material/reactant and catalyst/reagent, and
product/compound - are often used interchangeably in chemistry passages, which
is likely the reason for the model’s confusion.




Fig. 2. Normalized confusion matrix for the ensemble model predictions on the official
development set. Only exact matches are considered.


    The error analysis of the incorrectly identified spans by the ensemble model
shows that in almost 78.8% of the cases the predicted entity was longer than the
gold annotation, for example, sodium thiosulfate aqueous instead of aqueous, and
concentrated hydrochloric acid instead of hydrochloric acid. The entities that are
only partially detected are mainly starting material, which is inconsistently
annotated, in some cases as Intermediate 13/6/21 (predicted as 13/6/21 by the
ensemble model), and in some cases as only the number, such as 3 (predicted as
Intermediate 3 by the ensemble model). 42.3% of the span errors involved multi-word
entities.
    Fig. 3 shows how different models detected a reagent catalyst entity described
by a long text span. It seems that entities with longer text spans, such as
reagent catalyst, other compound, reaction product, and starting material, are
less likely to be correctly detected by the contextualized language models. The
bert-large-uncased and ChemBERTa models did not detect any token of the entity,
while both the bert-large-cased and bert-base-cased models were able to only
partially detect the entity. Notably, the larger size of the BERT-large models
did not translate into more effective representations for these entities.




Fig. 3. An example of predictions by different models for a reagent catalyst annotation.
The span detected by each model is color-coded.


    Fig. 4 compares the span errors of the ensemble and BERT-base-cased models
based on the length of the entities (in characters). While most errors of both
models are concentrated on shorter entities, the BERT-base-cased model makes more
mistakes than the ensemble model in detecting the spans of the longer entities. We
believe this effect could also be related to the sub-word tokenization process of
transformers. The combination of models smooths this effect in the ensemble model.
Fig. 4. Number of span errors by the ensemble and BERT-base-cased models based on
the length of the entities (in characters).



4   Conclusions

In this task, we explored the use of contextualized language models based on the
transformer architecture to extract information from chemical patents. The
combination of language models resulted in an effective approach, outperforming not
only the baseline CRF model but also the individual transformer models. Our
experiments show that, without extensive pre-training in the chemical patent
domain, the majority-vote approach is able to leverage distinctive features present
in the English language, achieving an exact F1-score of 92.30% in the ChEMU NER
task. It seems that the transformer models are able to take advantage of natural
language contexts in order to capture the most relevant features without
supervision in the chemical domain. Our next step will be to investigate models
pre-trained on large chemical patent corpora to further improve the NER performance.


References

 1. Akhondi, S.A., Klenner, A.G., Tyrchan, C., Manchala, A.K., Boppana, K., Lowe,
    D., Zimmermann, M., Jagarlapudi, S.A.R.P., Sayle, R., Kors, J.A., Muresan, S.:
    Annotated chemical patent corpus: A gold standard for text mining. PLoS ONE
    9(9), e107477 (Sep 2014)
 2. Akhondi, S.A., Pons, E., Afzal, Z., van Haagen, H., Becker, B.F., Hettne, K.M., van
    Mulligen, E.M., Kors, J.A.: Chemical entity recognition in patents by combining
    dictionary-based and statistical approaches. Database 2016 (2016)
 3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language
    model. Journal of machine learning research 3, 1137–1155 (Mar 2003)
 4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
    Natural language processing (almost) from scratch. Journal of machine learning
    research 12, 2493–2537 (Nov 2011)
 5. Copara, J., Knafou, J., Naderi, N., Moro, C., Ruch, P., Teodoro, D.: Contextualized
    French Language Models for Biomedical Named Entity Recognition. In: Cardon,
    R., Grabar, N., Grouin, C., Hamon, T. (eds.) 6e conférence conjointe Journées
    d’Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues
    Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Infor-
    matique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition).
    Atelier DÉfi Fouille de Textes. pp. 36–48. ATALA, Nancy, France (2020)
 6. Copara, J., Ochoa Luna, J.E., Thorne, C., Glavaš, G.: Spanish NER with word rep-
    resentations and conditional Random Fields. In: Proceedings of the Sixth Named
    Entity Workshop. pp. 34–40. Association for Computational Linguistics, Berlin,
    Germany (Aug 2016)
 7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
    bidirectional transformers for language understanding. In: Proceedings of the 2019
    Conference of the North American Chapter of the Association for Computational
    Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
    pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota
    (Jun 2019)
 8. Grishman, R.: Twenty-five years of information extraction. Natural Language En-
    gineering 25(06), 677–692 (Sep 2019)
 9. Guo, J., Che, W., Wang, H., Liu, T.: Revisiting embedding features for simple
    semi-supervised learning. In: Proceedings of the 2014 Conference on Empirical
    Methods in Natural Language Processing (EMNLP). pp. 110–120. Association for
    Computational Linguistics, Doha, Qatar (Oct 2014)
10. Habibi, M., Weber, L., Neves, M., Wiegandt, D.L., Leser, U.: Deep learning with
    word embeddings improves biomedical named entity recognition. Bioinformatics
    33(14), i37–i48 (Jul 2017)
11. Habibi, M., Wiegandt, D.L., Schmedding, F., Leser, U.: Recognizing chemicals in
    patents: A comparative analysis. Journal of Cheminformatics 8(1) (Oct 2016)
12. He, J., Nguyen, D.Q., Akhondi, S.A., Druckenbrodt, C., Thorne, C., Hoessel, R.,
    Afzal, Z., Zhai, Z., Fang, B., Yoshikawa, H., Albahem, A., Cavedon, L., Cohn, T.,
    Baldwin, T., Verspoor, K.: Overview of ChEMU 2020: Named entity recognition and
    event extraction of chemical reactions from patents. In: Arampatzis, A., Kanoulas,
    E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cap-
    pellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality,
    and Interaction. Proceedings of the Eleventh International Conference of the CLEF
    Association (CLEF 2020), vol. 12260. Lecture Notes in Computer Science (2020)
13. He, J., Nguyen, D.Q., Akhondi, S.A., Druckenbrodt, C., Thorne, C., Hoessel, R.,
    Afzal, Z., Zhai, Z., Fang, B., Yoshikawa, H., Albahem, A., Wang, J., Ren, Y., Zhang,
    Z., Zhang, Y., Hoang Dao, M., Ruas, P., Lamurias, A., M. Couto, F., Copara, J.,
    Naderi, N., Knafou, J., Ruch, P., Teodoro, D., Lowe, D., Mayfield, J., Köksal, A.,
    Dönmez, H., Özkırımlı, E., Özgür, A., Mahendran, D., Gurdin, G., Lewinski, N.,
    Tang, C., McInnes, B.T., C.S., M., RK Rao, P., Lalitha Devi, S., Cavedon, L.,
    Cohn, T., Baldwin, T., Verspoor, K.: An extended overview of the CLEF 2020 ChEMU
    lab: Information extraction of chemical reactions from patents. In: Proceedings
    of the Eleventh International Conference of the CLEF Association (CLEF 2020)
    (2020)
14. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification.
    In: Proceedings of the 56th Annual Meeting of the Association for Computational
    Linguistics (Volume 1: Long Papers). pp. 328–339. Association for Computational
    Linguistics, Melbourne, Australia (Jul 2018)
 15. Irwin, J.J., Shoichet, B.K.: ZINC – a free database of commercially available com-
    pounds for virtual screening. Journal of Chemical Information and Modeling 45(1),
    177–182 (2005), PMID: 15667143
16. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio,
    Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations,
    ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings
    (2015)
 17. Krallinger, M., Rabal, O., Lourenco, A., Perez, M., Pérez-Rodríguez, G., Vazquez,
    M., Leitner, F., Oyarzabal, J., Valencia, A.: Overview of the CHEMDNER patents
    task. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop pp.
    63–75 (01 2015)
18. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Prob-
    abilistic models for segmenting and labeling sequence data. In: Proceedings of the
    Eighteenth International Conference on Machine Learning. p. 282–289. ICML ’01,
    Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
19. Leaman, R., Gonzalez, G.: Banner: An executable survey of advances in biomedical
    named entity recognition. In: Altman, R.B., Dunker, A.K., Hunter, L., Murray,
    T., Klein, T.E. (eds.) Pacific Symposium on Biocomputing. pp. 652–663. World
    Scientific (2008)
20. Leaman, R., Wei, C.H., Lu, Z.: tmChem: a high performance approach for chemical
    named entity recognition and normalization. Journal of Cheminformatics 7(S1)
    (Jan 2015)
21. Lecun, Y.: Generalization and network design strategies. Technical Report CRG-
    TR-89-4, University of Toronto (June 1989)
22. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
    Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized BERT pretraining
    approach. CoRR abs/1907.11692 (2019)
23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representa-
    tions of words and phrases and their compositionality. In: Proceedings of the 26th
    International Conference on Neural Information Processing Systems - Volume 2.
    p. 3111–3119. NIPS’13, Curran Associates Inc., Red Hook, NY, USA (2013)
24. Okurowski, M.E.: Information extraction overview. In: TIPSTER TEXT PRO-
    GRAM: PHASE I: Proceedings of a Workshop held at Fredricksburg, Virginia,
    September 19-23, 1993. pp. 117–121. Association for Computational Linguistics,
    Fredericksburg, Virginia, USA (Sep 1993)
25. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word represen-
    tation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
    Language Processing (EMNLP). Association for Computational Linguistics (2014)
26. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettle-
    moyer, L.: Deep contextualized word representations. In: Proceedings of the 2018
    Conference of the North American Chapter of the Association for Computational
    Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association
    for Computational Linguistics (2018)
27. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language
    models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
28. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recog-
    nition. In: Proceedings of the Thirteenth Conference on Computational Natural
    Language Learning (CoNLL-2009). pp. 147–155. Association for Computational
    Linguistics, Boulder, Colorado (Jun 2009)
29. Rocktäschel, T., Weidlich, M., Leser, U.: ChemSpot: a hybrid system for chemical
    named entity recognition. Bioinformatics 28(12), 1633–1640 (Apr 2012)
30. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with
    subword units. In: Proceedings of the 54th Annual Meeting of the Association for
    Computational Linguistics (Volume 1: Long Papers). pp. 1715–1725. Association
    for Computational Linguistics, Berlin, Germany (Aug 2016)
31. Serrà, J., Karatzoglou, A.: Getting Deep Recommenders Fit: Bloom Embeddings
    for Sparse Binary Input/Output Networks. In: Proceedings of the Eleventh ACM
    Conference on Recommender Systems. p. 279–287. RecSys ’17, Association for
    Computing Machinery, New York, NY, USA (2017)
32. Sutton, C.: An introduction to Conditional Random Fields. Foundations and
    Trends® in Machine Learning 4(4), 267–373 (2012)
33. Teodoro, D., Gobeill, J., Pasche, E., Ruch, P., Vishnyakova, D., Lovis, C.: Auto-
    matic ipc encoding and novelty tracking for effective patent mining. In: The 8th
    NTCIR Workshop Meeting on Evaluation of Information Access Technologies: In-
    formation Retrieval, Question Answering, and Cross-Lingual Information Access
    (2010)
34. Valentinuzzi, M.E.: Patents and scientific papers: Quite different concepts: The
    reward is found in giving, not in keeping [retrospectroscope]. IEEE Pulse 8(1),
    49–53 (2017)
 35. Weininger, D.: SMILES, a chemical language and information system. 1. Introduc-
    tion to methodology and encoding rules. Journal of Chemical Information and
    Computer Sciences 28(1), 31–36 (Feb 1988)
36. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
    M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X.,
    Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian,
    G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O.,
    Corrado, G., Hughes, M., Dean, J.: Google’s neural machine translation system:
    Bridging the gap between human and machine translation. arXiv (2016)
37. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition
    from deep learning models. In: Proceedings of the 27th International Conference
    on Computational Linguistics. pp. 2145–2158. Association for Computational Lin-
    guistics, Santa Fe, New Mexico, USA (Aug 2018)
38. Zhai, Z., Nguyen, D.Q., Akhondi, S., Thorne, C., Druckenbrodt, C., Cohn, T., Gre-
    gory, M., Verspoor, K.: Improving chemical named entity recognition in patents
    with contextualized word embeddings. In: Proceedings of the 18th BioNLP Work-
    shop and Shared Task. pp. 328–338. Association for Computational Linguistics,
    Florence, Italy (Aug 2019)
39. Zhang, Y., Xu, J., Chen, H., Wang, J., Wu, Y., Prakasam, M., Xu, H.: Chemical
    named entity recognition in patents by domain knowledge and unsupervised feature
    learning. Database 2016 (2016)
40. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler,
    S.: Aligning books and movies: Towards story-like visual explanations by watching
    movies and reading books. In: Proceedings of the IEEE International Conference
    on Computer Vision (ICCV). p. 19–27. IEEE Computer Society, USA (2015)