CLRG ChemNER: A Chemical Named Entity Recognizer @ ChEMU CLEF 2020

Malarkodi C.S., Pattabhi R.K. Rao, and Sobha Lalitha Devi

Computational Linguistics Research Group, AU-KBC Research Centre, MIT Campus of Anna University, Chennai, India
sobha@au-kbc.org

Abstract. This paper describes our system developed for the Named Entity Recognition (NER) task of the ChEMU (Cheminformatics Elsevier Melbourne University) lab @ CLEF, which requires identifying chemical compounds as well as their types in context, i.e., assigning each chemical compound a label according to the role the compound plays within a chemical reaction described in patent documents. We present two systems, one using Conditional Random Fields (CRFs) and one using Artificial Neural Networks (ANN). The feature set includes linguistic, orthographic and lexical-clue features. In developing the systems we used only the training data provided by the track organizers; no external resources or embedding models were used. We obtained an F-score of 0.6640 using CRFs and an F-score of 0.3764 using ANN on the test data.

Keywords: Chemical named entity recognition · Artificial Neural Networks · Conditional Random Fields · Patent documents.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

The CLEF 2020 ChEMU NER task aims to automatically identify chemical compounds and their specific types, i.e., to assign each chemical compound a label according to the role it plays in a chemical reaction. In addition to chemical compounds, the task also requires identification of the temperature and reaction time at which the chemical reaction was carried out, the yields obtained for the final chemical product, and the label of the reaction. The focus of this task is mainly on information extraction from chemical patents. This is challenging because patents are written very differently from scientific literature: when writing scientific papers, authors strive to make their words as clear and straightforward as possible, whereas patent authors often seek to protect their knowledge from being fully disclosed [8]. Thus the main challenge for Natural Language Processing (NLP) in patent documents arises from the writing style, with very long, winding, complex sentences and listings of chemical compounds. As deep syntactic parsing is difficult for such sentence constructions, we decided to use shallow parsing for this work. This paper describes the work we have done in developing NER systems for the ChEMU NER task.

We pre-processed the data provided by the task organizers into the format required to develop our NER systems. Subsequently, features were extracted and the systems were trained to identify the entities in the corpus using machine learning (ML) algorithms. In Section 2 we briefly review the recent literature. In Section 3 the features and the method used to develop the language models are described. Results are discussed in Section 4. The paper ends with the conclusion.

2 Literature Review

In recent years deep learning has flourished as a well-known ML methodology for NLP applications. Using multilayer neural architectures, it can learn hidden patterns from enormous amounts of data and handle complex problems. This section briefly reviews recent research on NER using deep learning.
Early work using neural networks was done by Gallo et al. [2] to classify named entities in ungrammatical text. Their Multi-Layer Perceptron (MLP) implementation, called Sliding Window Neural (SwiN), was specifically developed for grammatically problematic text where linguistic features can fail. Yao et al. [11] developed a deep neural framework to identify biomedical named entities; they trained a word representation model on the PubMed database with the help of the skip-gram model. Xia et al. [10] built a single neural network for identifying both multi-level nested entities and non-overlapping NEs. Kuru et al. [4] used character-level representations to identify named entities, utilizing Bi-LSTMs to predict the tag distribution for each character. Wei et al. [9] developed a CRF-based neural network for identifying disease names; along with word embeddings, their system utilized words, PoS information, chunk information and word-shape features. Hong et al. [3] developed a deep learning architecture for biomedical NER called DTranNER, which learns label-to-label transitions using contextual information: its Unary-Network concentrates on tag-wise labelling, its Pairwise-Network predicts the transition suitability between labels, and both networks are plugged into the CRF of the deep learning framework. In the recent past, models combining word-level and character-level representations have come into use. These methods concatenate word embeddings with LSTM (or Bi-LSTM) representations over the characters of a word, pass this representation through another sentence-level Bi-LSTM, and predict the final tags using a final softmax or CRF layer. Lample et al. [6] introduced this architecture and achieved F-scores of 85.75%, 81.74%, 90.94% and 78.76% on the Spanish, Dutch, English and German NER datasets, respectively, from CoNLL 2002 and 2003. Dernoncourt et al. [1] implemented this model in the NeuroNER toolkit, with the main goals of easy usability and easy plotting of real-time performance and learning statistics of the model.

3 Method

In this section we present our systems developed using Conditional Random Fields (CRFs) [5] and Artificial Neural Networks (ANN). For our work we use the CRF++ tool and the Scikit-learn Python package. CRF++ (https://taku910.github.io/crfpp/) is an open-source, general-purpose implementation of CRFs. Our NER system follows a pipeline architecture, where the data is first pre-processed into the format needed to train the system. After training, the NEs are automatically identified from the test set.

3.1 Feature Selection

Feature selection is an important step in the ML approach to NER. Features play an important role in boosting the performance of the system, and the features selected must be informative and relevant. We used word-level features, grammatical features and functional-term features, detailed below; a sketch of the corresponding token-level feature extraction follows the list.

1. Word-level features: these include orthographic features and morphological features. (a) Orthographic features cover capitalization, combinations of digits, symbols and words, and Greek letters. (b) Prefixes and suffixes of chemical entities are taken as morphological features.
2. Grammatical features: the word, its PoS tag, its chunk tag, and the combination of word, PoS and chunk.
3. Functional-term feature: functional terms help to identify biological named entities and categorize them into various classes. Examples: alkyl, acid, alkanylene.
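As an illustration of the feature set above, the following Python sketch assembles a feature map for one token of the column-format data. The function name, the exact feature inventory and the toy functional-terms lexicon are our assumptions for exposition, not the actual in-house implementation.

import re

def token_features(tokens, pos_tags, chunk_tags, i, n=4):
    """Illustrative feature map for token i (hypothetical, not the exact in-house set)."""
    w = tokens[i]
    return {
        # Grammatical features: word, PoS, chunk and their combination
        'word': w,
        'pos': pos_tags[i],
        'chunk': chunk_tags[i],
        'word_pos_chunk': w + '|' + pos_tags[i] + '|' + chunk_tags[i],
        # Orthographic features
        'is_capitalized': w[:1].isupper(),
        'has_digit': any(c.isdigit() for c in w),
        'digit_symbol_mix': bool(re.search(r'\d', w)) and bool(re.search(r'[^A-Za-z0-9]', w)),
        'has_greek': bool(re.search(r'[\u0370-\u03ff]', w)),
        # Morphological features: first/last n characters (n = 4, see below)
        'prefix': w[:n],
        'suffix': w[-n:],
        # Functional-term lexical clue (toy lexicon for illustration only)
        'is_functional_term': w.lower() in {'alkyl', 'acid', 'alkanylene'},
    }

For CRF++, features of this kind become additional columns of the column-format training file, which the feature template then references.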
The grammatical features of Part-of-Speech (PoS) and chunk information are obtained using automatic tools; more details about these tools are given in the next subsection. The morphological features are obtained by extracting the last and first 'n' characters of chemical entities; after a few experiments, n = 4 was identified as optimal. The functional-terms lexicon was collected from online sources.

3.2 Pre-processing

The data is pre-processed using a sentence splitter and a tokenizer and is converted into column format, with entities tagged using the file containing the detailed chemical mention annotations (BRAT-format annotation file). The sentence splitter and the tokenizer are rule-based engines developed in-house. We modified these engines with special processing to accommodate long entity names of more than 200 characters: such names are split into two tokens and recombined into one after PoS tagging and phrase chunking are completed. The data is then annotated with syntactic information, Part-of-Speech (PoS) tags and phrase-chunk information (noun phrase, verb phrase), using fnTBL [7], an open-source tool.

3.3 Named Entity Identification

Features are extracted from the pre-processed data as explained in Section 3.1. The data format is the same for both algorithms, and both systems are trained using the same features extracted from the data. Using these models, the chemical named entities in the test data were automatically identified. Chemical entity mention detection in patents requires the start and end indices of all chemical entities, so we converted the system output to the required BRAT format for task submission. The NER language models developed using CRFs used the features explained in Section 3.1 for training; using the NER model, the NEs are automatically identified from the test corpus. All features were extracted from the training corpus provided by the organizers, and no other external resources were used. We followed the same procedure for the ANN system, which is described after the following sketch of a possible CRF++ feature template.
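To make the CRF++ setup concrete, the template below is a minimal sketch of a feature template over a column-format file with columns 0 = token, 1 = PoS, 2 = chunk, 3 = suffix. The actual template used for the submission is not reproduced in the paper, so these macros are illustrative assumptions.

# Lines starting with # are ignored by CRF++.
# U macros generate unigram (state) features; %x[row,col] reads the
# value at a relative row offset and an absolute column index.
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[0,2]
U05:%x[0,3]
U06:%x[0,0]/%x[0,1]
# B generates bigram (label-transition) features.
B

Training and tagging then use the standard CRF++ commands, e.g. crf_learn template train.col model and crf_test -m model test.col.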
Artificial Neural Networks (ANN)

A Multi-Layer Perceptron (MLP) is a feed-forward Artificial Neural Network (ANN). The input layer receives the input data in numerical form, the output layer takes the decision about the input, and the hidden layers that exist between these two act as the computational engine. Three steps are involved at each node of the network: 1) each input is multiplied by a weight, 2) all the weighted inputs are added together with a bias, and 3) the sum is passed through an activation function. Depending on the weights and the transfer function, the activation value, i.e., the weighted sum of the inputs, is passed on to the next node. The activation function monitors the threshold level and converts an unbounded value into a bounded one; activation values propagate through the entire network until they reach an output node. Traditional systems used the sigmoid or hyperbolic tangent activation function.

In this work an MLP network is used. The MLP is a stack of perceptron layers connected together: the output of the first layer goes as input to the next layer, and so on until the output layer is reached; the hidden layers are the layers between the input and output layers.

[Fig. 1. ANN architecture of the MLP network]

Feed-forward networks like the MLP have two passes, a forward pass and a backward pass, and the training process comprises three steps: the forward pass, error calculation and the backward pass. In the forward pass, at each node the input data is multiplied by the weights and added to the bias, and the result propagates through the hidden layer to the output layer; the output is compared with the real value to get the error. The cost function measures the performance of the model and is computed as the difference between the predicted and the expected value. Once the loss is calculated, it is back-propagated in order to update the weights of each node using gradient descent; the weights are tweaked in the direction of the gradient, with the aim of minimizing the loss. This process of back-propagation adjusts the weights and biases relative to the error and continues until the desired output is obtained.

In this work we used the Multi-Layer Perceptron from the Scikit-learn Python package. The process of converting the input data into numerical feature vectors is called vectorization, and it mainly involves three steps: tokenization, counting and normalization. The resulting data is called a bag-of-words representation; in this form the input text is represented by word occurrences rather than the relative positions of the words in the document. We used CountVectorizer to convert the text data into numerical features in bag-of-words format, and the TF-IDF vectorizer to convert the bag of words into a matrix of TF-IDF features. After setting the size of the hidden layer and choosing the activation and optimization functions, the data is passed to the training process.

The ReLU (Rectified Linear Unit) activation function is used for the hidden layer of the present MLP implementation, and the hidden layer size is set to 30. The stochastic gradient-based optimizer 'adam' is used as the solver for weight optimization. The 'alpha' regularization parameter is set to 0.0001. The learning-rate schedule used for weight updates is 'constant', i.e., a constant learning rate given by the initial learning rate, which controls the step size in updating the weights; 'learning_rate_init' is set to 0.001. The batch size, the number of training examples used in one iteration, is assigned as mini-batches for the stochastic optimizer and is set to 200, the default. The architecture of the MLP implementation is shown in Fig. 1, and a sketch of this configuration in Scikit-learn is given below.
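The following Python sketch reconstructs the described setup with Scikit-learn, under two assumptions: we read "hidden layer size 30" as a single 30-unit layer, and we use TfidfTransformer for the count-to-TF-IDF step that the text attributes to a TF-IDF vectorizer. The toy data is purely illustrative; the real inputs come from the BRAT-annotated patent snippets.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy examples: the textual context of a mention and its NE class.
train_texts = ["the mixture was stirred at 80 C for 2 h",
               "washed with ethyl acetate and dried"]
train_labels = ["TEMPERATURE", "SOLVENT"]

clf = Pipeline([
    ("counts", CountVectorizer()),        # bag-of-words counts
    ("tfidf", TfidfTransformer()),        # counts -> TF-IDF matrix
    ("mlp", MLPClassifier(
        hidden_layer_sizes=(30,),         # one 30-unit hidden layer (our reading)
        activation="relu",                # ReLU on the hidden layer
        solver="adam",                    # stochastic gradient-based optimizer
        alpha=0.0001,                     # L2 regularization parameter
        learning_rate="constant",         # constant schedule from the initial rate
        learning_rate_init=0.001,
        batch_size=200,                   # mini-batch size
    )),
])

clf.fit(train_texts, train_labels)
print(clf.predict(["heated at 100 C overnight"]))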
4 Results and Discussion

The system outputs on the test data were evaluated by the track organizers, and precision, recall and F-score were calculated. The test results are tabulated in Tables 1 and 2: Table 1 gives the test results of the system developed using CRFs and Table 2 gives the results of the system using ANN.

Table 1. Test results of the CRF-based NER system

NE Label            Precision   Recall   F1 Score
EXAMPLE_LABEL         0.9732    0.4155    0.5823
OTHER_COMPOUND        0.9378    0.5656    0.7057
REACTION_PRODUCT      0.8600    0.4118    0.5569
REAGENT_CATALYST      0.8214    0.7495    0.7838
SOLVENT               0.8776    0.7066    0.7828
STARTING_MATERIAL     0.7192    0.7862    0.7512
TEMPERATURE           0.9802    0.7288    0.8360
TIME                  0.9954    0.4779    0.6457
YIELD_OTHER           0.9315    0.1545    0.2651
YIELD_PERCENT         0.8571    0.0154    0.0303
Average               0.8793    0.5334    0.6640

The CRF-based system gives very good precision. The recall is low, especially for the entities YIELD_OTHER and YIELD_PERCENT; this could have been improved using post-processing rules. The results obtained using ANN are lower than those of the CRF-based system, which clearly shows that the training data size is not sufficient for the ANN. The ANN system requires the use of external resources such as pre-trained word embeddings and other available annotated resources.

Table 2. Test results of the ANN-based NER system

NE Label            Precision   Recall   F1 Score
EXAMPLE_LABEL         0.5455    0.0172    0.0333
OTHER_COMPOUND        0.5977    0.2430    0.3455
REACTION_PRODUCT      0.5755    0.2573    0.3556
REAGENT_CATALYST      0.8620    0.5521    0.6731
SOLVENT               0.8958    0.4066    0.5593
STARTING_MATERIAL     0.6486    0.4480    0.5299
TEMPERATURE           0.8239    0.2369    0.3680
TIME                  0.9504    0.2544    0.4014
YIELD_OTHER           0.1429    0.0114    0.0211
YIELD_PERCENT         0.1429    0.0334    0.0542
Average               0.6686    0.2619    0.3764

5 Conclusion

We submitted two systems developed using machine learning techniques: Conditional Random Fields (CRFs) and Artificial Neural Networks (ANN). A two-stage pre-processing is performed on the data: 1) the formatting stage, where sentence splitting, tokenization and conversion of the data to column format are performed, and 2) the data annotation stage, where the data is annotated with syntactic information, Part-of-Speech (PoS) tags and phrase-chunk information (noun phrase, verb phrase). For both the PoS and chunk information, fnTBL [7], an open-source tool, is used. We used only the training data provided by the task organizers and did not use any external resources or pre-trained language models. The language models are developed using CRFs and ANN: the CRF++ tool is used for developing the CRF model, and the ANN application uses the Scikit-learn Python package. The ANN is a Multilayer Perceptron (MLP) with the ReLU activation function. The stochastic gradient optimizer Adam is used for weight optimization; it calculates and adjusts the learning rates of the different parameters at each node. We obtained an F-score of 0.6640 using CRFs and an F-score of 0.3764 using ANN on the test data. It can be observed from the results that CRFs performed better on the given training data, which shows that ANNs require more training data or pre-trained models. A better solution could be arrived at by combining ANN and CRFs, which we would like to do in future work.

References

1. Dernoncourt, F., Lee, J.Y., Szolovits, P.: NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487 (2017)
2. Gallo, I., Binaghi, E., Carullo, M., Lamberti, N.: Named entity recognition by neural sliding window. In: The Eighth IAPR International Workshop on Document Analysis Systems, pp. 567–573. IEEE (2008)
3. Hong, S.K., Lee, J.G.: DTranNER: biomedical named entity recognition with deep learning-based label-label transition model. BMC Bioinformatics 21(1), 53–73 (2020)
4. Kuru, O., Can, O.A., Yuret, D.: CharNER: character-level named entity recognition. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics, pp. 911–921 (2016)
5. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning, pp. 282–289 (2001)
6. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)
7. Ngai, G., Florian, R.: Transformation-based learning in the fast lane. In: Proceedings of North American ACL 2001, pp. 40–47 (2001)
8. Valentinuzzi, M.: Patents and scientific papers: quite different concepts. IEEE Pulse 8(1), 49–53 (2017)
9. Wei, Q., Chen, T., Xu, R., He, Y., Gui, L.: Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks. Database 2016 (2016)
10. Xia, C., Zhang, C., Yang, T., Li, Y., Du, N., Wu, X., Yu, P.: Multi-grained named entity recognition. arXiv preprint arXiv:1906.08449 (2019)
11. Yao, L., Liu, H., Liu, Y., Li, X., Anwar, M.W.: Biomedical named entity recognition based on deep neural network. International Journal of Hybrid Information Technology 8(8), 279–288 (2015)