<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLRG ChemNER: A Chemical Named Entity Recognizer @ ChEMU CLEF 2020</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Malarkodi C.S.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pattabhi RK Rao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sobha Lalitha Devi</string-name>
          <email>sobha@au-kbc.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
<aff id="aff0">
          <label>0</label>
          <institution>Computational Linguistics Research Group, AU-KBC Research Centre, MIT Campus of Anna University</institution>, <addr-line>Chennai</addr-line>, <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes our system developed for the ChEMU (Cheminformatics Elsevier Melbourne University) lab @ CLEF 2020 Named Entity Recognition (NER) task for identifying chemical compounds as well as their types in context, i.e., assigning the label of a chemical compound according to the role the compound plays within a chemical reaction in patent documents. We present two systems, which use Conditional Random Fields (CRFs) and Artificial Neural Networks (ANN). In this work we used a feature set that includes linguistic, orthographical and lexical clue features. In developing the systems, we used only the training data provided by the track organizers; no other external resources or embedding models were used. We obtained an F-score of 0.6640 using CRFs and an F-score of 0.3764 using ANN on the test data.</p>
      </abstract>
      <kwd-group>
        <kwd>Chemical named entity recognition</kwd>
<kwd>Artificial Neural Networks</kwd>
        <kwd>Conditional random fields</kwd>
        <kwd>Patent Documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The CLEF 2020 ChEMU NER task aims to automatically identify chemical
compounds and their specific types, i.e. to assign the label of a chemical compound
according to the role it plays in a chemical reaction. In addition to chemical
compounds, the task also requires identification of the temperature and the reaction
time at which the chemical reaction was carried out, the yields obtained for the
final chemical product and the label of the reaction. The focus of this task is
mainly on information extraction from chemical patents. This is a challenging
task as patents are written very differently compared to scientific literature.
When writing scientific papers, authors strive to make their words as clear and
straightforward as possible, whereas patent authors often seek to protect their
knowledge from being fully disclosed [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Thus the main challenge for Natural
Language Processing (NLP) in patent documents arises from their writing style:
very long, winding, complex sentences and listings of chemical compounds. As
syntactic deep parsing is difficult for such sentence constructions, we decided
to use shallow parsing for this work. This paper describes the work we have done in
developing NER systems for the "ChEMU NER task".
      </p>
      <p>We pre-processed the data provided by the task organizers into the required
format to develop our NER systems. Subsequently, features were extracted and
models trained for the identification of entities from the corpus using Machine
Learning (ML) algorithms. In section 2, we briefly review the recent literature. In
section 3, the features and the method used to develop the language
models are described. Results are discussed in section 4. The paper ends with the
conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature Review</title>
      <p>
        In recent years Deep Learning has flourished as a well-known ML methodology
for NLP applications. Using multilayer neural architectures, it can learn
hidden patterns from enormous amounts of data and handle complex
problems. This section briefly reviews recent research in the field of
NER using Deep Learning. Early work on neural networks was done by Gallo
et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to classify named entities in ungrammatical text. Their
Multi-Layer Perceptron (MLP) implementation, called Sliding Window Neural (SwiN),
was specifically developed for grammatically problematic text where
linguistic features could fail. A deep neural framework was developed by Yao
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to identify biomedical named entities. They trained the word
representation model on the PubMed database with the help of the skip-gram model.
Xia et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] built a single neural network for identifying multi-level nested
entities and non-overlapping NEs. Kuru et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used character-level
representations to identify named entities. They utilized Bi-LSTMs to predict the tag
distribution for each character. Wei et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] developed a CRF-based neural
network for identifying disease names. Along with word embeddings, their
system utilized words, PoS information, chunk information and word shape
features. Hong et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] developed a deep learning architecture for BioNER
called DTranNER. It learns label-to-label transitions using contextual
information. The Unary-Network concentrates on tag-wise labelling and the
pairwise network predicts the transition suitability between labels. These
networks are then plugged into the CRF of the deep learning framework. In the
recent past, models combining word-level and character-level representations
have been used. These methods concatenate word embeddings with LSTMs (or
Bi-LSTMs) over the characters of a word, pass this representation through
another sentence-level Bi-LSTM, and predict the final tags using a final
softmax or CRF layer. Lample et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced this architecture and achieved
F-scores of 85.75%, 81.74%, 90.94% and 78.76% on the Spanish, Dutch, English and
German NER datasets respectively from CoNLL 2002 and 2003. Dernoncourt et al.
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] implemented this model in the NeuroNER toolkit with the main goal of
providing easy usability and allowing easy plotting of real-time performance and
learning statistics of the model.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>In this section we present our systems developed using Conditional Random Fields
(CRFs) and Artificial Neural Networks (ANN). For our work we use the CRF++
tool (https://taku910.github.io/crfpp/) and the Scikit-learn python package. The
CRF++ tool is an open source, general purpose implementation of CRFs. Our NER
system follows a pipeline architecture, where the data is first pre-processed into
the format needed to train the system. After training, the NEs are automatically
identified from the test set.</p>
      <sec id="sec-3-1">
        <title>Feature Selection</title>
        <p>Feature selection is an important step in the ML approach to NER. Features
play an important role in boosting the performance of the system; the features
selected must be informative and relevant. We have used word level features,
grammatical features and functional term features, which are detailed below:
1. Word level features: Word level features include orthographical
features and morphological features.</p>
        <p>(a) Orthographical features cover capitalization, combinations of digits,
symbols and words, and Greek words.</p>
        <p>(b) Prefixes and suffixes of chemical entities are considered as morphological
features.</p>
        <p>2. Grammatical features: Grammatical features include the word, PoS, chunk
and combinations of word, PoS and chunk.</p>
        <p>3. Functional term feature: Functional terms help to identify
biological named entities and categorize them into various classes. Examples: alkyl, acid,
alkanylene.</p>
        <p>Grammatical features of Part-of-Speech (PoS) and chunk information are
obtained using automatic tools. More details about the tools are given in the next
sub-section. The morphological features are obtained by extracting the last and
first `n' characters of chemical entities. After performing a few experiments,
n=4 was identified as optimal. A functional terms lexicon was
collected from online sources.</p>
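<p>The word-level features above can be sketched as a small extraction function. This is a hypothetical illustration of the feature types described; the exact feature templates used with CRF++ are not reproduced here.</p>

```python
import re

def word_features(token, n=4):
    """Extract orthographic and morphological features for one token.

    The prefix/suffix length n=4 follows the optimum reported in the text.
    """
    greek = re.compile(r'[\u0370-\u03ff]')  # Greek letters (alpha, beta, ...)
    return {
        'prefix': token[:n],                          # first n characters
        'suffix': token[-n:],                         # last n characters
        'is_capitalized': token[:1].isupper(),
        'has_digit': any(c.isdigit() for c in token),
        # combination of digits and letters, e.g. "2-chloro..."
        'digit_letter_mix': bool(re.search(r'\d', token))
                            and bool(re.search(r'[A-Za-z]', token)),
        'has_symbol': bool(re.search(r'[^A-Za-z0-9]', token)),
        'has_greek': bool(greek.search(token)),
    }

feats = word_features('2-chlorobenzaldehyde')
```

<p>Each token's feature dictionary would be emitted as one column-format row alongside its PoS and chunk tags.</p>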
      </sec>
      <sec id="sec-3-2">
        <title>Pre-processing</title>
        <p>The data is pre-processed using a sentence splitter and tokenizer and is
converted into column format, with entities tagged using the file containing detailed
chemical mention annotations (the BRAT format annotation file). The sentence
splitter and tokenizer used are rule based engines, developed in-house. In these
engines we made a modification, adding special processing to
accommodate long entity names of more than 200 characters. We split these
long names into two tokens and then recombine them into one after PoS tagging and
phrase chunking are completed.</p>
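<p>The long-name handling can be sketched as follows. This is a simplified illustration with hypothetical helper names; the in-house rule based engines themselves are not shown.</p>

```python
MAX_LEN = 200  # entity names longer than this are split before tagging

def split_long_token(token, max_len=MAX_LEN):
    """Split an over-long token into two halves so downstream PoS tagging
    and chunking can handle it; flag the halves for later rejoining."""
    if len(token) <= max_len:
        return [(token, False)]
    mid = len(token) // 2
    return [(token[:mid], True), (token[mid:], True)]

def rejoin(tagged):
    """Recombine split halves after tagging, keeping the tag of the
    first half for the rejoined token."""
    out, buffer, buffer_tag = [], '', None
    for (text, was_split), tag in tagged:
        if was_split:
            if buffer:
                out.append((buffer + text, buffer_tag))
                buffer, buffer_tag = '', None
            else:
                buffer, buffer_tag = text, tag
        else:
            out.append((text, tag))
    return out
```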
        <p>
          Then the data is annotated with syntactic information: Part-of-Speech (PoS)
and phrase chunk information (noun phrase, verb phrase), using the fnTBL tool
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], an open source tool.
        </p>
      </sec>
      <sec id="sec-3-3">
<title>Named Entity Identification</title>
        <p>Features are extracted from the pre-processed data as explained in section
3.1. The data format is the same for both algorithms, and the systems are trained
using the same features extracted from the data. Using these models, the
chemical named entities in the test data were automatically identified. Chemical
entity mention detection in patents requires the start and end indices
corresponding to all chemical entities; hence we converted the output of the
system to the required BRAT format for task submission. The NER language
model developed using CRFs used the features explained in section 3.1 for
training. Using the NER model, the NEs are automatically identified from the
test corpus. All features were extracted from the training corpus provided
by the organizers and no other external resources were used. The same procedure
was followed for the ANN system, which is described below.
Artificial Neural Networks (ANN) A Multi-Layer Perceptron (MLP) is
a feed-forward Artificial Neural Network (ANN). The input layer receives the
input data in numerical form, the output layer takes the decision about the
input, and the hidden layers that exist between these two act as the computational
engine.</p>
        <p>The three important steps involved in a neural network are: 1) each input is
multiplied by a weight, 2) all the weighted inputs are added together with the
bias, and 3) the sum is passed through the activation function. The input node
accepts information in numerical form and, depending on the weight and
the transfer function, the activation value, which is the weighted sum of inputs,
is passed to the next node. The activation function is used to monitor the
threshold level and convert the unbounded value into a bounded one. Each node
in the network computes its activation value and tweaks the sum based on its
transfer function. Activation values propagate through the entire network until
an output node is reached. Traditional systems used the sigmoid or hyperbolic
tangent activation function.</p>
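<p>The three steps above can be illustrated for a single node. This is a minimal numerical sketch, not the paper's actual network.</p>

```python
def relu(x):
    # ReLU activation: converts the unbounded weighted sum into a
    # bounded-below value by clipping at zero
    return max(0.0, x)

def node_output(inputs, weights, bias, activation=relu):
    """One neuron: multiply each input by its weight, add the bias,
    and pass the sum through the activation function."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

# 1.0 * 0.5 + 2.0 * (-0.25) + 0.1 = 0.1, and relu(0.1) = 0.1
y = node_output([1.0, 2.0], [0.5, -0.25], bias=0.1)
```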
        <p>In this work a Multilayer Perceptron (MLP) network is used. An MLP is a
stack of layers of perceptrons connected together: the first layer's output
goes as input to the next layer, and so on until the output layer is reached. The
hidden layers are the layers that exist between the input and output layers.
Feed-forward networks like the MLP have two passes, namely the forward and
backward pass. In the forward pass, the input is propagated forward to obtain
the output, which is compared with the true value to compute the error. In order
to reduce the error, multilayer perceptrons propagate backward and adjust the
weights. This process of back-propagation is used to adjust the weights and
biases relative to the error, and it continues until the estimated output is
obtained. The ReLU activation function is used in the MLP. The training process
comprises three steps: the forward pass, error calculation and the backward pass.
In the forward pass the input data is multiplied by the weights and added to the
bias at each node, passing through the hidden layers to the output layer. The
cost function is used to measure the performance of the model, computed as the
difference between the predicted and expected values. Once the loss is calculated,
we back-propagate it in order to update the weights of each node using gradient
descent. The weights are tweaked according to the gradient flow, the main
intention being to minimize the loss.</p>
        <p>In this work we have used Scikit-learn's Multi-Layer
Perceptron implementation. The process of converting the input data into numerical
feature vectors is called vectorization and it involves mainly three steps, namely
tokenization, counting and normalization. The resulting data is called a
bag-of-words representation: the input text is represented using word occurrences
rather than the relative positions of the words in the document. We
have used CountVectorizer to represent the data in bag-of-words format; it
converts the text data into numerical features. TfidfVectorizer is used
to convert the bag of words into a matrix of TF-IDF features. After setting
the size of the hidden layer and determining the activation and optimization
functions, the data is given to the training process. The ReLU activation function
is used for the hidden layers of the present MLP implementation, and the
stochastic gradient optimizer Adam is used for weight optimization.</p>
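<p>The vectorization pipeline described above can be sketched with Scikit-learn. This is a minimal illustration on toy sentences, not the actual task data; note that in Scikit-learn, TfidfTransformer converts an existing count matrix, while TfidfVectorizer combines both steps starting from raw text.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the mixture was stirred at 50 C",
    "the product was washed with water",
]

# Bag-of-words: tokenize, then count occurrences (word positions are discarded)
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Convert the raw counts into a matrix of TF-IDF features
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.shape, tfidf.shape)
```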
        <p>The hidden layer size we used is 30 units, the activation function
for the hidden layer is ReLU (Rectified Linear Unit), and the `adam'
stochastic gradient-based optimizer is used as the solver for weight
optimization. The `alpha' regularization parameter is set to 0.0001. The learning
rate schedule used for weight updates is `constant': a constant learning rate
given by the initial learning rate, which helps control the step size in updating
the weights; the `learning_rate_init' value is set to 0.001. Batch size refers to
the number of training examples used in one iteration; mini-batches are used for
stochastic optimizers, with a batch size of 200 by default. The
architecture of the MLP implementation is shown in Figure 1.</p>
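<p>The hyperparameters listed above correspond to the following Scikit-learn configuration. This is a sketch on synthetic data; the actual feature matrix comes from the TF-IDF pipeline described earlier.</p>

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hyperparameters as described in the text
clf = MLPClassifier(
    hidden_layer_sizes=(30,),    # one hidden layer of 30 units
    activation='relu',           # ReLU for the hidden layer
    solver='adam',               # stochastic gradient-based optimizer
    alpha=0.0001,                # L2 regularization parameter
    learning_rate='constant',    # constant schedule for weight updates
    learning_rate_init=0.001,
    batch_size=200,              # mini-batch size (the default is 'auto')
    max_iter=500,
    random_state=0,
)

# Tiny synthetic stand-in for the TF-IDF feature matrix and entity labels
rng = np.random.RandomState(0)
X = rng.rand(40, 10)
y = (X[:, 0] > 0.5).astype(int)

clf.fit(X, y)
pred = clf.predict(X)
```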
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>The system outputs on the test data were evaluated by the track organizers;
precision, recall and F-score were calculated. The test results are tabulated in
Tables 1 and 2: Table 1 provides the test results of the system developed using
CRFs and Table 2 presents the results of the system using ANN.</p>
      <p>
        The system based on CRFs gave very good precision. The recall is
low, especially for the entities "YIELD OTHER" and "YIELD PERCENT";
this could have been improved by using post-processing rules. The results
obtained using ANN are lower than those of the CRF-based system. This clearly shows
that the training data size is not sufficient for the ANN: the ANN system requires
the use of external resources such as pre-trained word embeddings and other
available annotated resources.
</p>
      <p>
        We submitted two systems developed using Machine Learning (ML) techniques,
Conditional Random Fields (CRFs) and Artificial Neural Networks (ANN). A
two-stage pre-processing is performed on the data: 1) the formatting stage, where
sentence splitting, tokenization and conversion to column format are done, and
2) the data annotation stage, where the data is annotated with syntactic
information: Part-of-Speech (PoS) and phrase chunk information (noun phrase,
verb phrase). For both PoS and chunk information, fnTBL [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
an open source tool, is used. We have used the training data provided by the
task organizers and have not used any external resources or pre-trained
language models. The language models are developed using CRFs and ANN: the
CRF++ tool is used for the CRF model, and the ANN application uses
the Scikit-learn python package with a Multilayer Perceptron (MLP). The ReLU
activation function is used in the MLP, and the stochastic gradient optimizer
Adam is used for weight optimization; it adjusts and calculates the learning rates
for different parameters at each node. We obtained an F-score of 0.6640 using
CRFs and an F-score of 0.3764 using ANN on the test data. It can be observed
from the results that CRFs performed better for the given training data, which
shows that ANNs require more training data or pre-trained models. A better
solution could be arrived at by combining ANN and CRFs, which we would like to
do in future work.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Franck</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          , Ji Young Lee, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Szolovits</surname>
          </string-name>
          .:
<article-title>NeuroNER: an easy-to-use program for named-entity recognition based on neural networks</article-title>
          .
          <source>In: arXiv preprint arXiv:1705.05487</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gallo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Binaghi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carullo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Lamberti</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition by neural sliding window</article-title>
          .
          <source>In:The Eighth IAPR International Workshop on Document Analysis Systems</source>
, pp.
          <fpage>567</fpage>
          -
          <lpage>573</lpage>
          .
IEEE
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J. G.</given-names>
          </string-name>
          :
          <article-title>DTranNER: biomedical named entity recognition with deep learning-based label-label transition model</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>21</volume>
          (
          <issue>1</issue>
          ),
          <fpage>53</fpage>
          -
          <lpage>73</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kuru</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Can</surname>
            ,
            <given-names>O. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuret</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
: CharNER:
          <article-title>Character-level named entity recognition</article-title>
          .
          <source>In:Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics</source>
, pp.
          <fpage>911</fpage>
          -
          <lpage>921</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
<string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
<article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In: Proceedings of International Conference on Machine Learning</source>
, pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lample</surname>
          </string-name>
          , Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer.:
          <article-title>Neural architectures for named entity recognition</article-title>
          .
<source>In: arXiv preprint arXiv:1603.01360</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Grace</given-names>
            <surname>Ngai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Radu</given-names>
            <surname>Florian</surname>
          </string-name>
          .
          <article-title>Transformation-based learning in the fast lane</article-title>
          .
<source>In: Proceedings of North American ACL 2001</source>
, pages
          <fpage>40</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
<string-name>
            <given-names>Max</given-names>
            <surname>Valentinuzzi</surname>
          </string-name>
          .:
          <article-title>Patents and Scientific Papers: Quite Different Concepts</article-title>
          .
          <source>IEEE Pulse</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>49</fpage>
          -
<lpage>53</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          .:
<article-title>Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks</article-title>
          .
          <source>Database</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Multi-grained named entity recognition</article-title>
. In: arXiv preprint arXiv:1906.08449 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          :
          <article-title>Biomedical named entity recognition based on deep neutral network</article-title>
          .
          <source>International Journal of Hybrid Information Technology</source>
          <volume>8</volume>
          (
          <issue>8</issue>
), pp.
          <fpage>279</fpage>
          -
          <lpage>288</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>