<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HiLT@IECSIL-FIRE-2018: A Named Entity Recognition System for Indian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sagar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Srinivas P Y K L</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rusheel Koushik Gollakota</string-name>
          <email>rusheelkoushik.g16g@iiits.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amitava Das</string-name>
          <email>amitava.das@mechyd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology</institution>
          ,
          <addr-line>Sri City, Andhra Pradesh 517646</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our submission for the FIRE 2018 Shared Task on Named Entity Recognition in Indian Languages [3]. Identification of Named Entities is important in several higher-level language technology systems such as Information Extraction, Machine Translation, and Cross-Lingual Information Access systems. Our system makes use of the contextual information of words along with Word-Level and Character-Level features that help predict the various Named Entity classes. The model is an end-to-end, language-independent deep learning model for Named Entity Recognition intended to support all Indian languages. It uses no domain-specific knowledge and no handcrafted rules of any sort. Instead, it relies solely on semantic information in the form of word-level and character-level embeddings learned by unsupervised learning algorithms, so the model can be replicated across other languages. We use a 2CNN+LSTM architecture consisting of a CNN as character-level encoder, a CNN as word-level encoder, and a BiLSTM as the tag decoder.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Named Entity Recognition (NER) is an important tool in almost all Natural Language Processing (NLP) applications. Proper identification and classification of Named Entities is crucial and poses a big challenge to NLP researchers. This work focuses on NLP approaches for identifying Named Entities such as Name, Number, Date, Event, Location, Things, Organization and Occupation. NER is an important step in various NLP applications like Machine Translation, Question Answering, Topic Modelling, Information Extraction and many more. Further, Indian languages have no concept of capitalization as English does, which makes the NER task more difficult and challenging, since rules prepared for English cannot be applied directly to Indian languages. We have therefore come up with a language-independent deep learning model for Named Entity Recognition. In this paper, we use a simple 2CNN+LSTM model, built using character embeddings and word embeddings, to predict the tag of each word in a sentence by extracting features at the Character Level and Word Level respectively.</p>
      <p>The rest of the paper is organized as follows: In Section 2 we review related work. In Section 3 we describe the dataset for our task. In Section 4 we present our methodology and describe our model for Named Entity Recognition. In Section 5 we discuss our experiments and results and conclude the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The NER task has been extensively studied in the Natural Language Processing literature. Previous approaches to NER can be classified into rule-based and learning-based approaches, and learning-based approaches can further be divided into machine learning approaches and deep learning approaches. An early system built for NER used large lists of names of people, locations, etc. [8]. Rule-based systems had the big disadvantage that a huge list of all the named entities had to be compiled before their Named Entity classes could be determined; they had no capability of detecting a new Named Entity and determining its class if it was not already in the dictionary. Then came the learning-based approaches, which rely on handcrafted features such as capitalization. These features were fed to machine learning classifiers like SVM and Naive Bayes, which were then used to classify words into different classes [7]. But this is a problem with Indian languages themselves: Indian languages show frequent variation, so very big dictionaries cannot be made. There is a further problem with such approaches, since these languages do not have capitalization and other cues which make classification easy. NER has also been treated as a sequence labelling problem, because context is very important in determining the entities; HMMs and CRFs were used to solve such problems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. More recently, deep active learning has been used, with character-level and word-level encoders, for named entity recognition in English [6]. Deep learning approaches using LSTMs have also been applied to predict Named Entities and their classes in Indian languages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>
        The labelled training data for Hindi, Telugu, Tamil, Malayalam and Kannada was provided by FIRE 2018; every word in each sentence was tagged with its corresponding NER tag [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Unlabelled data was also provided for all five Indian languages, for both the pre-evaluation and the final evaluation.
      </p>
      <sec id="sec-3-1">
        <title>Data Format</title>
        <p>Every line in the training file contains one word and its corresponding tag separated by a tab, and a sentence ends with a "newline" carrying no tag. The test data format is almost the same as the training format: every line has a word, but no corresponding tag, and a sentence again ends with the same "newline".</p>
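This line-per-token format can be read with a short sketch; the function name and the handling of tagless test lines are our own choices, not part of the shared-task tooling:

```python
def read_conll_style(path):
    """Parse the shared-task format: one 'word<TAB>tag' pair per line,
    sentences separated by a blank line (a newline with no tag)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                          # sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, _, tag = line.partition("\t")
            current.append((word, tag or None))   # test files carry no tag
    if current:                                   # flush last sentence
        sentences.append(current)
    return sentences
```

The same reader works for the test files: `partition` leaves the tag empty when no tab is present, so test tokens come back paired with `None`.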
      </sec>
      <sec id="sec-3-2">
        <title>Methodology and Character Level Encoder</title>
        <p>We have built a 2CNN+LSTM model for this NER task, using word-level and character-level features of each word. We use CNNs to determine the Character-Level and Word-Level feature vectors. LSTMs could also be used in place of CNNs for this task, but since LSTMs are computationally much more expensive [6], CNNs were used. Fig. 1 describes the complete workflow through the model, which consists of three components. Initially, character embeddings [5] are created for all characters in the words of the training data. Word lengths vary in the training set, so each word is brought to a fixed length of double the average word length: words shorter than this length are padded, and longer words are cut down. Character tokenization is also performed in order to rank characters by their frequency over all words. Only the top 75% of characters are retained, and all other characters are replaced by the unknown &lt;UNK CHAR&gt; tag.</p>
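The frequency ranking and fixed-length padding described above can be sketched as follows. This is a minimal illustration under our own assumptions: the vocabulary layout (padding at id 0, unknown at id 1) and the tag spellings are ours, not the authors':

```python
from collections import Counter

PAD, UNK_CHAR = "<PAD>", "<UNK_CHAR>"

def build_char_vocab(words, keep_fraction=0.75):
    """Rank characters by frequency over all training words and keep
    only the top 75%; everything else maps to the unknown tag."""
    counts = Counter(ch for w in words for ch in w)
    ranked = [ch for ch, _ in counts.most_common()]
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    vocab = {PAD: 0, UNK_CHAR: 1}
    for ch in kept:
        vocab[ch] = len(vocab)
    return vocab

def encode_word(word, vocab, max_len):
    """Pad/truncate a word to a fixed length (the paper uses twice the
    average training-word length) and map characters to vocab ids."""
    ids = [vocab.get(ch, vocab[UNK_CHAR]) for ch in word[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

The word-level tokenization in Section 4.2 follows the same pattern, with words in place of characters and &lt;UNK WORD&gt; as the unknown tag.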
        <p>The Character-Level Encoder extracts the character-level features of a word. A sentence is taken as input, and every word of the sentence is mapped to its characters; every character is in turn mapped to its character embedding. A word matrix is formed for every word by stacking the character embeddings of its characters. A CNN is then applied over every word matrix, followed by a max-pooling layer, to obtain a Character-Level Feature Vector for the word.</p>
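A minimal numpy sketch of this encoder follows. It is an illustration only: the filter weights are random placeholders (the real model learns them), and the paper's two stacked character CNN layers are collapsed into a single convolution here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def char_encoder(char_ids, emb, filters, width=3):
    """Character-level encoder sketch: look up character embeddings,
    run a 1-D convolution over the word matrix, then max-pool over
    positions so every word yields one fixed-size feature vector."""
    x = emb[char_ids]                            # (word_len, emb_dim)
    n_filters = filters.shape[0]
    steps = x.shape[0] - width + 1
    conv = np.empty((steps, n_filters))
    for t in range(steps):
        window = x[t : t + width].ravel()        # (width * emb_dim,)
        conv[t] = filters @ window               # one value per filter
    conv = np.maximum(conv, 0)                   # ReLU
    return conv.max(axis=0)                      # max-pool -> (n_filters,)

# Toy sizes matching the paper: 50-dim char embeddings, 20 filters.
emb = rng.normal(size=(30, 50))                  # char vocab of 30
filters = rng.normal(size=(20, 3 * 50))          # 20 filters of width 3
vec = char_encoder(np.array([4, 7, 2, 9, 1]), emb, filters)
```

The max over positions is what makes the output length-independent: however long the word, one 20-dimensional feature vector comes out.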
      </sec>
      <sec id="sec-3-3">
        <title>Word Level Encoder</title>
        <p>Word embeddings [5] are created for all words using word2vec. We need to bring the sentences to uniform length, so sentences are brought to a length of double the average sentence length in the training set: sentences shorter than this length are padded, and longer sentences are cut down. All sentences are then tokenized and the words ranked by their frequency in the document. Further, in order to handle unknown words, the top 75% of words are retained and the remaining words are replaced by the unknown &lt;UNK WORD&gt; tag.</p>
        <p>The Word-Level Encoder extracts word-level features from the surrounding words in a sentence. A sentence is taken as input and every word is mapped to its word embedding, which is concatenated with that word's Character-Level Feature Vector. A sentence matrix is formed by stacking these per-word vectors, and a CNN is applied over it. No max-pooling layer is applied here, because we need as many output vectors as there are words in the sentence in order to determine the NER tag of every word.</p>
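The key difference from the character-level encoder is that the convolution here must preserve sequence length. A numpy sketch with zero padding at the sentence edges (again with placeholder random weights; the concatenated per-word dimensionality below assumes the sizes reported in Section 5):

```python
import numpy as np

rng = np.random.default_rng(1)

def word_encoder(sent_matrix, filters, width=3):
    """Word-level encoder sketch: convolve over the sentence matrix
    (per word: word embedding concatenated with the char feature
    vector) with zero padding so the output keeps one vector per
    word -- no max-pooling, since the decoder tags every position."""
    sent_len, dim = sent_matrix.shape
    pad = width // 2
    padded = np.vstack([np.zeros((pad, dim)), sent_matrix, np.zeros((pad, dim))])
    out = np.empty((sent_len, filters.shape[0]))
    for t in range(sent_len):
        out[t] = filters @ padded[t : t + width].ravel()
    return np.maximum(out, 0)                    # ReLU

# Per word: 100-dim word embedding + 20-dim char feature vector.
sent = rng.normal(size=(8, 120))                 # sentence of 8 words
filters = rng.normal(size=(10, 3 * 120))         # 10 filters as in the paper
feats = word_encoder(sent, filters)
```

An 8-word sentence in therefore yields exactly 8 feature vectors out, one per word to be tagged.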
      </sec>
      <sec id="sec-3-4">
        <title>Tag Decoder</title>
        <p>
          The Tag Decoder induces a probability distribution [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] over the sequence of tags in a sentence. A sentence is taken as input and every word is mapped to its word embedding. The output of the Word-Level CNN is concatenated with the corresponding word embeddings, and this is given as input to a Bidirectional LSTM, which determines the output tag of every word in the sentence. The output for every word is passed through three projection layers, with Softmax applied on the last projection layer to determine the corresponding tag.
        </p>
        <p>Fig. 1: Language Independent Named Entity Recognition Model</p>
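The decoder head can be sketched as below, assuming the BiLSTM hidden states are already computed (implementing the BiLSTM itself is omitted). Weights are random placeholders, and the tag-set size of 9 is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tag_projection(h, W1, W2, W3):
    """Decoder head sketch: each word's BiLSTM state passes through
    three projection (dense) layers, with softmax on the last layer
    giving a probability distribution over the NER tag set."""
    x = np.maximum(h @ W1, 0)                    # projection 1, 100 units
    x = np.maximum(x @ W2, 0)                    # projection 2, 100 units
    return softmax(x @ W3)                       # projection 3, one unit per tag

h = rng.normal(size=(8, 600))                    # 8 words; 300 units/direction
W1 = rng.normal(size=(600, 100)) * 0.05
W2 = rng.normal(size=(100, 100)) * 0.05
W3 = rng.normal(size=(100, 9)) * 0.05            # 9 tag classes (assumed)
probs = tag_projection(h, W1, W2, W3)
```

Each row of `probs` sums to 1, and the predicted tag for a word is simply the argmax of its row.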
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>Our model is non-linear, and we need to tune a number of hyperparameters for optimal performance. The hyperparameters are as follows. The Character-Level Embedding Size determines the length of the character embedding vector; we used an embedding size of 50. The Word Embedding Size determines the length of the word embedding vector; we set it to 100, with a window size of 5. Following this, we selected the top words and top characters after tokenization and marked all other words as &lt;UNK WORD&gt;: we kept the top 75% of words. We used two CNN layers with 20 and 11 filters respectively for extracting character-level features in a word, and another CNN layer with 10 filters for extracting word-level features in a sentence. The number of Bi-LSTM layers in the tag decoder and the number of hidden units in the Bi-LSTM are also important parameters; we used a single Bi-LSTM layer with 300 hidden units. The number of layers in the feed-forward network and the number of cells in each layer are further hyperparameters: we used 3 dense layers, the first and second with 100 cells each and the third with as many cells as the number of classes to be predicted.</p>
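For reference, the hyperparameters reported above can be collected in one place. The key names below are our own convenience labels, not identifiers from the authors' code:

```python
# Hyperparameters reported in this section (names are illustrative).
HPARAMS = {
    "char_embedding_size": 50,
    "word_embedding_size": 100,
    "word2vec_window": 5,
    "vocab_keep_fraction": 0.75,        # top 75% of words and characters
    "char_cnn_filters": (20, 11),       # two character-level CNN layers
    "word_cnn_filters": 10,             # one word-level CNN layer
    "bilstm_layers": 1,
    "bilstm_hidden_units": 300,
    "dense_layer_cells": (100, 100, "n_classes"),  # last sized to tag set
}
```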
      <sec id="sec-4-1">
        <title>Training</title>
        <p>The Adam optimizer is applied with "categorical crossentropy" loss. We split the data into training and validation sets with a validation split of 20%. The model is trained for 25 epochs with early stopping of patience 2 on validation loss. With this setup, for most of the languages we reached maximum accuracy in around 7 to 9 epochs, after which overfitting started.</p>
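The stopping rule itself is simple enough to state in a few lines. Deep learning frameworks typically provide this as a callback (e.g. Keras' `EarlyStopping` with `monitor="val_loss", patience=2`); the standalone sketch below is for illustration only:

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training halts under early
    stopping: stop once validation loss has failed to improve for
    `patience` consecutive epochs; otherwise run all epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses)
```

With patience 2, a run whose validation loss bottoms out at epoch 7 and then rises stops at epoch 9, matching the behaviour we observed.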
      </sec>
      <sec id="sec-4-2">
        <title>Testing</title>
        <p>For testing, all words not in the system's dictionary are marked with the unknown &lt;UNK WORD&gt; tag, and similarly all characters not in the character list are marked with the unknown &lt;UNK CHAR&gt; tag. All sentences are limited to double the average sentence length fixed earlier during training, and likewise all words are limited to double the average word length fixed during training. Results are predicted for each sentence up to this length; the remaining words are marked with the "other" tag.</p>
        <p>Table 3 describes the F1 scores we achieved for both pre-evaluation and final evaluation for the given five Indian languages. Table 4 describes the F1 scores of all NER tags for all five Indian languages.</p>
        <p>We have developed a named entity recognition system using deep learning for five different languages: Hindi, Telugu, Tamil, Malayalam and Kannada. We have considered both Character-Level and Word-Level features. The model is a language-independent model for Named Entity Recognition in Indian languages which performs better than rule-based approaches. It can be extended to many other Indian languages, as this approach does not require much annotated data, nor an expert in the language to formulate rules for Named Entity Recognition. Evaluation results show that the model performs slightly better for Hindi than for the other languages.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Vinayak</given-names>
            <surname>Athavale</surname>
          </string-name>
          , Shreenivas Bharadwaj, Monik Pamecha, Ameya Prabhu, and
          <string-name>
            <given-names>Manish</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          .
          <article-title>Towards deep learning in hindi NER: an approach to tackle the labelled data sparsity</article-title>
          .
          <source>CoRR, abs/1610.09756</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>H B</given-names>
            <surname>Barathi Ganesh</surname>
          </string-name>
          , K P Soman, U Reshma, Kale Mandar, Mankame Prachi, Kulkarni Gouri, Kale Anitha, and M Anand Kumar.
          <article-title>Information extraction for conversational systems in indian languages - arnekt iecsil</article-title>
          .
          <source>In Forum for Information Retrieval Evaluation</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H B</given-names>
            <surname>Barathi Ganesh</surname>
          </string-name>
          , K P Soman, U Reshma, Kale Mandar, Mankame Prachi, Kulkarni Gouri, Kale Anitha, and M Anand Kumar.
          <article-title>Overview of arnekt iecsil at fire-2018 track on information extraction for conversational systems in indian languages</article-title>
          .
          <source>In FIRE (Working Notes)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Asif</given-names>
            <surname>Ekbal</surname>
          </string-name>
          , Rejwanul Haque, Amitava Das, Venkateswarlu Poka, and Sivaji Bandyopadhyay.
          <article-title>Language independent named entity recognition in indian languages</article-title>
          .
          <source>In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Tomas Mikolov and Ilya Sutskever. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. CoRR, abs/1707.05928, 2017.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Q Tri Tran, TX Thao Pham, Q Hung Ngo, Dien Dinh, and Nigel Collier. Named entity recognition in vietnamese documents. Progress in Informatics Journal, 5:14-17, 2007.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Takahiro Wakao, Robert Gaizauskas, and Yorick Wilks. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of the 16th conference on Computational Linguistics - Volume 1, pages 418-423. Association for Computational Linguistics, 1996.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>