<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IIT(BHU)@IECSIL-FIRE-2018: Language Independent Automatic Framework for Entity Extraction in Indian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akanksha Mishra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajesh Kumar Mundotiya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <email>spal.cseg@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>9</lpage>
      <abstract>
<p>This paper describes our work submitted to the IECSIL track [3] organized by ARNEKT in conjunction with the Forum for Information Retrieval Evaluation (FIRE) 2018. The track focuses on developing language-independent systems for information extraction in Indian languages. We focus on the identification and categorization of entities in text, using word embeddings for feature representation. We propose a bidirectional LSTM recurrent neural network model for entity extraction from the provided text in five Indian languages, namely Hindi, Kannada, Malayalam, Tamil and Telugu. The proposed technique is evaluated in terms of two metrics, accuracy and F1-score.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Recurrent Neural Network</kwd>
        <kwd>Word Embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>With the growing volume and wide availability of structured and
unstructured data, it is important to extract relevant information from the available
text. Relevant information is identified and useful keywords are
extracted from the digital texts. The terms that define the actual meaning of a
sentence and act as its important nouns are extracted; these nouns are further
categorized as datenum, event, location, name, number, organization, occupation,
things, or other.
In the subsequent sections, we discuss the objective of the task and corpus
statistics, present an overview of the framework, and analyze the performance of the
developed system.
We conducted a literature review for one of the fundamental tasks in
Natural Language Processing. Named Entity Recognition can be performed
broadly using three approaches, namely rule-based, machine learning, and deep
learning based approaches. However, rule-based techniques for entity extraction
require language expertise to develop.</p>
      <p>
        Asif et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a statistical conditional random field (CRF) based
approach for some of the Indian languages. The system used both language-dependent
and language-independent features, and linguistic features for some of the languages
were extracted from gazetteer lists. Another work by Asif et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used
a support vector machine based approach for named entity recognition in
Hindi and Bengali. Lexical context patterns were also generated
using unsupervised algorithms. Sujan et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] developed a named entity recognition
system for Hindi using a hybrid feature set with the Maximum Entropy
(MaxEnt) method.
      </p>
      <p>
        A survey of Named Entity Recognition systems for Indian and non-Indian
languages was done by Nita et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They summarized different rule-based and
statistical techniques used for Indian languages. Deep learning based
approaches have been explored for named entity recognition in English and some other
languages; however, not much work has been done generically for Indian languages. Vinayak
et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a recurrent neural network based approach for named entity
recognition in English and Hindi.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Task Description and Corpus</title>
<p>There are two tasks scheduled in the ARNEKT-IECSIL track for information
extraction. The first task is to identify and categorize entities within conversational
systems; the second is to determine relations between the entities extracted by
the first task. We built a system for entity extraction only.</p>
      <p>
        The corpus comprises five Indian languages (Hindi, Kannada, Malayalam, Tamil
and Telugu) and is provided by the task organizers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The corpus is divided into
Train (60%), Test-1 (20%) and Test-2 (20%) sets. Table 1 lists statistics
of the Train, Test-1 and Test-2 sets for Named Entity Recognition. The corpus is
built in such a way that it supports systems that are independent of any particular language.
We now describe the implementation of our approach; the
architecture of our model is shown in Figure 1. In the proposed technique, we generate
vector representations of words and tags, which are fed to a bidirectional Long
Short-Term Memory recurrent neural network to train the model and predict
tags for the test data. The following subsections describe the feature
representation and the model.
      </p>
      <sec id="sec-2-1">
        <title>Feature Representation</title>
        <p>
          Word embeddings are developed to capture information about the words in the
text corpus using Word2Vec [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. We used the continuous bag-of-words (CBOW) training
algorithm to learn the embeddings, considering all the words present in
the corpus. Each sentence is tokenized, and each token is represented as a
100-dimensional vector to construct the embedding.
The different categories of entities are represented as binary vectors using one-hot
encoding: each class is mapped to an integer, which is then
represented as a binary vector in which the position of that integer is marked 1 and all other positions 0.
        </p>
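<p>As an illustration, the one-hot scheme described above can be sketched in plain Python. The tag list below is illustrative only; the actual corpus defines its own tag inventory.</p>

```python
# Map each entity category to an integer, then to a binary vector.
# This tag list is a hypothetical stand-in for the corpus categories.
tags = ["datenum", "event", "location", "name", "number",
        "organization", "occupation", "things", "other"]

tag_to_index = {tag: i for i, tag in enumerate(tags)}

def one_hot(tag):
    """Return a binary vector with a 1 at the tag's integer index."""
    vector = [0] * len(tags)
    vector[tag_to_index[tag]] = 1
    return vector

print(one_hot("location"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0]
```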
      </sec>
      <sec id="sec-2-2">
        <title>Model Description</title>
<p>The word representations discussed earlier are used as input to the BiLSTM layer,
which learns the contextual relationship between words
from both past and future context. The BiLSTM layer uses two LSTM
layers with 100 units each: one is connected in the forward direction and
processes the input in its original order, while the other is connected in the
backward direction and processes a reversed copy of the input sequence.
The outputs of the two layers are concatenated to produce an
embedding for each input token. We further explore various activation functions
and varying recurrent dropout rates at the BiLSTM layer.</p>
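<p>The forward/backward concatenation can be illustrated with a toy simple-RNN forward pass in NumPy. This is a sketch of the bidirectional idea only, not the trained LSTM cell itself; the sizes follow the 100 units per direction used in the model.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, hidden = 100, 100               # 100-d word vectors, 100 units per direction
seq = rng.standard_normal((7, emb_dim))  # a toy 7-token sentence

# Toy parameters for a vanilla tanh RNN cell (a stand-in for an LSTM cell).
Wx = rng.standard_normal((hidden, emb_dim)) * 0.01
Wh = rng.standard_normal((hidden, hidden)) * 0.01

def rnn_pass(inputs):
    """Run a simple tanh RNN over the inputs, returning all hidden states."""
    h = np.zeros(hidden)
    states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

forward = rnn_pass(seq)               # left-to-right pass
backward = rnn_pass(seq[::-1])[::-1]  # right-to-left pass, realigned to tokens

# Each token's representation concatenates both directions: 100 + 100 = 200-d.
bi_output = np.concatenate([forward, backward], axis=1)
print(bi_output.shape)  # (7, 200)
```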
        <p>
          Dropout [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] with varying rate is applied to the hidden neurons between the BiLSTM layer and the output
layer. The output is fed to a dense layer with 10
units that predicts the entity classes. Our approach follows the series of steps
given in Algorithm 1.
        </p>
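<p>A minimal NumPy sketch of dropout as applied between layers, assuming the common inverted-dropout formulation (survivors are rescaled so the expected activation is unchanged):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate):
    """Zero a fraction `rate` of units at random; scale survivors by
    1/(1 - rate) so the expected activation is unchanged (training only)."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

hidden = np.ones(10)            # toy hidden activations from the BiLSTM layer
out = dropout(hidden, rate=0.5) # each unit is now either 0.0 or 2.0
```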
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
<p>We used the Keras1 neural network library to implement the bidirectional Long
Short-Term Memory recurrent neural network, with either TensorFlow2 or Theano3
as backend. The model is trained for 10 epochs with a batch size of 32.
The hyperparameters used in the model are shown in Table 2.
1 https://keras.io/
2 https://www.tensorflow.org/
3 http://deeplearning.net/software/theano/
Algorithm 1: Framework for Entity Extraction</p>
      <p>Data: Train and Test data</p>
<p>Result: Predict the type of entity for each token in the Test data
1 Read the Train and Test datasets
2 Generate a nested list of words and tags for each sentence in the Train dataset
3 Generate a nested list of words for each sentence in the Test dataset
4 Determine the maximum sentence length in the Train and Test datasets
5 Generate lists of unique tags and unique words
6 Represent the unique tags using one-hot encoding
7 Develop word embeddings using the nested list of words of the Train dataset
8 Create a dictionary mapping each word to its vector representation
9 Pad the Train and Test datasets to the maximum length obtained in Step 4
10 Train a model comprising BiLSTM, Dropout and Dense layers
11 Predict tags for the Test tokens</p>
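<p>Steps 4, 5 and 9 of Algorithm 1 (length computation, vocabulary building, and padding) can be sketched in plain Python. The sentences and the PAD token below are hypothetical, for illustration only.</p>

```python
# Toy tokenized sentences standing in for the Train/Test nested word lists.
train = [["modi", "visited", "varanasi"], ["fire", "2018"]]
test = [["pal", "works", "at", "iit", "bhu"]]

# Step 4: maximum sentence length across Train and Test.
max_len = max(len(s) for s in train + test)

# Step 5: list of unique words.
unique_words = sorted({w for s in train + test for w in s})

# Step 9: pad every sentence to max_len with a (hypothetical) PAD token.
PAD = "PAD"
def pad(sentences):
    return [s + [PAD] * (max_len - len(s)) for s in sentences]

padded_train = pad(train)
print(max_len)                         # 5
print([len(s) for s in padded_train])  # [5, 5]
```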
    </sec>
    <sec id="sec-4">
      <title>Results</title>
<p>The evaluation of the proposed technique is carried out on the Test-1 and Test-2
corpora provided by the task organizers. The evaluation is divided into two stages
to help the participants test their systems in real time: pre-evaluation
was performed on the Test-1 corpus and final-evaluation on the Test-2
corpus. Ranking is determined by the accuracy measure, averaged
over all the Indian languages.</p>
<p>A baseline system built on a Naive Bayes classifier was released by the task
organizers during the pre-evaluation stage. We achieved 5.45% better accuracy than
the baseline system on the Test-1 corpus. The accuracies obtained during the
pre-evaluation stage for the different languages are shown in Table 3. Accuracy is
calculated by comparing the output of the proposed system with the
labelled data provided by the organizers.</p>
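<p>The token-level accuracy computation can be sketched as follows; the predicted and gold tags here are made up for illustration.</p>

```python
# Hypothetical predicted vs. gold tags for a handful of tokens.
predicted = ["name", "other", "location", "other", "event"]
gold      = ["name", "other", "location", "number", "event"]

correct = sum(p == g for p, g in zip(predicted, gold))
accuracy = 100.0 * correct / len(gold)
print(accuracy)  # 80.0
```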
<p>For the Test-2 corpus, the task organizers
evaluated the submissions on two metrics, accuracy and F1-score. We made
three submissions for final-evaluation on the Test-2 corpus. Detailed statistics of
the accuracy obtained by the different submissions for the different Indian languages are
listed in Table 4. Submission 3 performs better than the other
two submissions. The F1-scores calculated by the task organizers for the Test-2
corpus are shown graphically for our three submissions
in Figures 2, 3 and 4.
We proposed a fully automatic system for entity extraction in five Indian languages,
namely Hindi, Kannada, Malayalam, Tamil and Telugu. We obtained 91.18%
accuracy during the pre-evaluation stage on the Test-1 corpus and 90.94% accuracy
during the final-evaluation stage on the Test-2 corpus. The system could be improved
either by incorporating language-specific features or by representing
tokens at the character level.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
<p>We would like to thank the organizers for giving us the opportunity to work on
this challenging task and for providing guidelines and support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Athavale</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bharadwaj</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pamecha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrivastava</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards deep learning in hindi NER: an approach to tackle the labelled data sparsity</article-title>
          .
          <source>CoRR abs/1610.09756</source>
          (
          <year>2016</year>
          ), http://arxiv.org/abs/1610.09756
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.B.</given-names>
            ,
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.P.</given-names>
            ,
            <surname>Reshma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            ,
            <surname>Mandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Prachi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Anitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Anand</surname>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Information extraction for conversational systems in indian languages - arnekt iecsil</article-title>
          .
          <source>In: Forum for Information Retrieval Evaluation</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.B.</given-names>
            ,
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.P.</given-names>
            ,
            <surname>Reshma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            ,
            <surname>Mandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Prachi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Anitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Anand</surname>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Overview of arnekt iecsil at fire-2018 track on information extraction for conversational systems in indian languages</article-title>
          .
          <source>In: FIRE (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition using support vector machine: A language independent approach</article-title>
          .
          <source>International Journal of Electrical, Computer, and Systems Engineering</source>
          <volume>4</volume>
          (
          <issue>2</issue>
          ),
          <fpage>155</fpage>
          -
          <lpage>170</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poka</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Language independent named entity recognition in indian languages</article-title>
          .
          <source>In: Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Patil</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patil</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pawar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Survey of named entity recognition systems with respect to indian and foreign languages</article-title>
          .
          <source>International Journal of Computer Applications</source>
          <volume>134</volume>
          (
          <issue>16</issue>
          ) (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A hybrid feature set based maximum entropy hindi named entity recognition</article-title>
          .
          <source>In: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>