<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Khushleen@IECSIL-FIRE-2018: Indic Language Named Entity Recognition Using Bidirectional LSTMs with Subword Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>khushleendhanoa@gmail.com</email>
        </contrib>
        <aff>Shaheed Udham Singh College of Engineering &amp; Technology, Kharar Banur Highway, Tangori, Punjab, India</aff>
      </contrib-group>
      <abstract>
        <p>Named Entity Recognition (NER) generally requires a large amount of tagged data to build a high-performing system. Representation has always been a bottleneck for NER performance. The NER subtask of IECSIL provided enough data for algorithms to learn semantic representations and to apply deep learning models. The current work uses a subword-aware word representation to generate embeddings. These embeddings are then fed to a bidirectional LSTM to build an NER system. The system performed well for all the Indian languages in the task and ranked among the top three submissions.</p>
      </abstract>
      <kwd-group>
        <kwd>Indic Languages</kwd>
        <kwd>Bidirectional LSTMs</kwd>
        <kwd>Subword Information</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The most celebrated approaches to Named Entity Recognition (NER) have been
Conditional Random Fields and Support Vector Machines with feature engineering
[<xref ref-type="bibr" rid="ref1">1</xref>]. Recent advances in representation learning and neural network
algorithms have opened the door to new possibilities. A word representation
learned on sufficient data, followed by a suitable deep learning algorithm, can
outperform existing state-of-the-art approaches.
      </p>
      <p>
        Representation learning algorithms play a crucial role in determining
system performance. Word embedding methods such as word2vec by Mikolov et
al. [<xref ref-type="bibr" rid="ref2">2</xref>] and GloVe by Pennington et al. [<xref ref-type="bibr" rid="ref3">3</xref>] have helped achieve much better results
than before. Neither of these well-known embedding algorithms takes subword
information into account. The vector representation proposed by Bojanowski et al. [<xref ref-type="bibr" rid="ref4">4</xref>]
extends Mikolov's skip-gram model with character n-grams, so that each word is
represented as the sum of its character n-gram vectors. Such character-level
methods also make it possible to learn embeddings for rare words, which
are otherwise poorly trained. Representation techniques alone, however, do not
guarantee better performance. They require a suitable algorithm that can
leverage the aforementioned character-level subword information. The sequence-in,
sequence-out architecture of Recurrent Neural Networks (RNNs), and
more specifically Long Short-Term Memory (LSTM), is a good fit
for this requirement. We used word embeddings with subword information
followed by a bidirectional LSTM for our architecture.
      </p>
      <p>
        Since the shared task [<xref ref-type="bibr" rid="ref9">9</xref>] focused only on Indian languages, namely Hindi,
Tamil, Kannada, Malayalam and Telugu, which are morphologically rich, considering
subword information helps learn morphology-aware word representations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpus Statistics</title>
      <p>
        The corpora provided by the shared task organizers covered five languages, namely
Hindi, Tamil, Malayalam, Kannada and Telugu [<xref ref-type="bibr" rid="ref8">8</xref>]. The corpus size was sufficient
to leverage deep learning techniques. The corpus for each language was split
into three parts: training (60%), testing phase 1 (20%) and
testing phase 2 (20%). The phase-2 test corpus was used for the final ranking of the
submitted systems. The training and testing statistics are provided in Table 1.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p><bold>Word embeddings with subword information</bold></p>
      <p>
        Neural network based word representations were proposed by Collobert and
Weston [<xref ref-type="bibr" rid="ref5">5</xref>], who used a simple feed-forward network. That approach does not
capture long-range relationships among words. The distributional representation
technique proposed more recently by Mikolov et al. uses a log-bilinear model to learn
continuous word representations. It works well only when a very large amount of data
is available to learn the representations efficiently.
      </p>
      <p>
        The aforementioned techniques represent each word in the vocabulary as a
distinct vector, which does not allow parameter sharing among words.
Morphological structure is hard to capture this way, since agglutinative languages
contain many word forms that rarely occur in the training data. A good
representation can be learned only if all these word forms are considered while learning
the continuous vector representation. Since training corpora cannot contain every word
form of a morphologically rich language, using character-level
information helps improve the word representation. Bojanowski et al. [<xref ref-type="bibr" rid="ref4">4</xref>]
observed that including character-level information helps produce representations for rare
and out-of-vocabulary words. Essentially, a word is
represented by the sum of the vectors of its character n-grams. The resulting scoring
function is
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c</tex-math>
        </disp-formula>
where <italic>w</italic> is a word represented as a bag of character n-grams, <italic>G<sub>w</sub></italic> is the
collection of n-grams appearing in <italic>w</italic>, <italic>z<sub>g</sub></italic> is the vector representation of n-gram <italic>g</italic>,
and <italic>v<sub>c</sub></italic> is the vector of the context word <italic>c</italic>.
      </p>
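      <p>The scoring function in Eq. (1) can be illustrated with a minimal Python sketch. It is not the fastText implementation. The boundary markers "&lt;" and "&gt;", the n-gram lengths 3 to 6 and the inclusion of the whole word as an extra unit follow the description by Bojanowski et al. [<xref ref-type="bibr" rid="ref4">4</xref>]; the function names (char_ngrams, score) and the toy vectors are hypothetical and serve only to make the example self-contained.</p>
      <preformat><![CDATA[
```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word` (G_w in Eq. 1).

    The word is wrapped in the boundary markers '<' and '>', and the
    full wrapped word is kept as one extra unit.
    """
    wrapped = "<" + word + ">"
    grams = {wrapped}
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(word, context_vec, ngram_vecs):
    """s(w, c): sum over g in G_w of z_g . v_c.

    n-grams without an entry in `ngram_vecs` are skipped. This is a
    simplification: fastText hashes n-grams into a fixed number of buckets.
    """
    return sum(dot(ngram_vecs[g], context_vec)
               for g in char_ngrams(word) if g in ngram_vecs)


# Toy 2-dimensional n-gram vectors z_g (hypothetical values).
z = {"<wh": [1.0, 0.0], "her": [0.0, 2.0], "re>": [1.0, 1.0]}
v_c = [0.5, 0.25]  # toy context vector

print(sorted(char_ngrams("where", 3, 3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
print(score("where", v_c, z))  # 1.75
```
]]></preformat>
      <p>Because the score is a sum over n-grams, a rare or unseen word form still receives a representation from the n-grams it shares with frequent forms, which is the parameter sharing that whole-word vectors lack.</p>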
      <p><bold>Bidirectional Long Short-Term Memory (BiLSTM)</bold></p>
      <p>
        The Recurrent Neural Network (RNN) [<xref ref-type="bibr" rid="ref6">6</xref>] family of neural networks is the de facto
standard for sequence data. Unlike Convolutional Neural
Networks, RNNs accept sequences of variable length and can retain long-range
dependencies. The vanilla RNN, however, suffers from vanishing and exploding gradients,
which make it hard to learn longer-range dependencies.
LSTM networks [<xref ref-type="bibr" rid="ref7">7</xref>] were designed for this situation. An LSTM is a particular
type of RNN that works better than the vanilla RNN owing to its more powerful update
equation and a slightly different backpropagation. It can
selectively remember (add) or forget (remove) information, owing
to carefully regulated gates: the input gate, the forget gate and the output gate.
An LSTM can also read the input
sequence either unidirectionally or bidirectionally. In the bidirectional case, it reads
the sequence left to right as well as right to left. This takes more memory but
has been shown to give better results. The basic structure of an LSTM cell is depicted in
Fig. 1.
      </p>
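      <p>The gating and the bidirectional reading described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the system used in this work: it uses the standard LSTM update with a single scalar unit, the weights are hypothetical hand-picked values rather than trained ones, and the names lstm_step, run and bilstm are introduced only for this example.</p>
      <preformat><![CDATA[
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar (single-unit) LSTM cell.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))    # how much new information to add
    f = sigmoid(pre("forget"))   # how much of the old cell state to keep
    o = sigmoid(pre("output"))   # how much of the cell state to expose
    g = math.tanh(pre("cand"))   # candidate cell content
    c = f * c_prev + i * g       # forget old content, add new content
    h = o * math.tanh(c)
    return h, c


def run(seq, w):
    """Run the cell over `seq`; return the hidden state at each position."""
    h, c, out = 0.0, 0.0, []
    for x in seq:
        h, c = lstm_step(x, h, c, w)
        out.append(h)
    return out


def bilstm(seq, w_fwd, w_bwd):
    """Pair the left-to-right and right-to-left hidden states per position."""
    fwd = run(seq, w_fwd)
    bwd = run(seq[::-1], w_bwd)[::-1]  # re-align with the original order
    return list(zip(fwd, bwd))


# Hypothetical toy weights; a trained model would learn these.
W = {"input": (0.5, 0.1, 0.0), "forget": (0.3, 0.2, 1.0),
     "output": (0.4, 0.1, 0.0), "cand": (1.0, 0.5, 0.0)}

states = bilstm([1.0, -1.0, 0.5], W, W)
print(states)
```
]]></preformat>
      <p>Each position ends up with two hidden states, one summarizing the left context and one summarizing the right context. Keeping both is what costs the extra memory mentioned above.</p>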
      <p><bold>Subword word embedding with bidirectional LSTM</bold></p>
      <p>The implementation was completed in two steps. First, the text
corpus of each language was processed with the word embedding module
fastText<sup>1</sup>. The same parameters were used for all the languages
in order to build a unified model. The resulting 300-dimensional continuous vector
representations were then fed to a two-layer BiLSTM with 64
neurons per layer. The number of epochs and the batch size were 35 and 128,
respectively. The BiLSTM topology was kept the same for all the languages
to make the model unified and language independent.</p>
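      <p>To give a sense of the size of this architecture, the following sketch records the reported hyperparameters and counts the recurrent weights they imply. Several details are assumptions that the text does not specify: that 64 is the number of units per direction, that each gate has a single bias vector, and that each layer concatenates its forward and backward outputs before feeding the next layer. The output (tagging) layer is not counted, and the names BiLSTMConfig, lstm_params and bilstm_params are hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiLSTMConfig:
    embedding_dim: int = 300  # fastText vector size (from the text)
    hidden_units: int = 64    # neurons per layer (from the text)
    num_layers: int = 2       # stacked BiLSTM layers (from the text)
    epochs: int = 35          # from the text
    batch_size: int = 128     # from the text


def lstm_params(input_dim, hidden):
    """Weights of one LSTM direction: 4 gates, each with input weights,
    recurrent weights and one bias vector (assumed convention)."""
    return 4 * (hidden * (input_dim + hidden) + hidden)


def bilstm_params(cfg):
    """Total recurrent weights, assuming forward and backward outputs
    are concatenated before the next layer."""
    total, input_dim = 0, cfg.embedding_dim
    for _ in range(cfg.num_layers):
        total += 2 * lstm_params(input_dim, cfg.hidden_units)
        input_dim = 2 * cfg.hidden_units
    return total


cfg = BiLSTMConfig()
print(bilstm_params(cfg))  # 285696
```
]]></preformat>
      <p>Under these assumptions the first layer accounts for 186,880 weights and the second for 98,816, and the count is the same for every language because the topology is shared.</p>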
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The unified system developed for the shared task performed well across all the
languages. The results reported by the organizers are shown in Table 2. It can be
observed from the results (in bold) that the system performance was comparable
across all the languages. This suggests that the same model can be ported to other Indian
languages.</p>
      <p>[Table 2: results reported by the organizers. Teams listed: hilt, raiden11, SSN NLP, am905771, idrbt-team-a, khushleen, Ajees, hariharanv, rohitkodali.]</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>NER requires a large amount of tagged data to build a decent
system. In this shared task, we used a subword-aware (character n-gram) word
representation to generate embeddings. These embeddings were then
used with a bidirectional LSTM, with common settings across all the languages, to
build a unified model. The developed system performed well and gave consistent
results across all the Indian languages in the task. The same model can be used for other
Indian languages.</p>
      <p><sup>1</sup> https://fasttext.cc/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. <string-name><surname>Ratinov</surname>, <given-names>Lev</given-names></string-name> and <string-name><surname>Roth</surname>, <given-names>Dan</given-names></string-name>: <article-title>Design challenges and misconceptions in named entity recognition</article-title>. <source>Proceedings of the Thirteenth Conference on Computational Natural Language Learning</source>, <fpage>147</fpage>–<lpage>155</lpage> (<year>2009</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
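        <mixed-citation>
          2. <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name> and <string-name><surname>Chen</surname>, <given-names>Kai</given-names></string-name> and <string-name><surname>Corrado</surname>, <given-names>Greg</given-names></string-name> and <string-name><surname>Dean</surname>, <given-names>Jeffrey</given-names></string-name>: <article-title>Efficient estimation of word representations in vector space</article-title>. <source>arXiv preprint arXiv:1301.3781</source> (<year>2013</year>)
        </mixed-citation>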
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. <string-name><surname>Pennington</surname>, <given-names>Jeffrey</given-names></string-name> and <string-name><surname>Socher</surname>, <given-names>Richard</given-names></string-name> and <string-name><surname>Manning</surname>, <given-names>Christopher</given-names></string-name>: <article-title>GloVe: Global vectors for word representation</article-title>. <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>, <fpage>1532</fpage>–<lpage>1543</lpage> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
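        <mixed-citation>
          4. <string-name><surname>Bojanowski</surname>, <given-names>Piotr</given-names></string-name> and <string-name><surname>Grave</surname>, <given-names>Edouard</given-names></string-name> and <string-name><surname>Joulin</surname>, <given-names>Armand</given-names></string-name> and <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name>: <article-title>Enriching word vectors with subword information</article-title>. <source>arXiv preprint arXiv:1607.04606</source> (<year>2016</year>)
        </mixed-citation>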
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. <string-name><surname>Collobert</surname>, <given-names>Ronan</given-names></string-name> and <string-name><surname>Weston</surname>, <given-names>Jason</given-names></string-name>: <article-title>A unified architecture for natural language processing: Deep neural networks with multitask learning</article-title>. <source>Proceedings of the 25th International Conference on Machine Learning</source>, <fpage>160</fpage>–<lpage>167</lpage> (<year>2008</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. <string-name><surname>Hochreiter</surname>, <given-names>Sepp</given-names></string-name> and <string-name><surname>Schmidhuber</surname>, <given-names>Jürgen</given-names></string-name>: <article-title>Long short-term memory</article-title>. <source>Neural Computation</source> <volume>9</volume>, <fpage>1735</fpage>–<lpage>1780</lpage> (<year>1997</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. <string-name><surname>Lipton</surname>, <given-names>Zachary C</given-names></string-name> and <string-name><surname>Berkowitz</surname>, <given-names>John</given-names></string-name> and <string-name><surname>Elkan</surname>, <given-names>Charles</given-names></string-name>: <article-title>A critical review of recurrent neural networks for sequence learning</article-title>. <source>arXiv preprint arXiv:1506.00019</source> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. <string-name><surname>Barathi Ganesh</surname>, <given-names>H B</given-names></string-name> and <string-name><surname>Soman</surname>, <given-names>K P</given-names></string-name> and <string-name><surname>Reshma</surname>, <given-names>U</given-names></string-name> and <string-name><surname>Mandar</surname>, <given-names>Kale</given-names></string-name> and <string-name><surname>Prachi</surname>, <given-names>Mankame</given-names></string-name> and <string-name><surname>Gouri</surname>, <given-names>Kulkarni</given-names></string-name> and <string-name><surname>Anitha</surname>, <given-names>Kale</given-names></string-name> and <string-name><surname>Anand Kumar</surname>, <given-names>M</given-names></string-name>: <article-title>Information Extraction for Conversational Systems in Indian Languages - Arnekt IECSIL</article-title>. <source>Forum for Information Retrieval Evaluation</source> (<year>2018</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. <string-name><surname>Barathi Ganesh</surname>, <given-names>H B</given-names></string-name> and <string-name><surname>Soman</surname>, <given-names>K P</given-names></string-name> and <string-name><surname>Reshma</surname>, <given-names>U</given-names></string-name> and <string-name><surname>Mandar</surname>, <given-names>Kale</given-names></string-name> and <string-name><surname>Prachi</surname>, <given-names>Mankame</given-names></string-name> and <string-name><surname>Gouri</surname>, <given-names>Kulkarni</given-names></string-name> and <string-name><surname>Anitha</surname>, <given-names>Kale</given-names></string-name> and <string-name><surname>Anand Kumar</surname>, <given-names>M</given-names></string-name>: <article-title>Overview of Arnekt IECSIL at FIRE-2018 Track on Information Extraction for Conversational Systems in Indian Languages</article-title>. <source>FIRE (Working Notes)</source> (<year>2018</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>