<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Online Learning for Solving Data Availability Problem in Natural Language Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>K.V.D.J.P. Kumarasinghe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Colombo School of Computing</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In Machine Learning (ML), more data is generally better data, and in most cases more data beats better algorithms for improving performance. In practice, such data may not be available up front but only becomes available over time. Existing ML solutions for Sinhala Named Entity Recognition (NER) and Part of Speech (POS) tagging typically retrain their models from scratch when new data becomes available. The training time required for this retraining increases in proportion to the overall dataset size. This research proposes two online ML models that do not require retraining from scratch when data is obtained in batches: an Online Conditional Random Fields (CRF) model and a Bidirectional Long Short-Term Memory-Conditional Random Fields (LSTM-CRF) model. In the Sinhala NER experiments, the Online CRF model improved on previous attempts by an F1-measure of 31.5% to reach 76%, while the Bidirectional LSTM-CRF model improved on a previous attempt by an F1-measure of 51.6% to reach 80%. In the Sinhala POS tagging experiments, the Online CRF model improved the accuracy from approximately 71% to 76%, while the Bidirectional LSTM-CRF model improved the accuracy from 70% to 76%. The training time consumed by the proposed online learning models remains constant in each incremental training step since the model is not rebuilt from scratch. Further, both the Online CRF model and the Bidirectional LSTM-CRF model improved on the state-of-the-art Sinhala POS tagging accuracy by 4%. Extrapolating these results suggests that the existing Sinhala NER dataset needs to approximately double in size to reach the state-of-the-art performance reported for English.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Online Machine Learning</kwd>
        <kwd>Sinhala Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent approaches to Named Entity Recognition (NER) and Part of Speech (POS)
tagging use Machine Learning (ML) techniques [
        <xref ref-type="bibr" rid="ref11 ref14">14,11</xref>
        ]. The size of the training corpus
is the main factor in obtaining higher accuracy from such ML
models. For languages such as English, large numbers of datasets already exist
to train models on. However, obtaining such large datasets at the outset is impractical for most
languages, especially low-resource natural languages such as Sinhala. In
practice, several (mini-)batches of data have to be obtained at different points in time to
build a dataset large enough to produce better results. In general, we obtain
data in 'mini-batches' x1, x2, x3, ..., xn at different time steps t1, t2, t3, ..., tn. In this
setting, the existing ML techniques for NER and POS tagging initially
train the model using mini-batch x1. After obtaining mini-batch x2, we aggregate
x1 and x2 and then train the model using the aggregated dataset x1 + x2. Similarly,
when x3 becomes available, the model is trained using x1 + x2 + x3.
      </p>
      <p>
        This process shows that batch learning techniques retrain the ML models
on the same data multiple times, which results in a higher
training time. The main motivation for this research is to avoid this retraining
overhead without losing model accuracy. Natural Language
Processing (NLP) models should be able to capture the present context of the
natural language, so the proposed ML models should also adapt to the current
language context using the most recent data. The state-of-the-art methods for Sinhala
NER obtained nearly 92% precision [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Another objective of this research is to predict
how much data is needed to improve on these accuracy values. The problem addressed in this research
arises from the characteristics of batch learning techniques. The proposed ML
models use online learning techniques because online learning can train an ML model
using incrementally collected datasets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <sec id="sec-2-1">
        <title>NER and POS tagging</title>
        <p>The main approaches to NER and POS tagging are rule-based,
data-driven, and hybrid. In the rule-based approach, linguistic knowledge is used to create
a set of rules that identify named entities or part of speech tags. The data-driven
approach, in contrast, depends on three main ML methods: supervised,
semi-supervised, and unsupervised learning. Supervised learning methods use models
such as Conditional Random Fields (CRF), Maximum Entropy (MaxEnt), and Hidden
Markov Models (HMM) to build models from labeled data.</p>
        <p>
          A CRF is an undirected graphical model that corresponds to conditionally trained
probabilistic finite-state automata. Because it is trained conditionally, a CRF can easily
incorporate arbitrary features. The CRF model has been preferred over the HMM
because the CRF solves the label bias problem [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Suppose X is a set of feature vectors
corresponding to a particular data corpus and Z is the set of corresponding NER labels
of each word in X. The CRF model is a graph G(V, E) whose vertices V represent
the NER tags <inline-formula><tex-math>z_i \in Z</tex-math></inline-formula>. In a CRF model, each <inline-formula><tex-math>z_i \in Z</tex-math></inline-formula> is a random variable that adheres
to the Markov property. Z is conditioned on X such that each label depends on the other labels only through its neighbors, <inline-formula><tex-math>p(z_i \mid X, z_j, i \sim j)</tex-math></inline-formula>, where
<inline-formula><tex-math>i \sim j</tex-math></inline-formula> means that vertices i and j are neighbors in G [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Since the CRF model provides
better accuracies for the Sinhala language, we use the CRF model for our experiments
in an incremental manner [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Online and Incremental Learning</title>
        <p>
          Online learning algorithms train on data as it
becomes available rather than all at once. Incremental learning is an online learning strategy
that works with limited memory resources and relies on a compact representation of
the already observed data [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The key challenges related to online and incremental
learning are as follows [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]:
1. Concept Drift: When data becomes available at different time steps, the
data distribution can change over time.
2. Catastrophic Forgetting: Online learning models keep learning as long as data
keeps arriving. When a model learns new information, it may forget
previously learned information. The forgetting speed determines how fast the online
learning model learns new information. This loss of previously learned
information is called catastrophic forgetting.
3. Stability-Plasticity Dilemma: If an online learning model learns new information
quickly, it will forget past information quickly as well. On the other hand, if an
online learning model reduces its learning speed, it will miss some of the crucial
information in the learning process. The challenge of balancing these two extremes is called
the Stability-Plasticity Dilemma.
        </p>
        <p>
          Carreras et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] introduced an approach to NER referred to as the voted
perceptron model, which is based on an online perceptron strategy. The online algorithm
they proposed is mistake-driven. Execution of the
algorithm can be divided into two phases. First, the algorithm learns at the
word level to identify named entity candidates using a Begin-Inside-Outside (BIO)
classification. Then the algorithm makes use of functions learned at the phrase level. The
online learning strategy itself is applied at the sentence level. In our research, we
apply online learning following this mistake-driven strategy. For the English
language, they obtained overall precision, recall, and F-measure values of 85.81%, 82.84%,
and 84.30% respectively [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
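        <p>The mistake-driven idea can be illustrated with a minimal sketch. This is a plain multiclass perceptron, not the voted perceptron of Carreras et al. and not the models proposed in this paper; the class name, feature strings, and the two example tokens are hypothetical. The weights change only when the current prediction is wrong, so each example is processed once as it arrives.</p>
        <preformat>
```python
from collections import defaultdict


class MistakeDrivenPerceptron:
    """Multiclass perceptron that changes its weights only after a wrong prediction."""

    def __init__(self, labels):
        self.labels = list(labels)
        # One sparse weight vector per label: weights[label][feature] -> float
        self.weights = {label: defaultdict(float) for label in self.labels}

    def score(self, features, label):
        w = self.weights[label]
        return sum(w[f] for f in features if f in w)

    def predict(self, features):
        # max() keeps the first label on ties, so label order breaks ties.
        return max(self.labels, key=lambda label: self.score(features, label))

    def update(self, features, gold):
        """Process one example; return True if a mistake triggered an update."""
        guess = self.predict(features)
        if guess == gold:
            return False  # correct prediction: weights stay untouched
        for f in features:
            self.weights[gold][f] += 1.0   # promote the correct label
            self.weights[guess][f] -= 1.0  # demote the wrongly predicted label
        return True


# Hypothetical two-token stream (features and tags are illustrative only).
model = MistakeDrivenPerceptron(["O", "B-PER"])
stream = [({"cap", "w=Nimal"}, "B-PER"), ({"lower", "w=went"}, "O")]
mistakes = [model.update(feats, gold) for feats, gold in stream]
```
        </preformat>
        <p>In this toy stream only the first token causes an update, because the untrained model defaults to the first label; the second token is already classified correctly and leaves the weights unchanged.</p>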
        <p>
          The Recurrent Neural Network (RNN) architecture is designed to learn in an
online manner [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and to handle structured prediction tasks, which makes it a good fit for the
problem at hand. However, the RNN model performs poorly when the sequence prediction task
involves long-term dependencies. To handle these long-term
dependencies, researchers enhanced the internal structure of the RNN cells into Long Short-Term
Memory (LSTM) cells [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>
          Athavale et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] applied a deep neural network approach to Hindi NER using
different kinds of RNN layers, namely vanilla RNN, LSTM, and bidirectional LSTM.
Of these three types, the bidirectional LSTM model outperformed the other
two. As their final result, they obtained 90.32% accuracy on the Conference on Natural
Language Learning (CoNLL) 2003 dataset without using any gazetteer information.
Chiu and Nichols [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] implemented a hybrid bidirectional LSTM and Convolutional
Neural Network (CNN) architecture for their classification. Their experiments
used word-level and character-level features for NER classification, and they
obtained an F1 score of 91.62 on the CoNLL-2003 dataset. Huang et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] combined an
LSTM model with a CRF model for sequence tagging tasks. The bidirectional LSTM
layer can use past and future features to make predictions, while the
CRF layer can use sentence-level features. They obtained 97.55%
accuracy from their bidirectional LSTM-CRF model.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation</title>
      <p>The implementation of the research followed the common machine learning pipeline as
below:
1. Data Preprocessing
2. Apply Online Learning Algorithm
3. Post Processing</p>
      <p>
        This research follows a supervised learning approach because supervised
learning methods have obtained superior performance compared to other learning methods [
        <xref ref-type="bibr" rid="ref11 ref14">14,11</xref>
        ].
The vocabulary was ordered according to word frequency.
      </p>
      <preformat>Algorithm 1 Train and Predict
procedure TRAIN_AND_PREDICT(data, tokens)
    train_data, test_data, validation_data ← SPLIT(data)
    train_tokens, test_tokens, validation_tokens ← SPLIT(tokens)
    X1, X2, X3, X4 ← CREATE_MINIBATCHES(train_data)
    Y1, Y2, Y3, Y4 ← CREATE_MINIBATCHES(train_tokens)
    ML ← INITIALIZE(params)
    for i in (1, 2, 3, 4) do
        TRAIN(ML, Xi, Yi)
        VALIDATE(ML, validation_data, validation_tokens)
        results ← PREDICT(ML, test_data)
    end for
    ANALYZE(results, test_tokens)
end procedure</preformat>
      <p>Algorithm 1 first splits the data using a SPLIT function, dividing the dataset into
train, test, and validation sets. Subsequently, the CREATE_MINIBATCHES function
creates four (in this case) mini-batches from the training dataset. The
ML model is then initialized. The for loop in the algorithm iterates through each
mini-batch, and the training, validation, and testing procedures are applied to each
mini-batch in that loop. At the end of the for loop, the predicted results are stored in the variable
results, which is used to analyze the performance of the proposed ML models.</p>
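      <p>The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the data is assumed to be split already, FrequencyTagger is a hypothetical stand-in that merely counts word/tag pairs (it is neither the Online CRF nor the Bidirectional LSTM-CRF model), and the toy sentences and tags are invented. What the sketch preserves is that the same model object is trained on one mini-batch at a time and evaluated after every step.</p>
      <preformat>
```python
from collections import Counter, defaultdict


class FrequencyTagger:
    """Stand-in incremental model: remembers the most frequent tag per word."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.tag_totals = Counter()

    def train(self, sentences, tag_sequences):
        # Counts accumulate across calls, so earlier mini-batches are never revisited.
        for words, tags in zip(sentences, tag_sequences):
            for word, tag in zip(words, tags):
                self.counts[word][tag] += 1
                self.tag_totals[tag] += 1

    def predict(self, sentences):
        default = self.tag_totals.most_common(1)[0][0]  # fallback for unseen words
        return [[self.counts[w].most_common(1)[0][0] if w in self.counts else default
                 for w in words] for words in sentences]


def accuracy(predicted, gold):
    pairs = [(p, g) for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs)


def create_minibatches(items, n):
    # Equal-sized slices; any remainder is dropped in this simplified version.
    size = len(items) // n
    return [items[i * size:(i + 1) * size] for i in range(n)]


def train_and_predict(train_x, train_y, val_x, val_y, test_x, test_y, n_batches=4):
    model = FrequencyTagger()  # initialised once, never rebuilt
    history = []
    for xb, yb in zip(create_minibatches(train_x, n_batches),
                      create_minibatches(train_y, n_batches)):
        model.train(xb, yb)  # incremental step on the newest mini-batch only
        val_acc = accuracy(model.predict(val_x), val_y)
        test_acc = accuracy(model.predict(test_x), test_y)
        history.append((val_acc, test_acc))
    return history


# Invented toy data: four one-sentence mini-batches.
train_x = [["a", "b"], ["a", "c"], ["b", "c"], ["c", "d"]]
train_y = [["N", "V"], ["N", "N"], ["V", "N"], ["N", "V"]]
history = train_and_predict(train_x, train_y,
                            [["a", "b"]], [["N", "V"]],
                            [["c", "d"]], [["N", "V"]])
```
      </preformat>
      <p>The returned history holds one (validation accuracy, test accuracy) pair per mini-batch, which corresponds to the per-step results analyzed at the end of the procedure.</p>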
      <p>The research proposes two online ML models: an Online Conditional Random
Fields (CRF) Model and a Bidirectional Long Short Term Memory-Conditional
Random Fields (LSTM-CRF) Model.</p>
      <p>
        The architectures of the two proposed models are shown in Figure 1. Model A
depicts the Online CRF model while Model B depicts the bidirectional LSTM-CRF model
with a dropout layer. Both models start with an embedding layer which converts words
into vector representations. In a Bag-of-Words (BoW) vector model, previously unseen words are not
represented because the model generates a sparse vector representation. This problem
of unseen data does not occur with the embedding technique because it creates a dense
vector space [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Both proposed models have a final CRF layer that can use sentence-level
features in its predictions. Model B also uses a bidirectional LSTM layer, which can use
past and future contextual features for making predictions. The architecture of Model B
is influenced by a previously proposed solution [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, we modify that model
by adding a dropout layer. The primary role of this dropout layer is to randomly
ignore several neurons of the network during the training phase. This random dropping
of neurons forces other neurons to contribute to predictions that would normally be
made by the dropped neurons. Over-specialization of certain neurons to specific
predictions is thus removed from the neural network, which allows the model to avoid overfitting [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The previously discussed challenges of online learning (Catastrophic Forgetting,
Concept Drift, and the Stability-Plasticity Dilemma) become critical if the model applies quick
updates. An ML model can handle these challenges if it can regulate these quick
updates. Since the dropout layer regulates these quick updates, the dropout layer can
handle these challenges.
      </p>
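      <p>The random masking performed by a dropout layer can be sketched as follows. This is a generic illustration of the commonly used "inverted" dropout variant, not the implementation used in Model B; the function name, the rate of 0.5, and the activation values are assumptions made for the example.</p>
      <preformat>
```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training
    and rescale the survivors so the expected activation is unchanged."""
    if rate >= 1.0 or 0.0 > rate:
        raise ValueError("rate must lie in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # at prediction time every neuron is kept
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)          # fixed seed, chosen only for repeatability
hidden = [0.5, 1.0, 1.5, 2.0]   # hypothetical layer activations
train_out = dropout(hidden, rate=0.5, rng=rng)
infer_out = dropout(hidden, rate=0.5, rng=rng, training=False)
```
      </preformat>
      <p>During training each activation is either zeroed or scaled by 1 / (1 - rate); outside training the activations pass through unchanged.</p>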
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Two tasks were carried out in this research: (1) Sinhala NER and (2) Sinhala POS
tagging. For these experiments we used Sinhala NER and POS tagging datasets from
the Language Technology Research Laboratory (LTRL) of the University of Colombo
School of Computing (UCSC) under the Lesser General Public License (LGPL) for
Linguistic Resources. The datasets applied to the proposed models adhere to the
CoNLL-2003 format.</p>
      <p>The research proposed two online learning models, namely an Online CRF and
a Bidirectional LSTM-CRF. We carried out one experiment per model for each task mentioned
above, so the overall research consisted of four experiments. For
each experiment, we simulated both the batch learning technique and the online
learning technique in order to compare their performance.
Each experiment used four mini-batches of data, denoted
m1, m2, m3, and m4. A separate dataset T is used to test the performance
of the models. In each training step, the batch learning model is retrained from scratch, while
the online learning model is saved and reused for further training. The training
phase of both the batch learning approach and the online learning approach consisted of four
steps, which can be
described as follows:</p>
      <p>– Batch Learning Experiment</p>
      <p>Step 1: The model is initially trained using m1.</p>
      <p>Step 2: When mini-batch m2 becomes available, the model is trained from scratch using
the whole aggregated dataset (m1 + m2).</p>
      <p>Step 3: When mini-batch m3 becomes available, the model is trained from scratch using
the whole aggregated dataset (m1 + m2 + m3).</p>
      <p>Step 4: When mini-batch m4 becomes available, the model is trained from scratch using
the whole aggregated dataset (m1 + m2 + m3 + m4).</p>
      <p>– Online Learning Experiment</p>
      <p>Step 1: The model is initially trained using mini-batch m1 and tested on dataset T.</p>
      <p>Step 2: When m2 becomes available, the model is trained using only m2.</p>
      <p>Step 3: When m3 becomes available, the model is trained using only m3.</p>
      <p>Step 4: When m4 becomes available, the model is trained using only m4.</p>
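      <p>The difference between the two protocols can be made concrete by listing the data that each one trains on at every step. The sketch below is illustrative only: the function names are hypothetical, sentences are represented by integer identifiers, and the mini-batch size of 662 sentences is the one reported for the Sinhala NER experiment. The number of sentences processed is used here as a rough proxy for training effort, not as a measurement of training time.</p>
      <preformat>
```python
def batch_schedule(minibatches):
    """Training set used at each step when retraining from scratch."""
    return [sum(minibatches[:step + 1], []) for step in range(len(minibatches))]


def online_schedule(minibatches):
    """Training set used at each step when only the newest mini-batch is seen."""
    return [list(mb) for mb in minibatches]


# Four mini-batches m1..m4 of 662 sentence identifiers each.
minibatches = [list(range(i * 662, (i + 1) * 662)) for i in range(4)]

batch_sizes = [len(step) for step in batch_schedule(minibatches)]
online_sizes = [len(step) for step in online_schedule(minibatches)]
```
      </preformat>
      <p>Under these assumptions the batch protocol processes 662, 1324, 1986, and 2648 sentences in its four steps, whereas the online protocol processes 662 sentences in every step.</p>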
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Online CRF Model</title>
        <p>Sinhala NER Experiment. For the Sinhala NER experiment, Figure 2 shows the
variation of precision, recall, and F1-measure of the online and batch learning techniques in
each step of the experiment. The precision of batch learning was higher in
the first three steps, but in the fourth step the online learning method performed better.
The recall of batch learning was higher in all steps
except the third. The F1-measure of batch learning was
higher throughout the experiment. However, the online learning F1-measure was
close to the batch learning value for the last mini-batch.</p>
        <p>Note that batch learning and online learning give different accuracies in the initial
step of this experiment. Since both use the same dataset in that step,
the accuracies are close most of the time. However, the two models are trained separately.
Each ML model (batch and online) starts from a set of random values and is then
trained on the data, and these random initializations may have contributed to the difference in the
initial accuracies. More importantly, even though the same dataset is used for initial
training, the training procedures are quite different: the online learning algorithm
performs quick updates to the ML model while the batch learning algorithm does not.
These updates also affect the accuracies of online and batch
learning.</p>
        <p>Sinhala POS Tagging Experiment. We checked the applicability of the proposed
Online CRF model to the Sinhala POS tagging task. Figure 3 depicts the accuracy
variation of the POS tagging experiment for both the batch learning and online learning
approaches. The accuracy of the online learning model was close to that of
the batch learning technique. In the first step of the experiment, the online
learning model obtained slightly higher accuracy than the batch learning approach.
However, in the other steps of the experiment, the batch learning technique showed
slightly higher accuracy.</p>
        <p>Sinhala NER Experiment (Bidirectional LSTM-CRF model). For the Sinhala NER experiment, Figure 4 shows the
variation of precision, recall, and F1-measure of the batch and
online learning techniques at each step of the experiment. The precision of the
online learning model was close to that of the batch learning model except in the
second step. For recall, batch learning gave higher values in the second and third steps,
while in the other steps online learning
was close to batch learning. Similarly, the F1-measure
of the online learning model was close to that of the batch learning model except
in the second step.</p>
        <p>Sinhala POS Tagging Experiment (Bidirectional LSTM-CRF model). We also checked the applicability of the
bidirectional LSTM-CRF model to the Sinhala POS tagging task. The variation in the
accuracy of the online and batch learning strategies is shown in Figure 5. The accuracy values
of the online learning and batch learning techniques are close to each other, as is
apparent from Figure 5.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Training Time Comparison</title>
        <p>As shown in Figure 6, the training time consumed by the online learning models remains
almost constant in each step, whereas the training time of the batch learning models
increases linearly from step to step.</p>
        <p>One of the main objectives of this research was to find out how much data would be
needed for the proposed models to reach acceptable accuracy. The research analyzed
the performance variation of the experiments across the mini-batches. For each
experiment, we fit a function that approximates the performance variation. That
approximating function is then used to predict the amount of data that the proposed models
require to obtain acceptable performance. For the NER experiment, we considered
the F1-measure variation since it combines precision and recall. Since the
POS tagging experiment has more than 22 labels to be predicted, it is convenient to use
accuracy as the evaluation metric instead of precision, recall, and F1-measure. Hence,
the accuracy variation is used for the dataset size prediction in the Sinhala POS tagging
experiment.</p>
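        <p>For reference, the F1-measure is the harmonic mean of precision and recall. The short sketch below (the function name is our own) computes it and, as a sanity check, applies it to the English precision and recall values of Carreras et al. quoted in the related work.</p>
        <preformat>
```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (both given on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 85.81% and recall 82.84% as quoted from Carreras et al.
carreras_f1 = f1_measure(85.81, 82.84)
```
        </preformat>
        <p>Rounded to two decimals, the result agrees with the 84.30% F-measure quoted for that work.</p>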
        <p>From Figures 2 and 4, we observed that the performance variation of NER
follows a logarithmic trend. Hence, the F1-measure variation of Sinhala NER was
approximated using logarithmic regression. The results of the NER prediction are shown
in Table 1. The first column of Table 1 contains the experiment name. The second
column contains the function approximated using logarithmic regression. These
logarithmic functions are defined on the range [0, 100). The independent variable
of these functions is the number of steps in the experiment. Each step in the Sinhala
NER experiment is trained using a separate data source (mini-batch) of 662 sentences.
Thus the third column contains the number of data sources (each of 662
sentences) needed to obtain near-optimal performance. The final column shows
how many sentences are needed to obtain such performance based on the fitted model.</p>
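        <p>The extrapolation procedure can be sketched as a least-squares fit of score = a + b ln(step), followed by solving for the step at which a target score is reached. The sketch below is an assumption-laden illustration: the exact functional form and the target values used in the paper are those of Table 1 and are not reproduced here, and the scores in the example are synthetic values generated from a known curve (a = 40, b = 15) purely to show that the fit recovers it. Only the mini-batch size of 662 sentences is taken from the experiment.</p>
        <preformat>
```python
import math


def fit_log_curve(steps, scores):
    """Least-squares fit of score = a + b * ln(step)."""
    xs = [math.log(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def steps_needed(a, b, target):
    """Smallest whole number of steps at which the fitted curve reaches target.
    Assumes b is positive, i.e. the score grows with more data."""
    return math.ceil(math.exp((target - a) / b))


# Synthetic scores from a known curve, NOT results from the paper.
steps = [1, 2, 3, 4]
scores = [40 + 15 * math.log(s) for s in steps]

a, b = fit_log_curve(steps, scores)
n_steps = steps_needed(a, b, target=65)   # hypothetical target score
n_sentences = n_steps * 662               # 662 sentences per mini-batch
```
        </preformat>
        <p>With a mini-batch size of 662 sentences, a requirement of k steps translates into 662 k sentences, which is how the step counts in the third column relate to the sentence counts in the final column.</p>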
        <p>We currently have a Sinhala NER dataset consisting of 3268 sentences. To obtain
acceptable performance from the CRF model, the Sinhala NER data set needs to be at
least 5296 sentences long, while the bidirectional LSTM-CRF model requires a total
of 7282 sentences. Both experiments point to the need to approximately double the
Sinhala NER dataset in order to obtain acceptable performance.</p>
        <p>The Sinhala POS tagging dataset used in the training phase was not large enough
to show an incremental improvement in accuracy across the POS tagging experiments.
Hence, the dataset size prediction for this task could not be calculated.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We proposed two online learning models: an Online Conditional Random Fields
(CRF) model and a Bidirectional Long Short-Term Memory-Conditional Random Fields
(LSTM-CRF) model. Both models can be used across various NLP tasks such as NER and POS
tagging. Since the models do not use any language-dependent features, they can
also be applied to any natural language. The performance of the proposed models
improves with each incremental training step. The dataset size needed to
reach acceptable performance is higher for the Bidirectional LSTM-CRF model
than for the Online CRF model. However, the performance of the Bidirectional
LSTM-CRF model is higher than that of the Online CRF model, while the training time of the
Online CRF model is low compared to the deep learning model. The analysis of the four
experiments showed that online learning techniques can reach the performance of batch
learning. The training time for online learning methods remains nearly constant in each
training step, whereas the training time for batch learning increases linearly. The
inclusion of the dropout layer in the proposed online learning model addressed the stated key
challenges of online learning (Catastrophic Forgetting, Concept Drift, and the Stability-Plasticity Dilemma).
Most importantly, the dropout layer gives consistent growth to the
online ML model.</p>
      <p>
        The state-of-the-art Sinhala POS tagging experiment by Gunasekara et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
obtained 72% accuracy from their hybrid approach. The final accuracy of our Online
CRF model improves on this state-of-the-art accuracy by nearly 4 percentage points. The
Bidirectional LSTM-CRF model also improves on the state-of-the-art accuracy by 4
percentage points. The dataset needed for Sinhala NER to perform at acceptable levels
is approximately double the current size. The Sinhala POS tagging task requires a large
heterogeneous dataset in order to learn new information from each incremental training step;
the current dataset is too small to support such an estimate.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Future Work</title>
      <p>Integrating the online learning based NER and POS tagging models with other NLP
tasks is left as future work. These online learning
techniques will become more valuable once major NLP tasks such as
Information Extraction, Machine Translation, Automatic Summarization, and
Information Retrieval are converted to the online learning strategy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Athavale</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bharadwaj</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pamecha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrivastava</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards deep learning in Hindi NER: an approach to tackle the labelled data sparsity</article-title>
          .
          <source>CoRR abs/1610.09756</source>
          (
          <year>2016</year>
          ), http://arxiv.org/abs/1610.09756
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brownlee</surname>
          </string-name>
          , J.:
          <article-title>Dropout regularization in deep learning models with Keras</article-title>, machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brownlee</surname>
          </string-name>
          , J.:
          <article-title>How to use word embedding layers for deep learning with Keras</article-title>, machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brownlee</surname>
          </string-name>
          , J.:
          <article-title>Instability of online learning for stateful LSTM for time series forecasting</article-title>, machinelearningmastery.com/instability-online-learning-stateful-lstm-time-series-forecasting/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Carreras</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Màrquez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Learning a perceptron-based named entity chunker via online recognition feedback</article-title>
          .
          <source>In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4</source>
          . pp.
          <fpage>156</fpage>
          -
          <lpage>159</lpage>
          . CONLL '03, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2003</year>
          ). https://doi.org/10.3115/1119176.1119198
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chiu</surname>
            ,
            <given-names>J.P.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nichols</surname>
          </string-name>
          , E.:
          <article-title>Named entity recognition with bidirectional LSTM-CNNs</article-title>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1511.08308
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dahanayaka</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weerasinghe</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition for sinhala language</article-title>
          .
          <source>In: 2014 14th International Conference on Advances in ICT for Emerging Regions (ICTer)</source>
          . pp.
          <fpage>215</fpage>
          -
          <lpage>220</lpage>
          (
          <year>Dec 2014</year>
          ). https://doi.org/10.1109/ICTER.2014.7083904
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gepperth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Incremental learning algorithms and applications</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>Distributed asynchronous online learning for natural language processing</article-title>
          .
          <source>In: Proceedings of the Fourteenth Conference on Computational Natural Language Learning</source>
          . pp.
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          . CoNLL '10, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2010</year>
          ), http://dl.acm.org/citation.cfm?id=1870568.1870593
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gunasekara</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welgama</surname>
            ,
            <given-names>W.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weerasinghe</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Hybrid part of speech tagger for Sinhala language</article-title>
          .
          <source>In: 2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer)</source>
          . pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          (
          <year>Sept 2016</year>
          ). https://doi.org/10.1109/ICTER.2016.7829897
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhandari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mistry</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thakor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahir</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Study of named entity recognition for Indian languages</article-title>
          .
          <source>International Journal of Information Sciences and Techniques (IJIST)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>CoRR abs/1508.01991</source>
          (
          <year>2015</year>
          ), http://dblp.uni-trier.de/db/journals/corr/corr1508.html#HuangXY15
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.N.</given-names>
          </string-name>
          :
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In: Proceedings of the Eighteenth International Conference on Machine Learning</source>
          . pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>2001</year>
          ), http://dl.acm.org/citation.cfm?id=645530.655813
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Linguisticae Investigationes</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          (
          <year>January 2007</year>
          ), www.ingentaconnect.com/content/jbp/li/2007/00000030/00000001/art00002, publisher: John Benjamins Publishing Company
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Understanding LSTM networks</article-title>
          , http://colah.github.io/posts/2015-08-Understanding-LSTMs/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>