<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DEXTER - Data EXTraction &amp; Entity Recognition for Low Resource Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nihal V. Nayak</string-name>
          <email>nihal.nayak@stride.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pratheek Mahishi</string-name>
          <email>pratheek@stride.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sagar M. Rao</string-name>
          <email>sagarg@stride.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stride.AI</institution>
          ,
          <addr-line>Bengaluru</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>55</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>Extraction of key information such as named entities, key phrases, and numbers is critical for several banking and financial processes. Banks and Financial Institutions resort to the use of automation tools to reduce the human effort required for these processes. Training a system to extract key datapoints reliably and efficiently from text requires large labeled datasets. However, openly available datasets in the financial sector have limited labeled data. In this paper, we address the issues in developing a data extraction system for low resource datasets. We experiment with a Bi-directional Long Short-Term Memory (Bi-LSTM) model which works well on low resource datasets. We introduce a novel domain-specific Bi-LSTM layer, which allows us to add domain-specific knowledge into the neural architecture. We observe that transfer learning from an out-of-domain dataset boosts the accuracy on several extraction tasks. We create three new low resource financial datasets and demonstrate that our model consistently achieves a high degree of accuracy on them. Furthermore, our model outperforms the reported state of the art results on the Financial NER dataset, achieving an F1 of 87.48. Our experiments consistently show that transfer learning combined with domain-specific knowledge engineering improves entity recognition in a low resource setting.</p>
      </abstract>
      <permissions>
        <copyright-statement>Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA.</copyright-statement>
      </permissions>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Financial Institutions deal with a large number of
documents in the form of contracts, reports, application forms,
etc. These documents are highly unstructured and textual
in nature. Processing such documents involves the extraction
of key information (entities, contract clauses, key phrases,
numbers, etc.). Traditionally, companies have relied on
domain experts to capture this information, which is
time-consuming. However, recent trends suggest that specialized
tools and algorithms are being used to extract key data points
from documents to augment and reduce human effort.</p>
      <p>Building a system to extract datapoints from unstructured
text documents poses several challenges, especially in the
financial domain. First, the style of writing varies significantly
when compared to news articles, blogs, etc., as “domain-specific”
lexicons and jargon are used extensively. Secondly, the
development of any kind of dataset for financial text requires
domain experts to label the data. The process of annotation
is expensive and cumbersome. Lastly, Financial Institutions
are hesitant to share their data as it raises several privacy
concerns. These constraints curtail research in
the field.</p>
      <p>The following sentence is extracted from a financial
document:</p>
      <p>This LOAN AGREEMENT, dated as of November 17,
2014 (this Agreement), is made by and among
Auxilium Pharmaceuticals, Inc., a corporation incorporated
under the laws of the State of Delaware (U.S.
Borrower), Auxilium UK LTD, a private company
limited by shares registered in England and Wales (UK
Borrower and, collectively with the U.S. Borrower,
the Borrowers) and Endo Pharmaceuticals Inc., a
corporation incorporated under the laws of the State of
Delaware (Lender).1</p>
      <p>
        From this sample, we may want to extract the date
(“November 17, 2014”), the type of agreement (“LOAN
AGREEMENT”), the names of the borrowers (“Auxilium
Pharmaceuticals, Inc.” and “Auxilium UK LTD”) and the lender
(“Endo Pharmaceuticals Inc.”). In practice, there are a few
simple approaches for extracting the data. One of them is
a combination of heuristics and out-of-the-box NER tools.
We can make use of regular expressions to extract the date
and the agreement name. We can use spaCy2 or CoreNLP
        <xref ref-type="bibr" rid="ref18">(Manning et al. 2014)</xref>
        to extract the company names. We
observed that this approach is not scalable and requires an
enormous amount of effort to carefully craft the heuristic rules to
capture all the key datapoints across different types of
documents.
      </p>
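      <p>For illustration, the following is a minimal sketch of such a heuristic
baseline in Python, assuming spaCy and its en_core_web_sm model are
installed; the function name and patterns are our own and purely
illustrative, not the exact rules used in practice.</p>
      <preformat>
import re
import spacy

# Hypothetical date and agreement-name patterns.
DATE_PATTERN = re.compile(
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}")
AGREEMENT_PATTERN = re.compile(r"[A-Z][A-Z]+(?:\s+[A-Z][A-Z]+)*\s+AGREEMENT")

nlp = spacy.load("en_core_web_sm")

def heuristic_extract(text):
    """Extract dates, agreement names and organizations from raw text."""
    doc = nlp(text)
    return {
        "dates": DATE_PATTERN.findall(text),
        "agreements": AGREEMENT_PATTERN.findall(text),
        "organizations": [ent.text for ent in doc.ents
                          if ent.label_ == "ORG"],
    }
      </preformat>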
      <p>Therefore, our motivation is to develop a domain
specific datapoint extraction and entity recognition system, even
when very little labeled data is available. We treat the
problem of extracting the datapoints from unstructured text as
a sequence labeling problem and make use of techniques
from Named Entity Recognition (NER) and sequence
labeling research. Recent efforts in NER research have
focused on neural architectures
        <xref ref-type="bibr" rid="ref4 ref10 ref7">(Chiu and Nichols 2016;
Lample et al. 2016; Dernoncourt, Lee, and Szolovits 2017a)</xref>
        . These neural methods require large amounts of training data.
Therefore, our motivation is to develop techniques for low
resource datasets.</p>
        <p>Studies have shown that the transfer learning technique
improves the overall performance of the model when there is
limited labeled training data. Transfer learning is a
technique where a neural architecture is first trained on a large
dataset (the source dataset) and the learned parameters are
used to initialize the weights of the target model.</p>
        <p>In our work, we experiment with a Bi-directional Long
Short-Term Memory (Bi-LSTM) architecture which works
well on low resource datasets. We also develop a novel
mechanism to introduce domain-specific knowledge to the
neural architecture. Additionally, we show that transfer
learning from a pretrained model improves the performance
of the models.</p>
        <p>Our experiments on 4 financial datasets, including three
low resource datasets (Custodian, Asset Manager, and
Leverage Ratio), confirm that our architecture works well under
low resource conditions.</p>
        <p>The key contributions of this paper are:</p>
        <list list-type="bullet">
          <list-item><p>A neural architecture for introducing domain knowledge into the network</p></list-item>
          <list-item><p>A study on transfer learning for sequence labeling in a low resource scenario</p></list-item>
        </list>
        <p>Our paper is organized as follows. First, we discuss recent
works in sequence labeling, low resource deep learning and
finance. Second, we describe the datasets and the
methodology used for creating the 3 datasets used in our
experiments. We then describe the neural architecture used in our
experiments. Next, we detail our experiments and results.
We perform an ablation study to understand the influence of
each layer in the network, with and without transfer
learning. Lastly, we conclude the paper with a discussion of our
work and potential future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Traditionally, sequence labeling problems like NER and Part
of Speech Tagging have used Maximum Entropy Models
and hand-crafted features
        <xref ref-type="bibr" rid="ref13 ref3">(Mikheev, Moens, and Grover
1999; Bender, Och, and Ney 2003)</xref>
        . The use of neural
networks for NER was popularized by
        <xref ref-type="bibr" rid="ref6">(Collobert et al. 2011)</xref>
        .
Since then, there have been several improvements to the
neural architecture for identifying named entities
        <xref ref-type="bibr" rid="ref21">(Yadav and
Bethard 2018)</xref>
        . Most competitive NER systems use a
Bidirectional Long Short Term Memory (Bi-LSTM) over the
word and character embeddings, which closely resembles
the architecture described in
        <xref ref-type="bibr" rid="ref10">(Lample et al. 2016)</xref>
        .
      </p>
      <p>
        <xref ref-type="bibr" rid="ref10">(Lample et al. 2016)</xref>
        concatenate word embeddings with
a Bi-LSTM over the characters of a word. Then, they pass
these embeddings through a sentence level Bi-LSTM and a
Conditional Random Field (CRF) layer to produce the
labels.
        <xref ref-type="bibr" rid="ref8">(Dernoncourt, Lee, and Szolovits 2017b)</xref>
        implement
a similar architecture in their software - NeuroNER. We
draw inspiration from
        <xref ref-type="bibr" rid="ref10">(Lample et al. 2016)</xref>
        and
        <xref ref-type="bibr" rid="ref8">(Dernoncourt, Lee, and Szolovits 2017b)</xref>
        for our model architecture.
      </p>
      <p>
        These networks can be trained on a large dataset and then
fine-tuned for a target dataset. Recent efforts in Transfer
Learning have yielded positive results in NLP Tasks
        <xref ref-type="bibr" rid="ref14 ref23 ref15">(Mou
et al. 2016; Young Lee, Dernoncourt, and Szolovits 2017;
Newman-Griffis and Zirikly 2018)</xref>
        .
      </p>
      <p>
        <xref ref-type="bibr" rid="ref14">(Mou et al. 2016)</xref>
        conduct a thorough study on the
transferability of neural networks in NLP. Their findings indicate
that word embeddings trained on a source dataset are
transferable to a semantically different task.
      </p>
      <p>
        <xref ref-type="bibr" rid="ref23">(Young Lee, Dernoncourt, and Szolovits 2017)</xref>
        use
transfer learning techniques for de-identification of Protected
Health Information (PHI) in Electronic Health Records
(EHR). They train a sequence labeling model on two
datasets - i2b2 2014 and i2b2 2016. They successfully
demonstrate that transferring parameters from an
out-of-domain model outperforms the state of the art results. A key
finding from their analysis was that transferring the
parameters from the lower layers of a pretrained model was almost
as efficient as transferring the parameters from the entire
network.
      </p>
      <p>
        Our work in financial data extraction closely relates to
        <xref ref-type="bibr" rid="ref1">(Alvarado, Verspoor, and Baldwin 2015)</xref>
        . In their
experiments, they use a Conditional Random Field (CRF) and
manually choose features. They train their model on an
out-of-domain dataset
        <xref ref-type="bibr" rid="ref20">(Tjong Kim Sang and De Meulder 2003)</xref>
        and perform domain adaptation on the target dataset. Their
results indicate that training only with a small in-domain
dataset is better than training with a large out-of-domain
dataset and a small in-domain dataset together.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>
        We use five datasets in our experiments. For training the
out-of-domain model3, we use the CoNLL 2003 English dataset
        <xref ref-type="bibr" rid="ref20">(Tjong Kim Sang and De Meulder 2003)</xref>
        . We use the following
financial datasets in our experiments: (1) the Financial NER
dataset
        <xref ref-type="bibr" rid="ref1">(Alvarado, Verspoor, and Baldwin 2015)</xref>
        , (2)
Custodian, (3) Asset Manager, and (4) Leverage Ratio. The
Financial NER dataset is an open source named entities dataset.
Custodian, Asset Manager and Leverage Ratio are
internal datasets. We provide detailed descriptions of these
datasets in the next section.
      </p>
      <sec id="sec-3-1">
        <title>Financial NER Dataset</title>
        <p>
          <xref ref-type="bibr" rid="ref1">(Alvarado, Verspoor, and Baldwin 2015)</xref>
          create their dataset
by annotating financial agreements made public through U.S.
Securities and Exchange Commission (SEC) filings. They
annotate a total of 8 documents for LOCATION,
ORGANIZATION, PERSON and MISCELLANEOUS.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Custodian, Asset Manager and Leverage Ratio</title>
        <p>To test our model in the wild, we collected mutual fund
prospectus documents which are publicly available on the
internet. These documents are fairly large in size (varying
from 80 to 300 pages) and have no discernible patterns
which can be used by a heuristic system. The documents
were collected from the websites of individual fund houses
(e.g. BlackRock4) or investment research services (e.g.
Morningstar5). From these documents we identify a few key
datapoints, such as Custodian, Asset Manager and Leverage
Ratio, which are relevant to organizations dealing with such
documents. Our task was to extract the correct entities for
each of these datapoints from candidate sentences retrieved
from the source document.</p>
        <p>In order to create the datasets for Custodian, Asset
Manager and Leverage Ratio, we use a proprietary tool to
identify parts of the PDF, such as the table of contents, section
headings, keywords, etc., and to localize the approximate region
of interest where the datapoint could be present. Then,
domain experts manually annotate all candidate sentences,
identifying the correct datapoints.</p>
        <p>In Table 1, we describe all the datasets used in our paper.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Model Architecture</title>
      <p>
        Our proposed model uses two Bi-LSTM layers - character
and word - and a domain-specific Bi-LSTM layer. First, we
have the character embedding layer, which feeds into the
character Bi-LSTM layer. Then, the output of the character
Bi-LSTM layer is concatenated with the word embeddings.
We also concatenate the output of the domain-specific layer
with the word embedding. We use GloVe word embeddings
        <xref ref-type="bibr" rid="ref18">(Pennington, Socher, and Manning 2014)</xref>
        . The concatenated
word embedding is passed through a word Bi-LSTM layer.
The output of this layer is passed to the projection layer,
followed by a Conditional Random Field (CRF) layer
to generate the output. Our model is shown in Figure 1.
      </p>
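      <p>As an illustration, the following is a minimal PyTorch sketch of the
architecture described above; the class name and dimensions are
hypothetical, and the CRF layer is omitted (it would be applied over the
emission scores, e.g. via an external CRF module).</p>
      <preformat>
import torch
import torch.nn as nn

class DexterTagger(nn.Module):
    """Sketch: char Bi-LSTM and domain Bi-LSTM outputs are concatenated
    with word embeddings, then fed to a word Bi-LSTM and projection."""

    def __init__(self, n_chars, n_words, n_feats, n_tags,
                 char_dim=25, word_dim=100, feat_dim=25, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim,
                                 bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)  # init from GloVe
        self.feat_emb = nn.Embedding(n_feats, feat_dim)  # domain features
        self.feat_lstm = nn.LSTM(feat_dim, feat_dim,
                                 bidirectional=True, batch_first=True)
        in_dim = word_dim + 2 * char_dim + 2 * feat_dim
        self.word_lstm = nn.LSTM(in_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.projection = nn.Linear(2 * hidden, n_tags)

    def forward(self, chars, words, feats):
        # chars: (batch, seq, word_len); words, feats: (batch, seq)
        b, s, c = chars.shape
        ch = self.char_emb(chars).view(b * s, c, -1)
        _, (h, _) = self.char_lstm(ch)         # final fwd/bwd char states
        ch = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        dom, _ = self.feat_lstm(self.feat_emb(feats))
        x = torch.cat([self.word_emb(words), ch, dom], dim=-1)
        out, _ = self.word_lstm(x)
        return self.projection(out)            # emission scores for the CRF
      </preformat>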
      <sec id="sec-4-1">
        <title>Domain Specific Knowledge Engineering</title>
        <p>We observed that the correct named entities are often
accompanied by dataset-specific keywords. Consider the following
example from the Asset Manager dataset:</p>
        <p>Since January 1, 2002, the Fund is managed by
Fideuram Gestions S.A. (the Management Company), a
Luxembourg company, controlled by Banca Fideuram
S.p.A. (Intesa Sanpaolo Group).6</p>
        <p>From the above sentence, we observe that the correct
named entity is ‘Fideuram Gestions S.A.’ and it is
accompanied by the keyword ‘Management Company’, which is a
known synonym for the Asset Manager. The datapoint
Asset Manager has several other keywords, such as Investment
Advisor, Investment Manager, etc. These keywords are
different for Custodian, Leverage Ratio and Financial NER.</p>
        <p>In order to introduce this domain knowledge into our
neural network, we encode this information as embeddings and
pass it to a Bi-LSTM layer. The output of the Bi-LSTM
network is concatenated with the word embedding.</p>
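        <p>As a concrete illustration, the following is a hypothetical sketch of
how such keyword knowledge could be turned into a per-token feature that is
then embedded and passed to the domain Bi-LSTM; the gazetteer and function
name are illustrative, not our production implementation.</p>
        <preformat>
# Hypothetical keyword gazetteer for the Asset Manager datapoint.
ASSET_MANAGER_KEYWORDS = [
    "management company", "investment advisor", "investment manager",
]

def domain_feature_ids(tokens):
    """Mark tokens inside a known domain keyword phrase (1) or outside (0)."""
    ids = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase in ASSET_MANAGER_KEYWORDS:
        words = phrase.split()
        for i in range(len(lowered) - len(words) + 1):
            if lowered[i:i + len(words)] == words:
                ids[i:i + len(words)] = [1] * len(words)
    return ids

# e.g. domain_feature_ids(["the", "Management", "Company"]) -> [0, 1, 1];
# these ids index into the domain embedding table of the model.
        </preformat>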
      </sec>
      <sec id="sec-4-2">
        <title>Transfer Learning</title>
        <p>
          Our transfer learning approach is similar to the methods
followed by
          <xref ref-type="bibr" rid="ref23">(Young Lee, Dernoncourt, and Szolovits 2017)</xref>
          ,
where we transfer the parameters of different layers from
the pretrained model to the target model. We transfer the
parameters of the character embeddings and word embeddings.
In case we do not perform transfer learning, we randomly
initialize the character embeddings and domain-specific
embeddings and use GloVe embeddings for the words.
        </p>
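        <p>A minimal sketch of this parameter transfer, assuming two instances of
the DexterTagger sketch above built with shared character and word
vocabularies (the function name is illustrative):</p>
        <preformat>
def transfer_layers(source, target, layers=("char_emb", "word_emb")):
    """Copy the named layers' parameters from the source (pretrained)
    model into the target model before fine-tuning."""
    for name in layers:
        getattr(target, name).load_state_dict(
            getattr(source, name).state_dict())

# e.g. transfer_layers(pretrained, model)  # the Word + Character setting
        </preformat>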
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Setup</title>
      <p>In our study, we experiment by transferring parameters at
various layers from an out-of-domain model. The Baseline
model is trained only on the in-domain dataset (only the
Custodian, Asset Manager, Leverage Ratio, or Financial NER
dataset). We train this model with the same architecture
described in Figure 1, but without the domain-specific features.</p>
      <p>
        For the pretrained model, we train a Baseline model
on the CoNLL 2003 English dataset
        <xref ref-type="bibr" rid="ref20">(Tjong Kim Sang and
De Meulder 2003)</xref>
        . We achieve an F1 of 89.30 on the CoNLL
2003 test set. All the results in our experiments are obtained
by transferring the parameters from this pretrained model.
      </p>
      <p>In our experiments, we transfer the following layers:
(1) word embeddings (Word), (2) character embeddings
(Character), and (3) the projection layer (Projection). We
additionally activate the domain-specific features in our
network (Domain).</p>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>
        We describe our results on the Custodian, Asset Manager
and Financial NER datasets in Table 2. It can be observed
that, for the Custodian and Asset Manager datasets, the best
performing models have transferred parameters from the word
and character embeddings, along with the domain-specific
features. From Table 2, it is also evident that our neural
architecture, even without transfer learning, outperforms the
reported state of the art results on the Financial NER dataset7.
Our best performing model, which makes use of transferred
word and character embeddings, achieves an F1 of 87.48 on the
Financial NER dataset. The results in Table 3 suggest that the
domain-specific layer enhances the model’s performance.
      </p>
      <p>
        We observe that on all the datasets, the domain-specific
features improve over the baseline F1. However, in the case
of the Financial NER dataset, we note that the best
performing system is the one where the word and character
embedding layers are transferred. This observation is consistent
with the findings reported in
        <xref ref-type="bibr" rid="ref23">(Young Lee, Dernoncourt, and Szolovits
2017)</xref>
        , where the lower layers contribute the
greatest improvement to the model. We also find that
including the final layer, i.e. the task-dependent layer, decreases
the performance.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>
        For our future work, we would like to combine our word
embeddings with ELMo embeddings
        <xref ref-type="bibr" rid="ref19">(Peters et al. 2018)</xref>
        and
BERT embeddings
        <xref ref-type="bibr" rid="ref9">(Devlin et al. 2018)</xref>
        . We intend to
introduce document-level metadata such as PDF layout, and local
meta information such as bold, underline and italics, into the
domain-specific layer.
      </p>
      <p>
        Our work can be extended to clinical texts, where
annotating data is very expensive. Our work closely relates to
Multi-Task Learning (MTL). Recent works have shown promise in
multi-task learning for sequence labeling problems in
low resource scenarios
        <xref ref-type="bibr" rid="ref16 ref11">(Peng and Dredze 2017; Lin et al.
2018)</xref>
        .
      </p>
      <p>In conclusion, we demonstrate a Bi-LSTM architecture
for low resource datasets. Our experiments consistently
show that transfer learning combined with domain-specific
knowledge engineering improves entity recognition in a low
resource setting.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We would like to thank our anonymous reviewers for their
helpful feedback in improving our work. We wish to thank
Arjun Rao for internally reviewing the paper. Lastly, we
thank the Stride.AI team for their valuable inputs in the
research.</p>
    </sec>
    <sec id="sec-9">
      <title>Appendices</title>
      <p>In this section, we show a few sample examples
from our datasets. Refer to Tables 4, 5 and 6.</p>
      <sec id="sec-9-1">
        <title>Example</title>
        <p>The ICAV has appointed RBC Investor Services
Bank S.A to act as Depositary for the safekeeping
of all the investments, cash and other assets of the
ICAV and to ensure that the issue and repurchase
of Shares by the ICAV and the calculation of the
Net Asset Value and Net Asset Value per Share
is carried out and that all income received and
investments made are in accordance with the
Instrument of Incorporation and the UCITS
Regulations.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Explanation</title>
        <p>The custodian is RBC Investor Services
Bank S.A, which is referred to as the Depositary
in the sentence. Although ICAV and
UCITS are Organizations, they are
not the Custodian.</p>
      </sec>
      <sec id="sec-9-3">
        <title>Example</title>
        <p>Prior to joining Deutsche Bank, Barbara
was a Fund Tax Project Manager at
Dexia-BIL, Dexia Fund Services in
Luxembourg for two (2) years, and a
Senior Fund Manager for DWS
Investment S.A. (now the Management
Company) in Luxembourg for ten
(10) years.</p>
      </sec>
      <sec id="sec-9-4">
        <title>Entity</title>
        <p>DWS Investment S.A.</p>
      </sec>
      <sec id="sec-9-5">
        <title>Explanation</title>
        <p>DWS Investment S.A. is the management company,
or the asset manager, because of the phrase
“now the Management Company”. Deutsche Bank
is not the Asset Manager because the sentence
does not state that it is.</p>
      </sec>
      <sec id="sec-9-6">
        <title>Example</title>
        <p>Under normal market conditions the
level of leverage is expected to be
between 200% and 800% of the Net
Asset Value of the Fund where leverage
is calculated using the sum of the
absolute value of the notional amounts
of the FDI positions in accordance with
the “gross method” as set out in the
Commission Delegated Regulation.</p>
      </sec>
      <sec id="sec-9-7">
        <title>Explanation</title>
        <p>The example indicates that the expected
leverage or the leverage ratio is between
200% and 800%. The system should pick
both “200%” and “800%”.</p>
      </sec>
    </sec>
  </body>
  <back>
    <fn-group>
      <fn id="fn1"><label>1</label><p>Loan Agreement - https://goo.gl/8djHXe</p></fn>
      <fn id="fn2"><label>2</label><p>spaCy - https://spacy.io</p></fn>
      <fn id="fn3"><label>3</label><p>This model will be referred to as the out-of-domain model and the pretrained model interchangeably.</p></fn>
      <fn id="fn4"><label>4</label><p>BlackRock - https://goo.gl/bs3vU3</p></fn>
      <fn id="fn5"><label>5</label><p>Morningstar - https://www.morningstar.com/</p></fn>
      <fn id="fn6"><label>6</label><p>Fideuram Fund - https://goo.gl/UDQqiA</p></fn>
      <fn id="fn7"><label>7</label><p><xref ref-type="bibr" rid="ref1">(Alvarado, Verspoor, and Baldwin 2015)</xref> report an F1 of 82.7.</p></fn>
    </fn-group>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Alvarado, J. C. S.; Verspoor, K.; and Baldwin, T. 2015. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, 84-90.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Bender, O.; Och, F. J.; and Ney, H. 2003. Maximum entropy models for named entity recognition. In Daelemans, W., and Osborne, M., eds., Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 148-151.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Chiu, J., and Nichols, E. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4:357-370.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493-2537.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Dernoncourt, F.; Lee, J. Y.; and Szolovits, P. 2017a. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. Conference on Empirical Methods on Natural Language Processing (EMNLP).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Dernoncourt, F.; Lee, J. Y.; and Szolovits, P. 2017b. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 97-102. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 260-270. San Diego, California: Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>Lin, Y.; Yang, S.; Stoyanov, V.; and Ji, H. 2018. A multi-lingual multi-task architecture for low-resource sequence labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>Mikheev, A.; Moens, M.; and Grover, C. 1999. Named entity recognition without gazetteers. In EACL.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>Mou, L.; Meng, Z.; Yan, R.; Li, G.; Xu, Y.; Zhang, L.; and Jin, Z. 2016. How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 479-489. Austin, Texas: Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>Newman-Griffis, D., and Zirikly, A. 2018. Embedding transfer for low-resource medical named entity recognition: A case study on patient mobility. In Proceedings of the BioNLP 2018 workshop, 1-11. Melbourne, Australia: Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>Peng, N., and Dredze, M. 2017. Multi-task domain adaptation for sequence tagging. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 91-100. Vancouver, Canada: Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>Tjong Kim Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Daelemans, W., and Osborne, M., eds., Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142-147.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>Yadav, V., and Bethard, S. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, 2145-2158. Santa Fe, New Mexico, USA: Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>Young Lee, J.; Dernoncourt, F.; and Szolovits, P. 2017. Transfer learning for named-entity recognition with neural networks.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>