<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Extraction of Named Entities from Bank Wire Text</article-title>
      </title-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <issue>109</issue>
      <fpage>11</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>Online transactions have increased dramatically over the years due to rapid growth in digital innovation. These transactions are anonymous, so users provide some details for identification in free-text comments. These comments contain information about the entities involved and the transfer details, which are later used for log analysis. Log analysis can support fraud analytics and the detection of money laundering activities. In this paper, we discuss the challenges of entity extraction from this kind of data. We briefly explain what wire text is, what the challenges are, and why semantic information is required for entity extraction. We explore why traditional IE approaches are insufficient to solve the problem. We compare our approach with available open source tools for entity extraction and describe how it solves the problem of entity identification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper’s authors. Copying permitted for
private and academic purposes.</p>
      <p>In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia.</p>
      <p>data etc. Hence we require a system that is
robust enough to deal with degraded
and unstructured text rather than natural language
text with correct spelling, punctuation and grammar.
Existing information extraction methods cannot
meet these requirements, as most
information extraction tasks work over natural language
text. Since the linguistic context is missing in
unstructured text, it is difficult to extract entities
from it. Because the usual features are based on natural language,
the task requires semantic processing capabilities to
understand the hidden meaning of the content using
dictionaries, ontologies etc.</p>
      <p>Wire text is an example of this kind of text: it
is unformatted and ungrammatical. It can
mix upper-case and lower-case letters. For
example, people generally write comments in short
form and use multiple abbreviations. Bank wire text
can take the following form:</p>
    </sec>
    <sec id="sec-2">
      <preformat>EVERITT 620122T NAT ABC INDIA LTD
REF ROBERT REASON SHOP RENTAL
REF 112233999 - REASON SPEEDING FINE
GEM SS HEUTIGEM SCHIENDLER
PENSION CH1234 CAB28</preformat>
      <p>There are two major challenges in building a
machine learning model for wire text:
• Non-availability of data sets due to confidentiality
• Non-contextual representation of text</p>
      <p>Identifying entities in this kind of text
therefore requires special pre-processing
that uses the semantic information of the content. In this paper,
we discuss a solution for extracting entities from such
text. We evaluate our approach on bank wire
transfer text and use the WordNet taxonomy to
identify the semantics of each keyword. The
paper is organized as follows. In Section 2
we discuss available methods of entity extraction. In
Section 3 we describe the algorithm and the
components involved. In Section 4 we present the
experimental results and a comparison with open source utilities.
Section 5 presents the conclusion and future work.</p>
      <sec id="sec-2-1">
        <title>Background</title>
        <p>Supervised machine learning techniques are the primary
solutions to the named entity recognition (NER)
problem, and they require annotated data. Supervised
methods either learn disambiguation rules based on
discriminative features or try to learn the parameters
of an assumed distribution that maximizes the likelihood
of the training data. Conditional Random Fields [SM12]
are a discriminative approach that solves the problem
through sequence tagging. Other supervised
learning models such as the Hidden Markov Model (HMM) [RJ86],
Decision Trees, Maximum Entropy Models (ME) and
Support Vector Machines (SVM) are also used to solve the
classification problem. HMM is the earliest model
applied to the NER problem, by Bikel et al. [BSW99] for
English. Bikel et al. introduced a system, IdentiFinder, that
performs NER using HMM as a generative model.
Curran and Clark [CC03] applied the maximum entropy
model to the named entity recognition problem and
formulated it with the softmax approach. McNamee
and Mayfield [MMP03] treat the problem as a set of binary
decisions: whether a word belongs to one of
8 classes, namely the B- (Beginning) and I- (Inside) tags for person,
organization, location and misc. Thus 8
classifiers are trained for this purpose. Because
wire text is unavailable, it is difficult to create tagged
content; hence supervised approaches are not able to
solve the problem.</p>
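        <p>To make the 8-class setup concrete, the following is a minimal Python sketch of B-/I- tagging and of the per-class binary targets on which one classifier per class would be trained. The short type names (PER, ORG, LOC, MISC), the helper names and the example sentence are assumptions made for illustration; they are not taken from [MMP03].</p>
        <preformat><![CDATA[
```python
# Hypothetical short names for person, organization, location and misc.
ENTITY_TYPES = ["PER", "ORG", "LOC", "MISC"]

# Two tags (B- and I-) per entity type gives the 8 classes.
LABELS = [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def bio_encode(tokens, spans):
    """Tag tokens from (start, end_exclusive, type) spans; other tokens get "O"."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def one_vs_rest_targets(tags):
    """Binary target vector per class: 1 where the token carries that tag."""
    return {label: [int(tag == label) for tag in tags] for label in LABELS}


tokens = ["John", "Miller", "works", "at", "ABC", "India", "Ltd"]
tags = bio_encode(tokens, [(0, 2, "PER"), (4, 7, "ORG")])
print(list(zip(tokens, tags)))
print(one_vs_rest_targets(tags)["B-PER"])
```
]]></preformat>
        <p>Each of the 8 target vectors would feed one binary classifier; tokens outside every entity are negative examples for all of them.</p>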
        <p>Various unsupervised schemes have also been proposed to
solve the entity recognition problem. One suggestion is
the gazetteer-based approach, which helps in
identifying keywords from a list. KNOWITALL,
proposed by Etzioni et al. [ECD+05], is such a
system: it is domain independent and extracts information from the
web in an unsupervised, open-ended manner. It uses
8 domain-independent extraction patterns to
generate candidate facts. Gupta and Manning [GM14] proposed a
system that generates seed candidates through local,
cross-language edit likelihood and then bootstraps to
make broad predictions across two languages,
optimizing combined contextual, word-shape and alignment
models.</p>
        <p>Semantic approaches also exist for named entity
extraction. [MNPT02] used the WordNet specification
to identify the WordClass and WordInstances lists for
each word and classify it based on predefined rules,
but those lists are limited. [Sie15] uses a word2vec
representation of words to define the semantics between
words, which enhances classification accuracy. It
uses a continuous skip-gram model, which requires heavy
computation for learning word vectors. [ECD+05]
specify gazetteer-based features as external
knowledge for good performance. Given these findings,
several approaches have been proposed to automatically
extract comprehensive gazetteers from the web and
from large collections of unlabeled text [ECD+04], with
limited impact on NER. Kazama and Torisawa [KT07]
successfully constructed high-quality, high-coverage
gazetteers from Wikipedia.</p>
        <p>In this paper, we propose the semantic
disambiguation of named entities using WordNet and a gazetteer.
Our approach is based on pre-processing the text
before passing it to a named entity recognizer.
Named entity recognition relies on multiple features
related to the structural representation of entities,
so proper case information plays a valuable role
in determining the entity type. For example, in English a person name is
generally written in Camel Case and an organization name
in capitalized (upper-case) form. Our
approach is based on these orthographic properties of entities. It
converts the input data using WordNet
after looking into the semantics of each word, and
provides the converted output to an existing NER.
Named entities are more likely to be extracted from the
converted text. We therefore propose an
intermediate layer, the Pre-Processor, as shown in
Figure 1. The Pre-Processor contains three major
components, WordnetMatcher, GazetteerMatcher and
CaseConverter, whose purpose is to match the text
efficiently against the given content lists and to convert the
text to the required case. LowerCaseConverter,
CamelCaseConverter and UpperCaseConverter are instances
of CaseConverter.</p>
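        <p>As an illustration of the CaseConverter component, the three converters can be sketched in Python as below. Only the class names come from the paper; the convert method and the exact casing rules (for instance, Camel Case meaning an initial capital followed by lower case, as in John Miller) are assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
class CaseConverter:
    """Base converter; subclasses rewrite a single token into a target case."""

    def convert(self, token: str) -> str:
        raise NotImplementedError


class LowerCaseConverter(CaseConverter):
    def convert(self, token: str) -> str:
        return token.lower()


class CamelCaseConverter(CaseConverter):
    def convert(self, token: str) -> str:
        # Initial capital, remaining letters lower case: "ROBERT" -> "Robert".
        return token[:1].upper() + token[1:].lower()


class UpperCaseConverter(CaseConverter):
    def convert(self, token: str) -> str:
        return token.upper()


for converter in (LowerCaseConverter(), CamelCaseConverter(), UpperCaseConverter()):
    print(type(converter).__name__, converter.convert("rOBERT"))
```
]]></preformat>
        <p>Keeping the converters behind one interface lets the Pre-Processor choose the target case per token without knowing how the conversion is done.</p>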
        <p>The Tokenizer’s main job is to split a sentence into
tokens. The Named Entity Recognizer is used to extract
the named entities.</p>
        <p>We used WordNet [Mil95], which provides
information about synsets. The English version contains
129505 words organized into 99642 synsets. In
WordNet, two kinds of relations are distinguished: semantic
relations (IS-A, part-of etc.), which hold among
synsets, and lexical relations (synonymy, antonymy),
which hold among words. Our gazetteer contains
dictionaries of person names, organization names,
locations etc. Our approach works according to the
following algorithm.</p>
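        <p>The pre-processing loop that Algorithm 1 formalizes can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the nesting of the conditions follows the order of the statements in the Algorithm 1 listing, the gazetteer sets are tiny stand-ins, and toy_wordnet_matcher is a hypothetical stub that replaces the real WordNet lookup so that the example runs without external data.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Iterable, List, Set

# Stand-in for WordNet: maps a few dictionary words to made-up synset ids.
TOY_SYNSETS = {
    "reason": ["reason.n.01"],
    "shop": ["shop.n.01"],
    "rental": ["rental.n.01"],
}


def toy_wordnet_matcher(word: str) -> List[str]:
    """Hypothetical WordNetMatcher: returns the synsets of a word, or []."""
    return TOY_SYNSETS.get(word, [])


def camel_case(word: str) -> str:
    return word[:1].upper() + word[1:].lower()


def preprocess(
    tokens: Iterable[str],
    wordnet_matcher: Callable[[str], List[str]],
    names: Set[str],
    organizations: Set[str],
    locations: Set[str],
    ignore: Set[str],
) -> List[str]:
    """Re-case each token so that a downstream NER sees conventional casing."""
    converted = []
    for token in tokens:
        word = token.lower()                       # LowerCaseConverter
        if word not in ignore:
            synsets = wordnet_matcher(word)        # WordNetMatcher
            if synsets:                            # dictionary word
                if word in names:
                    word = camel_case(word)        # CamelCaseConverter
            elif word in organizations or word in locations:
                word = word.upper()                # UpperCaseConverter
            else:
                word = camel_case(word)            # CamelCaseConverter
        converted.append(word)
    return converted


line = "EVERITT 620122T NAT ABC INDIA LTD".split()
print(preprocess(line, toy_wordnet_matcher,
                 names={"robert"}, organizations={"abc", "ltd"},
                 locations={"india"}, ignore={"of", "the"}))
```
]]></preformat>
        <p>The matcher is passed in as a function so that the stub can be swapped for a real lookup; with NLTK and its WordNet corpus installed, nltk.corpus.wordnet.synsets would be one candidate. The converted tokens would then be joined and handed to the Named Entity Recognizer.</p>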
        <sec id="sec-2-1-1">
          <title>Approach</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Algorithm 1: Semantic NER</title>
      <preformat>Input : Sentence S as collection of words W
        and gazetteers ListNames, ListOrganization, ListLocation, ListIgnore
Output: Set of entities e_i ∈ E

for each w_i ∈ S do
    w_i ← LowerCaseConverter(w_i)
    if w_i ∉ ListIgnore then
        synsets[] ← WordNetMatcher(w_i)
        if synsets[] ∉ Empty then
            if w_i ∈ ListNames then
                w_i ← CamelCaseConverter(w_i)
            end if
        else
            if w_i ∈ ListOrganization or w_i ∈ ListLocation then
                w_i ← UpperCaseConverter(w_i)
            else
                w_i ← CamelCaseConverter(w_i)
            end if
        end if
    end if
end for
(e_i) ← NamedEntityRecognizer(S)</preformat>
      <p>Our algorithm works by looking up the pre-defined
lists in multiple steps. Each word in the input is
first converted to lower case and then checked
against the ignore list, which contains pronouns,
prepositions, conjunctions and determiners. If the word is in the ignore list,
we ignore it. Otherwise we pass the lower-case word
to the WordNet API to get its list of synsets. If the synsets
are non-empty, the word is likely to have some
meaning, so it is checked against the names list first;
if found, it is converted to Camel Case, for example John Miller
or Robert Brown. If it is not found in the names list, it is then
checked against the organization list and the location list. If a match
is found, the word is converted to Upper Case; otherwise it is converted to
Camel Case. The pre-processed text now has a
meaningful representation of entities, and it is
passed to the Named Entity Recognizer to extract the
entities from the converted text.</p>
      <sec id="sec-3-1">
        <title>Model Description</title>
        <p>Our Named Entity Recognizer is based on
Conditional Random Fields [SM12], which are a discriminative
model. We used the ClearTK library [BOB14] for model
generation, which uses MALLET internally for
implementation. Conditional random fields (CRFs) are a
probabilistic framework for labeling and segmenting
sequential data, based on the conditional approach.</p>
        <p>Lafferty et al. [LMP+01] define the probability of a
particular label sequence y given an observation sequence
x to be a normalized product of potential functions,
each of the form</p>
        <disp-formula><tex-math>\exp\Big(\sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i)\Big)</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>t_j(y_{i-1}, y_i, x, i)</tex-math></inline-formula> is a transition feature
function of the entire observation sequence and the labels
at positions i and i-1 in the label sequence; <inline-formula><tex-math>s_k(y_i, x, i)</tex-math></inline-formula>
is a state feature function of the label at position i and
the observation sequence; and <inline-formula><tex-math>\lambda_j</tex-math></inline-formula> and <inline-formula><tex-math>\mu_k</tex-math></inline-formula> are
parameters to be estimated from training data.</p>
        <p>When defining feature functions, we construct a set
of real-valued features b(x, i) of the observation to
express some characteristic of the empirical
distribution of the training data that should also hold of the
model distribution. An example of such a feature is:</p>
        <disp-formula><tex-math><![CDATA[b(x, i) = \begin{cases} 1 & \text{if the observation at position } i \text{ is ``Person''} \\ 0 & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
        <p>Each feature function takes on the value of one of
these real-valued observation features b(x, i) if the
current state (in the case of a state function) or previous
and current states (in the case of a transition
function) take on particular values. All feature functions
are therefore real-valued. For example, consider the
following transition function:</p>
        <disp-formula><tex-math>t_j(y_{i-1}, y_i, x, i) = b(x, i)</tex-math></disp-formula>
        <p>and let</p>
        <disp-formula><tex-math>F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)</tex-math></disp-formula>
        <p>where each <inline-formula><tex-math>f_j(y_{i-1}, y_i, x, i)</tex-math></inline-formula> is either a state
function <inline-formula><tex-math>s(y_i, x, i)</tex-math></inline-formula> or a transition function <inline-formula><tex-math>t(y_{i-1}, y_i, x, i)</tex-math></inline-formula>.
This allows the probability of a label sequence y
given an observation sequence x to be written as</p>
        <disp-formula><tex-math>p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j F_j(y, x)\Big)</tex-math></disp-formula>
        <p>where Z(x) is a normalization factor.</p>
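        <p>The normalized form of p(y | x, λ) can be checked numerically on a toy example. The sketch below is only an illustration of the formula and is not how ClearTK or MALLET compute it: the two labels, the two feature functions, the weights and the START padding label are all made up, and Z(x) is obtained by enumerating every label sequence, which is feasible only for very short inputs.</p>
        <preformat><![CDATA[
```python
import itertools
import math

LABELS = ["O", "PER"]  # hypothetical label set for this sketch


def b_is_capitalized(x, i):
    """Observation feature b(x, i): 1 if token i starts with an upper-case letter."""
    return 1.0 if x[i][:1].isupper() else 0.0


def state_per_capitalized(y_prev, y_cur, x, i):
    """State function: takes the value b(x, i) when the current label is PER."""
    return b_is_capitalized(x, i) if y_cur == "PER" else 0.0


def trans_per_per(y_prev, y_cur, x, i):
    """Transition function: takes the value b(x, i) when PER follows PER."""
    return b_is_capitalized(x, i) if (y_prev, y_cur) == ("PER", "PER") else 0.0


FEATURES = [state_per_capitalized, trans_per_per]  # the f_j
WEIGHTS = [1.5, 0.8]                               # the lambda_j (made up)


def global_feature(f, y, x):
    """F_j(y, x): f_j summed over all positions; position 0 sees a START label."""
    padded = ["START"] + list(y)
    return sum(f(padded[i], padded[i + 1], x, i) for i in range(len(x)))


def score(y, x):
    """Unnormalized score exp(sum_j lambda_j * F_j(y, x))."""
    return math.exp(sum(w * global_feature(f, y, x) for w, f in zip(WEIGHTS, FEATURES)))


def probability(y, x):
    """p(y | x, lambda), with Z(x) computed by brute-force enumeration."""
    z = sum(score(candidate, x) for candidate in itertools.product(LABELS, repeat=len(x)))
    return score(tuple(y), x) / z


x = ["Robert", "Brown", "rental"]
print(probability(["PER", "PER", "O"], x))
print(probability(["O", "O", "O"], x))
```
]]></preformat>
        <p>Practical CRF implementations replace the enumeration with dynamic programming over the label lattice; the brute-force version is kept here because it mirrors the formula term by term.</p>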
      </sec>
      <sec id="sec-3-2">
        <title>Feature Extraction</title>
        <p>We used multiple syntactic and linguistic features
specific to entities. We also used a pre-defined list match
as a feature for a couple of entity types, which improves the
accuracy of our model. Our feature selection is based
on Table 1. The features are explained
as follows:
• Preceding: Number of words before the current word to be considered for
feature generation.
• Succeeding: Number of words after the current word to be considered for
feature generation.
• posTag: Part-of-speech tag as a linguistic feature.
• characterPattern: Character pattern of the
token as a feature, such as Camel Case, Numeric, AlphaNumeric
etc.
• isCapital: True if all the letters are
capitalized.
• xxxList: Specific keyword list to match against
the current word; true if the word matches. For example,
orgSuffix contains a list of suffixes used in
organization names, and middleNames contains the
keywords used in middle names.</p>
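        <p>A per-token feature extractor along these lines can be sketched as follows. The feature names follow the list above, but the pattern categories, the window padding markers and the contents of ORG_SUFFIX are assumptions of this sketch; posTag is left out because it needs an external part-of-speech tagger.</p>
        <preformat><![CDATA[
```python
ORG_SUFFIX = {"ltd", "inc", "corp"}  # hypothetical orgSuffix keyword list


def character_pattern(token):
    """Coarse character pattern of a token."""
    if token.isdigit():
        return "Numeric"
    if token.isalpha():
        if token.isupper():
            return "AllCaps"
        if token[0].isupper() and token[1:].islower():
            return "CamelCase"
        if token.islower():
            return "LowerCase"
        return "MixedCase"
    if token.isalnum():
        return "AlphaNumeric"
    return "Other"


def token_features(tokens, i, preceding=2, succeeding=2):
    """Features for tokens[i], with a window of neighbouring words."""
    token = tokens[i]
    features = {
        "token": token.lower(),
        "characterPattern": character_pattern(token),
        "isCapital": token.isupper(),              # all cased letters are upper case
        "orgSuffix": token.lower() in ORG_SUFFIX,  # xxxList-style match
    }
    for k in range(1, preceding + 1):
        features[f"prev{k}"] = tokens[i - k].lower() if i - k >= 0 else "<BOS>"
    for k in range(1, succeeding + 1):
        features[f"next{k}"] = tokens[i + k].lower() if i + k < len(tokens) else "<EOS>"
    return features


tokens = "ABC India Ltd".split()
print(token_features(tokens, 2))
```
]]></preformat>
        <p>The Preceding and Succeeding settings map to the two window sizes; each additional keyword list would add one more boolean entry of the same shape as orgSuffix.</p>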
      </sec>
      <sec id="sec-3-3">
        <title>Experimentation Results</title>
        <sec id="sec-3-3-1">
          <title>Dataset</title>
          <p>We trained our NER model on the MASC (Manually
Annotated Sub-Corpus) dataset [PBFI12], which contains
93232 documents with 3232 different entities. We used
bank wire transfer text to verify the approach. Because
bank wire text is not available for
security reasons, we had to generate the test set based on our
client experience and our understanding of multiple user
scenarios. We implemented the approach in our product
[Pit], which is used by our clients.</p>
          <p>Our test dataset contains different types of comments
that are non-natural in nature. We compare the
approach with existing open source solutions such as
OpenNLP [Apa14] and Stanford NER [MSB+14],
and we show that our approach works better due
to the semantic conversion of the text. We observed
that OpenNLP is not able to detect many entities,
whereas Stanford NER is able to detect some of them.
Table 2 reports precision, recall and
accuracy for the entities Person, Location and Organization.</p>
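          <p>For reference, per-entity precision, recall and accuracy can be computed from token-level labels as in the sketch below. Whether the reported figures were computed per token or per entity mention is not stated, so the token-level choice is an assumption, and the gold and predicted labels shown are made-up toy values, not results from this paper.</p>
          <preformat><![CDATA[
```python
def evaluate(gold, predicted, entity_type):
    """Token-level precision, recall and accuracy for one entity type."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == entity_type and g == entity_type:
            tp += 1          # correctly predicted entity token
        elif p == entity_type:
            fp += 1          # predicted as entity, but gold disagrees
        elif g == entity_type:
            fn += 1          # missed entity token
        else:
            tn += 1          # correctly left alone
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return precision, recall, accuracy


# Toy labels for illustration only.
gold = ["Person", "Person", "O", "Organization", "Location"]
predicted = ["Person", "O", "O", "Organization", "Organization"]

for entity_type in ("Person", "Location", "Organization"):
    print(entity_type, evaluate(gold, predicted, entity_type))
```
]]></preformat>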
        </sec>
        <sec id="sec-3-3-2">
          <title>Conclusion &amp; Future Work</title>
          <p>We proposed an approach for the semantic
conversion of bank wire text and for extracting entities from the
converted text. Currently, we have tested our approach for
person, organization and location entities, but it is easily
extensible to other entities such as address, contact
number, email information etc. The approach uses
semantic information from WordNet for pre-processing, and it
can further be used to extract entities from similar
types of datasets such as weblogs, DB logs, transaction logs
etc.
</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>[Apa14] Apache Software Foundation. openNLP Natural Language Processing Library, 2014. http://opennlp.apache.org/.</title>
    </sec>
    <sec id="sec-5">
      <title>[BOB14] Steven Bethard, Philip Ogren, and Lee</title>
      <p>Becker. Cleartk 2.0: Design patterns for
machine learning in uima. In
Proceedings of the Ninth International
Conference on Language Resources and
Evalua[BSW99]
[CC03]
[Pit]
[Sie15]</p>
    </sec>
    <sec id="sec-6">
      <title>[PBFI12] Rebecca J Passonneau, Collin Baker,</title>
      <p>Christiane Fellbaum, and Nancy Ide. The
masc word sense sentence corpus. In
Proceedings of LREC, 2012.</p>
    </sec>
    <sec id="sec-7">
      <title>[Pit] Pitney Bowes Software CIM Suite http://www.pitneybowes.com/us/customerinformation-management.html.</title>
    </sec>
    <sec id="sec-8">
      <title>[RJ86] L. Rabiner and B. Juang. An introduction</title>
      <p>to hidden markov models. IEEE ASSP
Magazine, 3(2):4–16, Jan 1986.</p>
    </sec>
    <sec id="sec-9">
      <title>[SM12] Charles Sutton and Andrew McCallum.</title>
      <p>An introduction to conditional random
fields. Foundations and Trends in Machine
Learning, 4(1):267–373, 2012.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>tion (LREC'14)</source>
          , pages
          <fpage>3289</fpage>
          -
          <lpage>3293</lpage>
          , Reykjavik, Iceland,
          <month>5</month>
          <year>2014</year>
          .
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>. (Acceptance rate 61%)
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [BSW99]
          <string-name>
            <given-names>Daniel M</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ralph M</given-names>
            <surname>Weischedel</surname>
          </string-name>
          .
          <article-title>An algorithm that learns what's in a name</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>231</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [CC03]
          <string-name>
            <given-names>James R.</given-names>
            <surname>Curran</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>Language independent ner using a maximum entropy tagger</article-title>
          .
          <source>In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03</source>
          , pages
          <fpage>164</fpage>
          -
          <lpage>167</lpage>
          , Stroudsburg, PA, USA,
          <year>2003</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [ECD+04]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , Doug Downey,
          <string-name>
            <given-names>Ana-Maria</given-names>
            <surname>Popescu</surname>
          </string-name>
          , Tal Shaked, Stephen Soderland, Daniel S Weld, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Yates</surname>
          </string-name>
          .
          <article-title>Methods for domain-independent information extraction from the web: An experimental comparison</article-title>
          .
          <source>In AAAI</source>
          , pages
          <fpage>391</fpage>
          -
          <lpage>398</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [ECD+05]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , Doug Downey,
          <string-name>
            <given-names>Ana-Maria</given-names>
            <surname>Popescu</surname>
          </string-name>
          , Tal Shaked, Stephen Soderland, Daniel S. Weld, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Yates</surname>
          </string-name>
          .
          <article-title>Unsupervised named-entity extraction from the web: An experimental study</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>165</volume>
          (
          <issue>1</issue>
          ):
          <fpage>91</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [GM14]
          <string-name>
            <given-names>Sonal</given-names>
            <surname>Gupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Improved pattern learning for bootstrapped entity extraction</article-title>
          .
          <source>In CoNLL</source>
          , pages
          <fpage>98</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [KT07]
          <article-title>Exploiting wikipedia as external knowledge for named entity recognition</article-title>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [LMP+01] John Lafferty,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Pereira</surname>
          </string-name>
          , et al.
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the eighteenth international conference on machine learning</source>
          , ICML, volume
          <volume>1</volume>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Mil95]
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Wordnet: A lexical database for english</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          ,
          <month>November</month>
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [MMP03]
          <string-name>
            <given-names>James</given-names>
            <surname>Mayfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christine</given-names>
            <surname>Piatko</surname>
          </string-name>
          .
          <article-title>Named entity recognition using hundreds of thousands of features</article-title>
          .
          <source>In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03</source>
          , pages
          <fpage>184</fpage>
          -
          <lpage>187</lpage>
          , Stroudsburg, PA, USA,
          <year>2003</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [MNPT02]
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          , Matteo Negri, Roberto Prevete, and
          <string-name>
            <given-names>Hristo</given-names>
            <surname>Tanev</surname>
          </string-name>
          .
          <article-title>A WordNet-based approach to named entities recognition</article-title>
          .
          <source>In Proceedings of the 2002 workshop on Building and using semantic networks - Volume 11</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [MSB+14]
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Finkel,
          <string-name>
            <given-names>Steven J.</given-names>
            <surname>Bethard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Association for Computational Linguistics (ACL) System Demonstrations</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[PBFI12] [RJ86] [SM12]</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>