=Paper=
{{Paper
|id=Vol-1986/SML17_paper_5
|storemode=property
|title=Semantic Extraction of Named Entities From Bank Wire Text
|pdfUrl=https://ceur-ws.org/Vol-1986/SML17_paper_5.pdf
|volume=Vol-1986
|authors=Ritesh Ratti,Himanshu Kapoor,Shikhar Sharma,Anshul Solanki,Pankaj Sachdeva
|dblpUrl=https://dblp.org/rec/conf/ijcai/RattiKSSS17
}}
==Semantic Extraction of Named Entities From Bank Wire Text==
Semantic Extraction of Named Entities from Bank Wire Text

Ritesh Ratti, Himanshu Kapoor, Shikhar Sharma, Anshul Solanki, Pankaj Sachdeva
Pitney Bowes Software, Noida, India
ritesh.ratti@pb.com, himanshu.kapoor@pb.com, shikhar.sharma@pb.com, anshul.solanki@pb.com, pankaj.sachdeva@pb.com

Abstract

Online transactions have increased dramatically over the years due to rapid growth in digital innovation. These transactions are anonymous, so the user provides some details for identification. These comments contain information about the entities involved and the transfer details, which are later used for log analysis. Log analysis can be used for fraud analytics and for detecting money laundering activities. In this paper, we discuss the challenges of entity extraction from such data. We briefly explain what wire text is, what the challenges are, and why semantic information is required for entity extraction. We explore why traditional IE approaches are insufficient to solve the problem. We tested the approach against available open source tools for entity extraction and describe how our approach is able to solve the problem of entity identification.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia.

1 Introduction

Named entity extraction is the process of extracting entities such as Person, Location, Address and Organization from natural language text. However, named entities may also occur in non-natural text such as log data, bank transfer content and transactional data. Hence we require a system robust enough to deal with degraded and unstructured text, rather than natural language text with correct spelling, punctuation and grammar. Existing information extraction methods cannot meet these requirements, as most information extraction tasks operate over natural language text. Since the context of language is missing in unstructured text, it is difficult to extract entities from it, and since the usual features are based on natural language, semantic processing capabilities are required to understand the hidden meaning of the content using dictionaries, ontologies and similar resources.

Wire text is an example of such text: it is unformatted and non-grammatical in nature. It can mix upper-case and lower-case letters arbitrarily, and people generally write the comments in short form using multiple abbreviations. Bank wire text can look like the following:

EVERITT 620122T NAT ABC INDIA LTD
REF ROBERT REASON SHOP RENTAL
REF 112233999 - REASON SPEEDING FINE
GEM SS HEUTIGEM SCHIENDLER
PENSION CH1234 CAB28

There are two major challenges in creating a machine learning model for wire text:

• Non-availability of data sets due to confidentiality
• Non-contextual representation of text

To identify entities in such text, special pre-processing of the text using semantic information about its content is therefore required. In this paper, we discuss a solution for extracting entities from such text. We evaluate our approach on bank wire transfer text and make use of the WordNet taxonomy to identify the semantics of each keyword. The paper is arranged as follows. In Section 2 we discuss available methods of entity extraction. In Section 3 we describe the algorithm and the components involved in detail. In Section 4 we show the experimental results and a comparison with open source utilities. Section 5 presents conclusions and future work.

2 Background

Supervised machine learning techniques, which require annotated data, are the primary solutions to the named entity recognition problem. Supervised methods either learn disambiguation rules based on discriminative features or try to learn the parameters of an assumed distribution that maximizes the likelihood of the training data. Conditional random fields [SM12] are the discriminative approach to such sequence tagging problems. Other supervised learning models such as hidden Markov models (HMM) [RJ86], decision trees, maximum entropy models (ME) and support vector machines (SVM) have also been used to solve the classification problem. The HMM is the earliest model applied to the NER problem, by Bikel [BSW99] for English; Bikel introduced the system IdentiFinder, which detects named entities using an HMM as a generative model. Curran and Clark [CC03] applied the maximum entropy model to the named entity recognition problem, using the softmax approach in their formulation. McNamee and Mayfield [MMP03] tackle the problem as a binary decision problem: a word belongs to one of 8 classes (B- beginning and I- inside tags for person, organization, location and misc), so 8 classifiers are trained for this purpose. Because of the unavailability of wire text it is difficult to create tagged content, hence supervised approaches cannot solve our problem.

Various unsupervised schemes have also been proposed to solve the entity recognition problem. A common suggestion is the gazetteer-based approach, which helps in identifying keywords from a list. KNOWITALL, proposed by Etzioni [ECD+05], is such a system: it is domain independent and extracts information from the web in an unsupervised, open-ended manner, using 8 domain-independent extraction patterns to generate candidate facts. Manning [GM14] proposed a system that generates seed candidates through local, cross-language edit likelihood and then bootstraps to make broad predictions across two languages, optimizing combined contextual, word-shape and alignment models.

Semantic approaches also exist for named entity extraction. [MNPT02] used the WordNet specification to identify the WordClass and WordInstances lists for each word based on predefined rules, but those lists are limited. [Sie15] uses the word2vec representation of words to define the semantics between words, which enhances classification accuracy; it uses a continuous skip-gram model, which requires heavy computation to learn the word vectors. [ECD+05] identify gazetteer-based features as external knowledge needed for good performance. Given these findings, several approaches have been proposed to automatically extract comprehensive gazetteers from the web and from large collections of unlabeled text [ECD+04], with limited impact on NER. Kazama [KT07] successfully constructed high-quality, high-coverage gazetteers from Wikipedia.

In this paper, we propose the semantic disambiguation of named entities using WordNet and a gazetteer. Our approach is based on pre-processing the text before passing it to a named entity recognizer.

3 Algorithm

3.1 Method

Named entity recognition involves multiple features related to the structural representation of entities, so proper case information plays a valuable role in defining the entity type. For example, a person name is generally written in camel case in English, while organization names are capitalized. Our approach is based on these orthographic properties of entities: the input data is converted using WordNet after looking into the semantics of each word, and the converted output is provided to an existing NER, which is then more likely to extract the named entities. We hereby propose an intermediate layer, called the Pre-Processor, as shown in Figure 1.

Figure 1: Component Diagram
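The conversion idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the gazetteer contents and the WordNet lookup are stubbed with small in-memory sets, and the function names are ours.

```python
# Minimal sketch of the semantic case conversion described above.
# Gazetteers and the WordNet lookup are stubbed with in-memory sets;
# a real system would query WordNet and full name/organization lists.

IGNORE = {"the", "a", "an", "and", "or", "of", "to", "for"}
NAMES = {"robert", "john", "everitt"}
ORGANIZATIONS = {"abc"}
LOCATIONS = {"india"}
DICTIONARY_WORDS = {"reason", "shop", "rental", "pension", "fine",
                    "robert", "john"}          # stand-in for WordNet synsets

def has_synset(word):
    """Stand-in for a WordNet synset lookup."""
    return word in DICTIONARY_WORDS

def preprocess(sentence):
    """Re-case each token so a downstream NER sees familiar word shapes."""
    out = []
    for token in sentence.split():
        w = token.lower()                       # LowerCaseConverter
        if w not in IGNORE:
            if has_synset(w):                   # WordnetMatcher
                if w in NAMES:
                    w = w.capitalize()          # person -> Camel Case
            elif w in ORGANIZATIONS or w in LOCATIONS:
                w = w.upper()                   # org/location -> UPPER CASE
            else:
                w = w.capitalize()              # unknown token -> Camel Case
        out.append(w)
    return " ".join(out)

print(preprocess("REF ROBERT REASON SHOP RENTAL"))
# -> Ref Robert reason shop rental
```

Note how common dictionary words stay lower-cased, so the downstream NER is unlikely to tag them, while gazetteer hits receive the case shape typical of their entity type.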
The Pre-Processor contains three major components, called WordnetMatcher, GazetteerMatcher and CaseConverter, whose purpose is to match the text efficiently against the given content lists and to convert the text to the required case. LowerCaseConverter, CamelCaseConverter and UpperCaseConverter are instances of CaseConverter. The Tokenizer's job is to convert the sentence into tokens, and the Named Entity Recognizer is used to extract the named entities.

We used WordNet [Mil95], which provides information about synsets; the English version contains 129,505 words organized into 99,642 synsets. In WordNet two kinds of relations are distinguished: semantic relations (IS-A, part-of, etc.), which hold among synsets, and lexical relations (synonymy, antonymy), which hold among words. Our gazetteer contains dictionaries of person names, organization names, locations, etc. Our approach works according to the following algorithm.

3.2 Approach

Algorithm 1: Semantic NER
Input: Sentence S as a collection of words W and gazetteers ListNames, ListOrganization, ListLocation, ListIgnore
Output: Set of entities e_i ∈ E

for each w_i ∈ S do
    w_i ← LowerCaseConverter(w_i)
    if w_i ∉ ListIgnore then
        synsets[] ← WordNetMatcher(w_i)
        if synsets[] ≠ Empty then
            if w_i ∈ ListNames then
                w_i ← CamelCaseConverter(w_i)
            end if
        else
            if w_i ∈ ListOrganization or w_i ∈ ListLocation then
                w_i ← UpperCaseConverter(w_i)
            else
                w_i ← CamelCaseConverter(w_i)
            end if
        end if
    end if
end for
(e_i) ← NamedEntityRecognizer(S)

Our algorithm works by looking up the pre-defined lists in multiple steps. Each word in the input is first converted to lower case and checked against the ignore list, which contains pronouns, prepositions, conjunctions and determiners; if the word is in that list it is ignored. Otherwise the lower-case word is passed to the WordNet API to get its list of synsets. If the synsets are non-empty, the word is likely to have a meaning, so it is checked against the names list first and, if found, converted to camel case (e.g. John Miller, Robert Brown). If the word is not found in the names list, the organization and location lists are checked next; if a match is found the word is converted to upper case, otherwise to camel case. The pre-processed text now carries a meaningful representation of the entities and is passed to the Named Entity Recognizer, which extracts the entities from the converted text.

3.3 Model Description

Our named entity recognizer is based on a conditional random field [SM12], which is a discriminative model. We used the cleartk library [BOB14] for model generation, which internally uses mallet for the implementation. Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting sequential data, based on the conditional approach. Lafferty [LMP+01] defines the probability of a particular label sequence y given an observation sequence x as a normalized product of potential functions, each of the form

exp( Σ_j λ_j t_j(y_{i-1}, y_i, x, i) + Σ_k μ_k s_k(y_i, x, i) )

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and of the labels at positions i and i-1 in the label sequence; s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence; and λ_j and μ_k are parameters to be estimated from the training data.

When defining feature functions, we construct a set of real-valued features b(x, i) of the observation that express some characteristic of the empirical distribution of the training data which should also hold in the model distribution. An example of such a feature: b(x, i) is 1 if the observation at position i is "Person" and 0 otherwise. Each feature function takes the value of one of these real-valued observation features b(x, i) if the current state (in the case of a state function) or the previous and current states (in the case of a transition function) take particular values; all feature functions are therefore real-valued. For example, consider the transition function

t_j(y_{i-1}, y_i, x, i) = b(x, i)
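The potential-function form above can be made concrete with a toy numeric example. The feature definitions and weight values below are invented for illustration only; a trained model estimates its λ and μ parameters from data.

```python
import math

# Toy illustration of the CRF potential
#   exp( lambda * t_j(y_{i-1}, y_i, x, i) + mu * s_k(y_i, x, i) )
# at a single position i. All definitions and weights are illustrative.

def t_person_after_person(y_prev, y_cur, x, i):
    # Transition feature: 1 if a PERSON label follows a PERSON label.
    return 1.0 if (y_prev, y_cur) == ("PERSON", "PERSON") else 0.0

def s_camel_case_person(y_cur, x, i):
    # State feature: 1 if the token is camel-cased and labeled PERSON.
    token = x[i]
    return 1.0 if y_cur == "PERSON" and token[0].isupper() and token[1:].islower() else 0.0

lam, mu = 0.8, 1.5            # example parameter values
x = ["Robert", "Brown"]       # pre-processed (camel-cased) tokens
y_prev, y_cur, i = "PERSON", "PERSON", 1

potential = math.exp(lam * t_person_after_person(y_prev, y_cur, x, i)
                     + mu * s_camel_case_person(y_cur, x, i))
print(round(potential, 3))    # exp(0.8 + 1.5) = exp(2.3)
```

This shows why the case conversion matters: the camel-cased token fires the state feature, raising the potential of the PERSON label at that position.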
Summing a feature function over the whole sequence gives the global feature

F_j(y, x) = Σ_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)

where each f_j(y_{i-1}, y_i, x, i) is either a state function s_k(y_i, x, i) or a transition function t_j(y_{i-1}, y_i, x, i). This allows the probability of a label sequence y given an observation sequence x to be written as

p(y|x, λ) = (1/Z(x)) exp( Σ_j λ_j F_j(y, x) )

where Z(x) is a normalization factor.

3.4 Feature Extraction

We used multiple syntactic and linguistic features specific to each entity type. We also used pre-defined list matches as a feature for a couple of entity types, which improves the accuracy of our model. Our feature selection is summarized in Table 1. The features are as follows:

• Preceding: number of words before the current word considered for feature generation.
• Succeeding: number of words after the current word considered for feature generation.
• posTag: part-of-speech tag as a linguistic feature.
• characterPattern: character pattern of the token, such as camel case, numeric or alphanumeric.
• isCapital: true if all letters of the token are capitalized.
• xxxList: a specific keyword list matched against the current word; true if the word matches. For example, orgSuffixList contains suffixes used in organization names, and middleNamesList contains keywords used in middle names.

Table 1: Features used for NER

Entity Type    Features
Person         preceding = 1, succeeding = 2, posTag, characterPattern, middleNamesList
Location       preceding = 3, succeeding = 3, characterPattern, isCapital
Organization   preceding = 3, succeeding = 3, posTag, characterPattern, orgSuffixList

4 Experimentation Results

4.1 Dataset

We trained our NER model on the MASC (Manually Annotated Sub-Corpus) dataset [PBFI12], which contains 93232 documents with 3232 different entities. We used bank wire transfer text to verify the approach. Because bank wire text is unavailable for security reasons, we had to generate a test set based on our client experience and an understanding of multiple user scenarios. We implemented the approach in our product [Pit], which is used by our clients.

4.2 Comparison

Our test dataset contains different types of comments which are non-natural in nature. We compared the approach with existing open source solutions, Open NLP [Apa14] and Stanford NER [MSB+14], and we attribute the better results of our approach to the semantic conversion of the text. We observed that Open NLP was not able to detect many entities, while Stanford NER detected some of them. Table 2 reports precision, recall and accuracy for the entity types Person, Location and Organization.

Table 2: Comparison Results

Entity Type    Approach        Precision   Recall   Acc.
Person         Our Approach    0.65        0.306    0.27
               Stanford-NER    0.23        0.175    0.12
Location       Our Approach    0.88        0.57     0.53
               Stanford-NER    0.71        0.58     0.51
Organization   Our Approach    0.18        0.32     0.28
               Stanford-NER    0.03        0.018    0.012

5 Conclusion & Future Work

We have proposed an approach for the semantic conversion of bank wire text and the extraction of entities from the converted text. We have currently tested our approach for person, organization and location, but it is easily extensible to other entities such as address, contact number and email information. The approach uses semantic information from WordNet for preprocessing, which can further be used to extract entities from similar types of datasets such as weblogs, DB logs and transaction logs.
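For reference, the precision and recall figures of the kind reported in Table 2 follow the standard definitions. The sketch below assumes exact set-based entity matching, which is our simplifying assumption; the paper does not state its matching criterion, and the example entities are illustrative.

```python
# Set-based precision/recall over extracted entities. Exact matching is an
# illustrative assumption; partial-match scoring would change the numbers.

def precision_recall(predicted, gold):
    tp = len(set(predicted) & set(gold))               # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

pred = {"ROBERT", "ABC INDIA LTD", "EVERITT"}
gold = {"ROBERT", "ABC INDIA LTD", "SCHIENDLER", "EVERITT"}
p, r = precision_recall(pred, gold)
print(p, r)  # 1.0 0.75
```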
References

[Apa14] Apache Software Foundation. openNLP Natural Language Processing Library, 2014. http://opennlp.apache.org/.

[BOB14] Steven Bethard, Philip Ogren, and Lee Becker. ClearTK 2.0: Design patterns for machine learning in UIMA. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3289–3293, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).

[BSW99] Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211–231, 1999.

[CC03] James R. Curran and Stephen Clark. Language independent NER using a maximum entropy tagger. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 164–167, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[ECD+04] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Methods for domain-independent information extraction from the web: An experimental comparison. In AAAI, pages 391–398, 2004.

[ECD+05] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134, 2005.

[GM14] Sonal Gupta and Christopher D. Manning. Improved pattern learning for bootstrapped entity extraction. In CoNLL, pages 98–108, 2014.

[KT07] Junichi Kazama and Kentaro Torisawa. Exploiting Wikipedia as external knowledge for named entity recognition. 2007.

[LMP+01] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML, volume 1, pages 282–289, 2001.

[Mil95] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November 1995.

[MMP03] James Mayfield, Paul McNamee, and Christine Piatko. Named entity recognition using hundreds of thousands of features. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 184–187, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[MNPT02] Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. A WordNet-based approach to named entities recognition. In Proceedings of the 2002 Workshop on Building and Using Semantic Networks - Volume 11, pages 1–7. Association for Computational Linguistics, 2002.

[MSB+14] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.

[PBFI12] Rebecca J. Passonneau, Collin Baker, Christiane Fellbaum, and Nancy Ide. The MASC word sense sentence corpus. In Proceedings of LREC, 2012.

[Pit] Pitney Bowes Software CIM Suite. http://www.pitneybowes.com/us/customer-information-management.html.

[RJ86] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(2):4–16, January 1986.

[Sie15] Scharolta Katharina Sienčnik. Adapting word2vec to named entity recognition. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania, number 109, pages 239–243. Linköping University Electronic Press, 2015.

[SM12] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(1):267–373, 2012.