=Paper=
{{Paper
|id=Vol-1395/paper_20
|storemode=property
|title=AMRITA — CEN@NEEL: Identification and Linking of Twitter Entities
|pdfUrl=https://ceur-ws.org/Vol-1395/paper_20.pdf
|volume=Vol-1395
|dblpUrl=https://dblp.org/rec/conf/msm/BNMRP15
}}
==AMRITA - CEN@NEEL: Identification and Linking of Twitter Entities==
AMRITA - CEN@NEEL: Identification and Linking of Twitter Entities

Barathi Ganesh H B, Abinaya N, Anand Kumar M, Vinayakumar R, Soman K P
Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India
barathiganesh.hb@gmail.com, abi9106@gmail.com, m_anandkumar@cb.amrita.edu, vinayakumarr77@gmail.com, kp_soman@amrita.edu

ABSTRACT
Short texts are posted and updated every now and then. With the global upswing of such micro posts, the need to retrieve information from them has become pressing. This work focuses on knowledge extraction from micro posts, using entities as evidence. The extracted entities are linked to their relevant DBpedia resources through featurization, Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and Word Sense Disambiguation (WSD). This short paper describes our contribution to the #Microposts2015 NEEL task, in which existing Machine Learning (ML) algorithms were experimented with.

Keywords
CRF, Micro posts, NER

1. INTRODUCTION
Micro posts are a pool of knowledge with scope in business analytics, public consensus, opinion mining, sentiment analysis and author profiling, and are therefore indispensable for Natural Language Processing (NLP) researchers. Owing to the limited size of micro posts, people use short forms and special symbols to convey their message easily, which has eventually built up complexity for traditional NLP tools [3]. Though a number of tools exist, most of them rely on ML algorithms that are more effective for long texts than for short texts. Thus, by providing sufficient features to these algorithms, the objective can still be achieved. We experimented on the NEEL task with the available NLP tools to evaluate their effect on entity recognition when the special features available in tweets are provided.

2. SELECTION OF ALGORITHMS

2.1 Tokenization
Tokenizing becomes highly challenging in micro posts due to the absence of lexical richness. Micro posts include special symbols (:-), #, @user), abbreviations, short words (lol, omg), misspelled words, repeated punctuation and unstructured words (goooood nightttt, helloooo). Hence the micro posts were fed to a dedicated Twitter tokenizer which accounts for language identification, a lookup dictionary of names, spelling correction and special symbols [4][5] for effective tokenization.

2.2 POS Tagger
Due to the conversational nature of micro blogs and their non-syntactic structure, it becomes difficult to utilize general algorithms with the traditional POS tags of the Penn Treebank and Wall Street Journal corpus [6]. O'Connor et al. used a 25-tag POS tagset which includes dedicated tags for Twitter (@user, hashtag, G, URL, etc.) and report about 90% accuracy on POS tagging [7]. The ability to relax independence assumptions and to overcome biasing problems makes the CRF a promising supervised algorithm for sequence labeling applications [8]. The TwitIE tagger, which utilizes a CRF to build the POS tagging model, was therefore used.

2.3 Named Entity Recognizer
CRF and SVM have produced promising outcomes for sequence labeling tasks, which prompted us to use them for our experiment. The long-range dependencies of the CRF can also help solve the Word Sense Disambiguation (WSD) problem over other graphical models by avoiding label and causal biasing during the learning phase. Both CRF and SVM allow us to utilize complicated features without modeling any dependency between them. SVM is also well suited for the sequence labeling task, since learning can be enhanced by incorporating cost models [9]. These advantages provide flexibility in building expressive models with the CRFsuite and MALLET tools [10][11].
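To make the sequence-labeling setup of Section 2.3 concrete, the following is a minimal sketch of a CRF-based NER tagger, assuming the python-crfsuite binding of CRFsuite [10]. The feature names, the toy tweet and its BIO labels are illustrative only; they are not the exact features or data used in the experiments (the full feature set and window size are described in Section 3).

```python
# Minimal CRF sequence-labeling sketch (assumes: pip install python-crfsuite).
# The features below are a small illustrative subset, not the paper's 34 features.
import pycrfsuite

def token_features(tokens, pos_tags, i):
    """Simple per-token features with a +/-1 context window."""
    w = tokens[i]
    feats = [
        "word.lower=" + w.lower(),
        "pos=" + pos_tags[i],
        "prefix1=" + w[:1],
        "suffix3=" + w[-3:],
        "first.upper=%s" % w[:1].isupper(),
        "has.digit=%s" % any(c.isdigit() for c in w),
        "is.mention=%s" % w.startswith("@"),
        "is.hashtag=%s" % w.startswith("#"),
    ]
    if i > 0:
        feats += ["prev.word=" + tokens[i - 1].lower(), "prev.pos=" + pos_tags[i - 1]]
    else:
        feats.append("BOS")  # beginning of sentence
    if i < len(tokens) - 1:
        feats += ["next.word=" + tokens[i + 1].lower(), "next.pos=" + pos_tags[i + 1]]
    else:
        feats.append("EOS")  # end of sentence
    return feats

# Toy BIO-labelled tweet (hypothetical example, not taken from the NEEL data).
tokens = ["Watching", "Barcelona", "at", "Camp", "Nou", "tonight"]
pos    = ["V", "^", "P", "^", "^", "N"]  # Twitter-style coarse POS tags [7]
labels = ["O", "B-Organization", "O", "B-Location", "I-Location", "O"]

xseq = [token_features(tokens, pos, i) for i in range(len(tokens))]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, labels)  # in practice, one append per training tweet
trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 100})
trainer.train("neel_ner.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("neel_ner.crfsuite")
print(tagger.tag(xseq))  # predicted BIO tags for the token sequence
```

Only a context window of size 1 is shown here for brevity; the experiments in Section 3 keep a bi-directional window of size 5 together with 34 additional features.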
3. EXPERIMENTS AND OBSERVATION
The experiment was conducted on an i7 processor with 8 GB RAM, and the flow of the experiment is shown in Figure 1.

[Figure 1: Overall Model Structure]

The training dataset consists of 3498 tweets, each with a unique tweet id. These tweets have 4016 entities with 7 unique tags, namely Character, Event, Location, Organization, Person, Product and Thing [1][2]. The POS tags for NER are obtained from the TwitIE tagger after tokenization, which takes care of the nature of micro posts and provides the output desired by the POS tagger model. The entity tags are mapped to BIO tagging of named entities: considering the entity as a phrase, the token at the beginning of the phrase is tagged as 'B-(original tag)' and tokens inside the phrase are tagged as 'I-(original tag)'. The feature vector is constructed with the POS tag and 34 additional features such as the root word, word shapes, prefixes and suffixes of length 1 to 4, length of the token, start and end of the sentence, and binary features indicating whether the word contains uppercase or lowercase letters, special symbols or punctuation, first-letter capitalization, combinations of alphabets with digits, punctuation and symbols, tokens of length 2 and 4, etc. After constructing the feature vector for the individual tokens in the training set, a bi-directional window of size 5 is kept so that the feature statistics of nearby tokens are also observed, which helps the WSD. The final windowed training sets are passed to the CRF and SVM algorithms to produce the NER model.

The development data has 500 tweets, along with their ids, and 790 entities [1][2]. The development data is also tokenized, tagged and feature-extracted in the same way as the training data, for testing and tuning the model. The performance of the developed model is evaluated by 10-fold cross validation on the training set and validated against the development data. The accuracy is computed as the ratio of the number of correctly identified entities to the total number of entities, as in Equation (1), and tabulated in Table 1.

Accuracy = (correctly identified entities / total entities) × 100    (1)

Table 1: Observations
Tools      10-Fold Cross Validation (%)   Development Data (%)   Time (mins)
MALLET     84.9                           82.4                   168.31
SVM        79.8                           76.3                   20.15
CRFsuite   88.9                           85.2                   4.12

MALLET incorporates O-LBFGS, which is well suited for log-linear models, but shows reduced performance when compared to CRFsuite, which uses LBFGS for optimization [12][13]. The lower performance of the SVM can be improved by increasing the number of features, which will not introduce any overfitting or sparse-matrix problems [9].

The final entity linking part is done by utilizing a lookup dictionary (DBpedia 2014) and sentence similarity. The entity's tokens are given to the lookup dictionary, which returns a few related links. The final link assigned to the entity is chosen by the maximum similarity score between the related links and the proper nouns in the test tweet. The similarity score is computed as the dot product between the unigram vector of the proper nouns in the test tweet and the unigram vector of each related link from the lookup dictionary. An entity without related links is assigned NIL.
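Below is a minimal sketch of this linking step. The candidate links returned by the DBpedia lookup dictionary are represented here by a hypothetical candidate_links mapping, and the unigram vectors are plain term-count vectors; only the dot-product scoring and the NIL fallback follow the description above.

```python
# Sketch of the dot-product entity linking described above (illustrative only).
from collections import Counter

def unigram_vector(text):
    """Term-count (unigram) vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def dot(u, v):
    """Dot product of two sparse count vectors."""
    return sum(count * v.get(term, 0) for term, count in u.items())

def link_entity(tweet_proper_nouns, candidate_links):
    """Assign the candidate DBpedia link that best matches the tweet's proper nouns.

    candidate_links is a hypothetical lookup result: a dict mapping a DBpedia URI
    to its label/description text. An entity with no candidates is assigned NIL.
    """
    if not candidate_links:
        return "NIL"
    query = unigram_vector(" ".join(tweet_proper_nouns))
    scores = {uri: dot(query, unigram_vector(text)) for uri, text in candidate_links.items()}
    return max(scores, key=scores.get)

# Hypothetical usage: candidates as a DBpedia lookup might return them for "Camp Nou".
candidates = {
    "http://dbpedia.org/resource/Camp_Nou": "Camp Nou football stadium Barcelona",
    "http://dbpedia.org/resource/Barcelona": "Barcelona city Catalonia Spain",
}
print(link_entity(["Camp", "Nou", "Barcelona"], candidates))
```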
4. DISCUSSION
This experimentation is about sequence labeling for entity identification from micro posts, extended with DBpedia resource linking. From Table 1 it is clear that the CRF shows the best performance and paves the way for building a smart NER model for streaming-data applications. Even though the CRF seems reliable, it is dependent on the features, which have a direct relation with NER accuracy. The TwitIE tagger shows promising performance in both the tokenization and the POS tagging phases. The 34 special features extracted from the tweets improve efficacy by nearly 13% compared with a model trained without the special features. In the linking part, this work is limited to dot-product similarity, which could be improved by including semantic similarity.

5. REFERENCES
[1] Rizzo, Giuseppe, Cano Basave, Amparo Elizabeth, Pereira, Bianca and Varga, Andrea. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In 5th Workshop on Making Sense of Microposts (#Microposts2015), pp. 44-53, 2015.
[2] Matthew Rowe, Milan Stankovic and Aba-Sah Dadzie. Proceedings, 5th Workshop on Making Sense of Microposts (#Microposts2015): Big things come in small packages, Florence, Italy, 18th of May 2015, 2015.
[3] Dlugolinsky S, Marek Ciglan and M Laclavik. Evaluation of named entity recognition tools on microposts. In INES 2013, pp. 197-202. IEEE, 2013.
[4] Bontcheva K, Derczynski L, Funk A, Greenwood M A, Maynard D and Aswani N. TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text. In RANLP, pp. 83-90, September 2013.
[5] Brendan O'Connor, Michel Krieger and David Ahn. TweetMotif: Exploratory Search and Topic Summarization for Twitter. ICWSM, 2010.
[6] Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller and Justin Martineau. Annotating named entities in Twitter data with crowdsourcing, 2010.
[7] Kevin Gimpel, et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. HLT'11, 2011.
[8] John Lafferty, Andrew McCallum and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001.
[9] Chun-Nam John Yu, Thorsten Joachims, Ron Elber and Jaroslaw Pillardy. Support vector training of protein alignment models. In Research in Computational Molecular Biology, 2007.
[10] Naoaki Okazaki. CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.
[11] Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu, 2002.
[12] Galen Andrew and Jianfeng Gao. Scalable Training of L1-Regularized Log-Linear Models. ICML, 2007.
[13] Jorge Nocedal. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation, Volume 35, Number 151, pp. 773-782, 1980.