=Paper= {{Paper |id=Vol-1395/paper_20 |storemode=property |title=AMRITA — CEN@NEEL: Identification and Linking of Twitter Entities |pdfUrl=https://ceur-ws.org/Vol-1395/paper_20.pdf |volume=Vol-1395 |dblpUrl=https://dblp.org/rec/conf/msm/BNMRP15 }} ==AMRITA — CEN@NEEL: Identification and Linking of Twitter Entities== https://ceur-ws.org/Vol-1395/paper_20.pdf
           AMRITA - CEN@NEEL: Identification and Linking of Twitter Entities

                Barathi Ganesh H B, Abinaya N, Anand Kumar M, Vinayakumar R, Soman K P
                    Centre for Excellence in Computational Engineering and Networking
                             Amrita Vishwa Vidyapeetham, Coimbatore, India
        barathiganesh.hb@gmail.com, abi9106@gmail.com, m_anandkumar@cb.amrita.edu,
                       vinayakumarr77@gmail.com, kp_soman@amrita.edu
ABSTRACT
Short texts are posted and updated constantly. With the global upswing of such micro posts, the need to retrieve information from them has become pressing. This work focuses on knowledge extraction from micro posts, using entities as evidence. The extracted entities are linked to their relevant DBpedia resources through featurization, Part Of Speech (POS) tagging, Named Entity Recognition (NER) and Word Sense Disambiguation (WSD). This short paper describes our contribution to the #Microposts2015 NEEL task, in which we experiment with existing Machine Learning (ML) algorithms.

Keywords
CRF, Micro posts, NER

Copyright (c) 2015 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2015 Workshop proceedings, available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395). #Microposts2015, May 18th, 2015, Florence, Italy.

1. INTRODUCTION
Micro posts are a pool of knowledge with applications in business analytics, public consensus, opinion mining, sentiment analysis and author profiling, and are thus indispensable for Natural Language Processing (NLP) researchers. Because of the limited size of micro posts, people use short forms and special symbols to convey their message, which complicates the use of traditional NLP tools [3]. Though a number of tools exist, most rely on ML algorithms that are more effective on long texts than on short texts. By providing sufficient features to these algorithms, the objective can still be achieved. We approached the NEEL task with the available NLP tools to evaluate their effect on entity recognition when supplied with the special features available in tweets.

2. SELECTION OF ALGORITHMS

2.1 Tokenization
Tokenizing micro posts is highly challenging due to their lack of lexical richness. They include special symbols (:-), #, @user), abbreviations, short words (lol, omg), misspelled words, repeated punctuation and unstructured words (goooood nightttt, helloooo). Hence the micro posts were fed to a dedicated Twitter tokenizer that accounts for language identification, a lookup dictionary of names, spelling correction and special symbols [4][5] for effective tokenization.

2.2 POS Tagger
Due to the conversational nature of micro blogs and their non-syntactic structure, it is difficult to apply general-purpose algorithms with the traditional POS tags of the Penn Treebank and Wall Street Journal corpora [6]. O'Connor et al. used a 25-tag POS tagset that includes dedicated tags (@user, hashtag, G, URL, etc.) for Twitter and report 90% accuracy on POS tagging [7]. The ability to resolve independence assumptions and to overcome biasing problems makes CRF a promising supervised algorithm for sequence labeling applications [8]. The TwitIE tagger, which uses a CRF to build the POS tagging model, was therefore used.

2.3 Named Entity Recognizer
CRF and SVM have produced promising results on sequence labeling tasks, which prompted us to use them in our experiment. The long-range dependencies of the CRF can also address the Word Sense Disambiguation (WSD) problem better than other graphical models by avoiding label and causal biasing during the learning phase. Both CRF and SVM allow us to use complicated features without modeling any dependency between them. SVM is also well suited for sequence labeling since learning can be enhanced by incorporating cost models [9]. These advantages provide flexibility in building expressive models with the CRFsuite and MALLET tools [10][11].

3. EXPERIMENTS AND OBSERVATION
The experiment was conducted on an i7 processor with 8GB RAM, and the flow of the experiment is shown in Figure 1. The training dataset consists of 3498 tweets, each with a unique tweet id. These tweets contain 4016 entities with 7 unique tags, namely Character, Event, Location, Organization, Person, Product and Thing [1][2]. The POS tags for the NER are obtained from the TwitIE tagger after tokenization, which takes care of the nature of micro posts and provides the outcome desired by the POS tagger model. The tags are mapped to BIO tagging of named entities: considering the entity as a phrase, the token at the beginning of the phrase is tagged as 'B-(original tag)' and tokens inside the phrase are tagged as 'I-(original tag)'. A feature vector is constructed from the POS tag and 34 additional features such as the root word, word shapes, prefixes and suffixes of length 1 to 4, token length, start and end of sentence, and binary features: whether the word contains uppercase, lowercase, special symbols or punctuation, first-letter capitalization, combinations of alphabet with digits, punctuation and symbols, tokens of length 2 and 4, etc. After constructing the feature vector for each token in the training set, a bi-directional window of size 5 is kept so that the feature statistics of nearby tokens are also observed, which helps the WSD. The final windowed training sets are passed to the CRF and SVM algorithms to produce the NER model. The development data has 500 tweets, along with their ids, and 790 entities [1][2]. The development data is tokenized, tagged and feature-extracted in the same way as the training data for testing and tuning the model.

[Figure 1: Overall Model Structure]

The developed model's performance is evaluated by 10-fold cross validation on the training set and validated against the development data. Accuracy is computed as the ratio of the number of correctly identified entities to the total number of entities, and is tabulated in Table 1:

    Accuracy = (correctly identified entities / total entities) x 100    (1)

                        Table 1: Observations
    Tools      10-Fold Cross Validation  Development Data  Time (mins)
    Mallet     84.9                      82.4              168.31
    SVM        79.8                      76.3               20.15
    CRFSuite   88.9                      85.2                4.12

MALLET incorporates O-LBFGS, which is well suited for log-linear models, but shows reduced performance compared to CRFsuite, which uses LBFGS for optimization [12][13]. SVM's low performance can be improved by increasing the number of features, which will not introduce any overfitting or sparse-matrix problems [9].

The final entity linking is done using a lookup dictionary (DBpedia 2014) and sentence similarity. The entity's tokens are given to the lookup dictionary, which returns a few related links. The final link assigned to the entity is based on the maximum similarity score between the related links and the proper nouns in the test tweet. The similarity score is computed as the dot product between the unigram vectors of the proper nouns in the test tweet and the unigram vectors of the related links from the lookup dictionary. An entity without related links is assigned NIL.

4. DISCUSSION
This experiment concerns sequence labeling for entity identification in micro posts, extended with DBpedia resource linking. From Table 1 it is clear that CRF performs best and paves the way for building a practical NER model for streaming-data applications. Even though CRF is reliable, it depends on the features, which have a direct relation with NER accuracy. The TwitIE tagger shows promising performance in both the tokenization and POS tagging phases. The 34 special features extracted from the tweets improve efficacy by nearly 13% over a model trained without them. In the linking part, this work is limited to dot-product similarity, which could be improved by including semantic similarity.

5. REFERENCES
[1] Rizzo, Giuseppe, Cano Basave, Amparo Elizabeth, Pereira, Bianca and Varga, Andrea, Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge, In 5th Workshop on Making Sense of Microposts (#Microposts2015), pp. 44-53, 2015.
[2] Matthew Rowe, Milan Stankovic and Aba-Sah Dadzie, Proceedings, 5th Workshop on Making Sense of Microposts (#Microposts2015): Big things come in small packages, Florence, Italy, 18th of May 2015, 2015.
[3] Dlugolinsky S, Marek Ciglan and M Laclavik, Evaluation of named entity recognition tools on microposts, INES 2013, pp. 197-202, IEEE, 2013.
[4] Bontcheva K, Derczynski L, Funk A, Greenwood M A, Maynard D, and Aswani N, TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text, In RANLP, pp. 83-90, September 2013.
[5] Brendan O'Connor, Michel Krieger and David Ahn, TweetMotif: Exploratory Search and Topic Summarization for Twitter, ICWSM, 2010.
[6] Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller and Justin Martineau, Annotating named entities in Twitter data with crowdsourcing, 2010.
[7] Kevin Gimpel, et al., Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments, HLT'11, 2011.
[8] John Lafferty, Andrew McCallum and Fernando Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001.
[9] Chun-Nam John Yu, Thorsten Joachims, Ron Elber and Jaroslaw Pillardy, Support vector training of protein alignment models, in Research in Computational Molecular Biology, 2007.
[10] Naoaki Okazaki, CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.
[11] Andrew Kachites McCallum, MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu, 2002.
[12] Galen Andrew and Jianfeng Gao, Scalable Training of L1-Regularized Log-Linear Models, ICML, 2007.
[13] Jorge Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation, Volume 35, Number 151, pp. 773-782, 1980.
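The kind of Twitter-aware tokenization motivated in Section 2.1 can be sketched with a few regular-expression rules. This is only an illustrative sketch, not the TwitIE/TweetMotif tokenizer the paper actually uses [4][5]; the patterns shown cover just mentions, hashtags, URLs, simple emoticons and words.

```python
import re

# Illustrative patterns only -- the paper relies on the TwitIE tokenizer [4][5].
TOKEN_RE = re.compile(r"""
    (?:@\w+)               # @user mentions
  | (?:\#\w+)              # hashtags
  | (?:https?://\S+)       # URLs
  | (?:[:;=][-^]?[)(DPp])  # simple emoticons such as :-)
  | (?:\w+(?:'\w+)?)       # words, possibly with an apostrophe
  | (?:[^\s\w])            # remaining punctuation, one character at a time
""", re.VERBOSE)

def tokenize(tweet):
    """Split a tweet into tokens, keeping Twitter-specific symbols intact."""
    return TOKEN_RE.findall(tweet)
```

For example, `tokenize("goooood nightttt @user :-) #NEEL")` keeps the mention, emoticon and hashtag as single tokens instead of splitting them on punctuation.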
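The BIO mapping described in Section 3 (phrase-initial tokens tagged 'B-(tag)', phrase-internal tokens 'I-(tag)', everything else 'O') can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the entity-span representation is an assumption.

```python
def to_bio(tokens, entities):
    """Map entity phrases onto per-token BIO tags.

    tokens:   list of token strings for one tweet
    entities: list of (start, end_exclusive, tag) triples indexing `tokens`,
              with tag one of the 7 NEEL types (Person, Location, ...)
    """
    tags = ["O"] * len(tokens)               # non-entity tokens stay 'O'
    for start, end, tag in entities:
        tags[start] = "B-" + tag             # first token of the phrase
        for i in range(start + 1, end):      # remaining tokens of the phrase
            tags[i] = "I-" + tag
    return tags
```

A multi-token entity such as "New York City" thus becomes `B-Location I-Location I-Location`, which is the target label sequence for the CRF and SVM models.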
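A handful of the 34 surface features listed in Section 3 (word shape, prefixes and suffixes of length 1 to 4, capitalization and digit/symbol tests) plus the bi-directional window of size 5 can be sketched as below. The exact feature set and encoding are not given in the paper, so the choices here are assumptions for illustration.

```python
import re

def token_features(token, pos_tag):
    """A subset of the surface features described in Section 3 (illustrative)."""
    feats = {
        "word": token.lower(),
        "pos": pos_tag,
        "length": len(token),
        "is_upper": token.isupper(),
        "is_title": token.istitle(),                    # first-letter capitalization
        "has_digit": any(c.isdigit() for c in token),
        "has_symbol": bool(re.search(r"[^A-Za-z0-9]", token)),
        # word shape: lowercase -> x, uppercase -> X, digit -> d
        "shape": re.sub(r"\d", "d", re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", token))),
    }
    for n in range(1, 5):                               # prefixes/suffixes of length 1..4
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    return feats

def windowed_features(tokens, pos_tags, i, size=5):
    """Add the word and POS of neighbours inside a bi-directional window."""
    half = size // 2
    feats = token_features(tokens[i], pos_tags[i])
    for offset in range(-half, half + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"word@{offset}"] = tokens[j].lower()
            feats[f"pos@{offset}"] = pos_tags[j]
    return feats
```

Each token's dictionary would then be serialized into the attribute format expected by CRFsuite or MALLET before training.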
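The linking step of Section 3 (dot product between unigram vectors of the tweet's proper nouns and each candidate link's text, NIL when the lookup returns nothing) can be sketched as follows. Candidate retrieval from the DBpedia 2014 lookup dictionary is stubbed out here as a plain dict, and the function names are hypothetical.

```python
from collections import Counter

def dot_similarity(text_a, text_b):
    """Dot product of the unigram count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    return sum(va[w] * vb[w] for w in va)   # missing words count as zero

def link_entity(proper_nouns, candidate_links):
    """Assign the candidate DBpedia link that best matches the tweet.

    candidate_links: dict mapping a candidate URI to its associated text
    (e.g. words from the lookup dictionary entry). Returns 'NIL' when the
    lookup produced no candidates, mirroring the paper's NIL assignment.
    """
    if not candidate_links:
        return "NIL"
    query = " ".join(proper_nouns)
    return max(candidate_links,
               key=lambda uri: dot_similarity(query, candidate_links[uri]))
```

With semantic similarity (as suggested in Section 4), `dot_similarity` would be the natural component to replace.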




· #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015