A Deep Learning Approach towards Cross-Lingual Tweet Tagging

Nikhil Bharadwaj Gosala, BITS Pilani, Hyderabad (nikhil.gosala@gmail.com)
Shalini Chaudhuri, BITS Pilani, Hyderabad (shalini_chaudhuri@yahoo.co.in)
Monica Adusumilli, BITS Pilani, Hyderabad (adusumillimonica@gmail.com)
Kartik Sethi, BITS Pilani, Hyderabad (kartik1295@gmail.com)

ABSTRACT
Named Entity Recognition (NER) is important in analysing the context of a statement and the sentiments associated with it. Although Twitter data is noisy, it is valuable because of the amount of information it can provide, so NER for Twitter data is necessary. Our model extracts the named entities from tweets using a Recurrent Neural Network core. Long Short-Term Memory (LSTM) was used to learn long-term dependencies in our supervised learning model, and the model was implemented with a sequence-to-sequence architecture.

CCS Concepts
•Information systems → Information extraction;

Keywords
Recurrent Neural Network, Tweet Tagging

1. INTRODUCTION
Sequence tagging, and especially Named Entity Recognition (NER), has long been a classic NLP task [1], and a lot of research has been directed towards it over the past couple of decades. The tagged entities output by the NER module play a significant role in many other applications: for instance, these tags are widely used in measuring the sentiment in a sequence of posts, finding the context of a message, and identifying key elements referred to in a set of documents. The tags can be very generic (such as 'noun') or specific (such as 'name of person') depending on the task at hand. Generic tags are usually helpful in learning the structure of an unknown language and automatically generating new sentences in it, while specific tags are widely used by search engines to generate user- and product-specific advertisements.

Twitter data stores a lot of information that, when extracted and processed properly, can offer a great deal of knowledge. Tweets are among the most up-to-date and inclusive sources of information currently available on the internet, largely due to Twitter's low barrier of entry and the wide use of mobile devices [4]. Although tweets follow basic grammar rules, they are extremely noisy and difficult to comprehend, and many traditional NER tools fail badly when tagging them. The human brain, on the contrary, does a very good job of making sense of these kinds of tweets. This is the motivation behind exploring the field of Artificial Neural Networks. Thus, to tag tweets efficiently by modelling a very basic version of the human brain, Recurrent Neural Networks, and especially the Long Short-Term Memory (LSTM) model, were used.

2. RELATED WORK
Most of the existing NER taggers are based on linear statistical models such as the Hidden Markov Model [3] and Conditional Random Fields [2]. More recently, owing to their promising results in sequence tagging tasks, Convolutional Neural Networks have gained a lot of attention for the task of Named Entity Recognition. The use of RNNs, and especially LSTMs, is discussed extensively in [1], which demonstrates the strong performance of RNNs on sequence tagging.

3. SYSTEM DESCRIPTION
The aim of the task was to tag Twitter data that contained a mixture of both Hindi and English tokens. The system can be subdivided into the following three modules:

1. Pre-Processing
2. RNN Core
3. Post-Processing

3.1 Pre-Processing
Pre-processing is an essential step before processing any kind of data: it removes unwanted values and reduces the noise in the dataset. Accordingly, the pre-processing phase was used to clean and structure the data into a form that could be read by our tagging model. It comprises the following stages:

1. Removal of HTML Escape Characters: It was observed that many of the HTML escape characters were not replaced by their system-equivalent characters. For example, &amp; was present instead of the normal & token. Such HTML escape characters were handled using the 'html' package in Python.

2. Tweet Tokenization: The tweets were tokenized using regular expressions. As is typical of Twitter data, some words/tokens can easily be tokenized by looking at their regular expression. For instance, Twitter handles always start with an @, so any token that starts with @ is a Twitter handle. This observation helped us tokenize the Twitter data to great effect. The following are the regular expressions used to tokenize the tweets (a sketch combining these patterns appears at the end of this subsection):

(a) Emoticons: r"(?:[:=;][oO\-]?[D\)\]\(\]/)"
(b) HTML Tags: r'<[^>]+>'
(c) Twitter Handles: r'(?:@[\w_]+)'
(d) Hash Tags: r'(?:\#+[\w_]+[\w\'_\-]*[\w_]+)'
(e) Whitespaces: r'[\n\t\r ]+'
(f) URLs: r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]
(g) Numbers: r'(?:(?:\d+,?)+(?:\.?\d+)?)'
(h) Words with ' and -: r"(?:[a-z][a-z'\-_]+[a-z])"
(i) Other Words: r'(?:[\w_]+)'
(j) Everything Else: r'(?:\S)'

3. Stop Word Removal: Stop words are words that occur far too frequently to have any effect on the classification or tagging task. They were handled by using the stop-word corpus from NLTK, extended with additional words and punctuation that occur far too frequently.

4. Unicode Emoji Removal: Some of the emojis could not be captured using the regular expressions above. For such cases, Unicode ranges were used to strip the tweets of emojis.

5. Rule Tagging: Owing to the structure of Twitter data, some tokens can be tagged directly based on regular expressions. For instance, any token that begins with a # can be categorized as a hash tag, and any token that begins with @ is a Twitter handle. Such tokens were tagged using regular expressions and custom tags. The custom tags added to the corpus were HTML_TAG, TWITTER_HANDLE, HASH_TAG, WHITESPACE, URL, EMOTICON, RT and OTHER.

6. Common Misspelling Mapping: Owing to the 140-character limit of Twitter, many tweets contain common SMS lingo. For example, the word "for" is commonly written as "4", and the word "because" is written as "bcoz", "coz", "bcz" and so on. All such misspellings were mapped to the correct spelling of the word to reduce the number of unique words in the corpus.

7. Token List Generation: The pre-tagged text was added to different lists based on the tag. These lists were used later in the tagging process to tag tokens that could not be tagged by the model.

The output of the Pre-Processing module was a .vert file, a format commonly used by many commercially available Part-of-Speech taggers.
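The tokenizer itself is not listed in the paper. The following is a minimal Python sketch of how the regular expressions above could be combined using the standard re module; the function names, the ordering of the alternatives, and the completions of the emoticon and URL patterns (which are cut off in the listing above) are assumptions rather than the authors' code.

    import re

    # Patterns from Section 3.1, tried in order.  The emoticon "mouth" class and the
    # URL pattern are cut off in the paper; common full forms are assumed here.
    regex_strings = [
        r"(?:[:=;][oO\-]?[D\)\]\(\]/\\OpP])",                # emoticons (completion assumed)
        r"<[^>]+>",                                          # HTML tags
        r"(?:@[\w_]+)",                                      # Twitter handles
        r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",                    # hash tags
        r"[\n\t\r ]+",                                       # whitespace
        r"http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+",  # URLs (completion assumed)
        r"(?:(?:\d+,?)+(?:\.?\d+)?)",                        # numbers
        r"(?:[a-z][a-z'\-_]+[a-z])",                         # words with ' and -
        r"(?:[\w_]+)",                                       # other words
        r"(?:\S)",                                           # everything else
    ]

    tokens_re = re.compile("(" + "|".join(regex_strings) + ")", re.IGNORECASE)

    def tokenize(tweet):
        """Split a raw tweet into tokens, dropping pure-whitespace matches."""
        return [tok for tok in tokens_re.findall(tweet) if not tok.isspace()]

    print(tokenize("RT @user: gr8 phone 4 sure :) #happy http://t.co/abc"))
    # ['RT', '@user', ':', 'gr8', 'phone', '4', 'sure', ':)', '#happy', 'http://t.co/abc']

Rule tagging (stage 5) can then be applied to the resulting tokens by testing each one against the handle, hash-tag and URL patterns individually.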
3.2 RNN Core
After studying various models for NER tagging, Deep Learning, and in particular Recurrent Neural Networks (RNNs), was chosen for the task of tweet tagging. Of the several RNN architectures available, we decided to go with the sequence-to-sequence (seq2seq) model, in which each input token has a corresponding tag associated with it. This property of the seq2seq model was consistent with the Twitter data provided, and it was therefore used for tweet tagging (each token in the Twitter data had a corresponding tag; if it did not, we assigned a custom tag to it).

LSTMs are a special kind of RNN capable of learning long-term dependencies. An LSTM cell has multiple gates that define which data is retained and which is forgotten; by training the weights of these gates, one can control how much information is kept. In our implementation, each node in the RNN was a GRU cell. A GRU cell is very similar in function to an LSTM cell but is computationally much more efficient, and with efficiency in mind the GRU cell was chosen over an LSTM cell.

An RNN consists of multiple hidden layers, and each layer contains multiple nodes. Each node, like a neuron in the human brain, can retain some information and make decisions based on that information. The complexity of the model can be varied by changing the number of nodes per layer or the number of hidden layers.

A supervised approach was chosen to train the RNN. In the supervised setting, the desired output (target data) is provided along with the training data. The network processes the inputs and compares the resulting outputs against the desired outputs; errors are propagated back through the system, causing it to adjust the weights that control the RNN. This process is repeated many times, and the weights are continuously tweaked. To reduce the error, the Adam optimizer was used instead of the more common Gradient Descent or Stochastic Gradient Descent algorithms. This decision was made because the TensorFlow implementation of the Adam optimizer uses moving averages of the parameters to tweak the weights. The main advantage of this approach is that it allows a larger effective step size and therefore converges much faster than the Gradient Descent algorithm.
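The network code is not given in the paper. The sketch below shows, in TensorFlow 1.x style, one way a stacked-GRU sequence-to-sequence tagger trained with the Adam optimizer could be wired up. Only the layer count, nodes per layer, learning rate and decay rate come from the paper (the Run 1 values are used); the vocabulary size, tag count, embedding size, decay schedule and variable names are assumptions.

    import tensorflow as tf  # TensorFlow 1.x API

    # Hypothetical corpus sizes; only the hyper-parameters below them are from the paper.
    vocab_size, num_tags, embed_dim = 10000, 20, 128
    num_layers, num_units = 3, 192            # Run 1: 3 layers, 192 nodes per layer
    base_lr, decay_rate = 0.003, 0.97         # learning rate and decay rate from Section 4

    tokens = tf.placeholder(tf.int32, [None, None])   # [batch, time] token ids
    tags = tf.placeholder(tf.int32, [None, None])     # [batch, time] tag ids (one per token)

    embeddings = tf.get_variable("embeddings", [vocab_size, embed_dim])
    inputs = tf.nn.embedding_lookup(embeddings, tokens)

    # Stack of GRU cells, as described in Section 3.2.
    cells = [tf.nn.rnn_cell.GRUCell(num_units) for _ in range(num_layers)]
    outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.MultiRNNCell(cells), inputs,
                                   dtype=tf.float32)

    # One tag prediction per input token (sequence-to-sequence labelling).
    logits = tf.layers.dense(outputs, num_tags)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tags, logits=logits))

    # Adam optimizer with an exponentially decayed learning rate (decay step is assumed).
    step = tf.train.get_or_create_global_step()
    lr = tf.train.exponential_decay(base_lr, step, 1000, decay_rate)
    train_op = tf.train.AdamOptimizer(lr).minimize(loss, global_step=step)
    predictions = tf.argmax(logits, axis=-1)          # predicted tag for every token

A standard training loop would repeatedly run train_op on batches of token/tag id matrices; at inference time, predictions yields one tag per token, and the remaining untagged (OTHER) cases are handled by the post-processing described next.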
3.3 Post-Processing
After the model was trained, it was used to predict tags for the data. Because the model was not 100% accurate, some of the tokens were left untagged (i.e. they were tagged with the custom tag OTHER). These tokens were then checked against the token lists, and any untagged token that was found in a token list was tagged with the corresponding tag.

This step also included the removal of tags that were not part of the annotated tags file. For instance, custom tags such as URL, HASH_TAG, TWITTER_HANDLE and so on were removed from the final output file to keep it consistent with the given annotated file.

Apart from the steps above, this stage also included a function to merge any two consecutive tokens having the same tag into a single phrase. As an example, Nikhil Bharadwaj would have been composed of the two tokens Nikhil and Bharadwaj with the same tag. Because these tokens are consecutive and share the same tag (NAME), they are merged and the phrase Nikhil Bharadwaj is tagged as NAME.
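As an illustration of this merging step, here is a small Python sketch; the function name and the (token, tag) pair representation are assumptions, not the authors' code.

    def merge_consecutive(tagged_tokens):
        """Merge consecutive tokens that carry the same tag into a single phrase.

        tagged_tokens: list of (token, tag) pairs, e.g. the tagger's output.
        """
        merged = []
        for token, tag in tagged_tokens:
            if merged and merged[-1][1] == tag:
                # Same tag as the previous token: extend the previous phrase.
                merged[-1] = (merged[-1][0] + " " + token, tag)
            else:
                merged.append((token, tag))
        return merged

    print(merge_consecutive([("Nikhil", "NAME"), ("Bharadwaj", "NAME"), ("rocks", "OTHER")]))
    # [('Nikhil Bharadwaj', 'NAME'), ('rocks', 'OTHER')]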
4. EVALUATION AND RESULTS
Two runs were performed to tag the Twitter data. Both runs used the RNN model, but the parameters, i.e. the number of hidden layers and the number of nodes per layer, were modified. The learning rate in both runs was set to 0.003 and the decay rate to 0.97. In Run 1, 3 layers were used with 192 nodes per layer; in Run 2, 4 layers were used with 256 nodes per layer. In Run 1 the final error after all the iterations was around 0.6, whereas in Run 2 the final error was around 0.45.

The results obtained with this model were encouraging. In Run 1, an accuracy of 59.28% and a recall of 19.64% were achieved, with an F1 score of 29.50. In Run 2, an accuracy of 61.80% and a recall of 26.39% were achieved, with an F1 score of nearly 37. These numbers show that the more complex model (Run 2) captured considerably more information than the less complex one.

The future direction of research focuses on improving accuracy by making the annotated data much more comprehensive. While analysing the tagged output, it was observed that a lot of tokens were tagged as OTHER. This was because most of the tokens in the corpus did not carry any tag, which made the model extremely biased towards assigning OTHER to any unknown token. The problem was made less severe by adding custom tags to the data based on regular expressions; for instance, any token beginning with a # must be a hash tag, so the tag HASH_TAG was assigned to it. It was also observed that many words in the corpus did not occur frequently enough for the model to learn much about them. This issue could be resolved either by using external tagged data or by making the corpus much more comprehensive.

5. REFERENCES
[1] Z. Huang, W. Xu, and K. Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), volume 1, pages 282–289, 2001.
[3] A. McCallum, D. Freitag, and F. C. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, pages 591–598, 2000.
[4] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics, 2011.