Hierarchical Classification for Multilingual Language Identification and Named Entity Recognition

Saatvik Shah, CSE, MNIT, Jaipur, Rajasthan-302017, saatvikshah1994@gmail.com
Vaibhav Jain, CSE, MNIT, Jaipur, Rajasthan-302017, vj.25071996@gmail.com
Sarthak Jain, ECE, MNIT, Jaipur, Rajasthan-302017, sarthakssj@gmail.com
Anshul Mittal, EE, MNIT, Jaipur, Rajasthan-302017, anshulmittal712@gmail.com
Jatin Verma, EE, MNIT, Jaipur, Rajasthan-302017, jatin.verma.205@gmail.com
Shubham Tripathi, EE, MNIT, Jaipur, Rajasthan-302017, stripathi1770@gmail.com
Rajesh Kumar, EE, MNIT, Jaipur, Rajasthan-302017, rkumar.ee@gmail.com

ABSTRACT
This paper describes the approach for Subtask-1 of the FIRE-2015 Shared Task on Mixed Script Information Retrieval. The subtask involved multilingual language identification (including mixed words and anomalous foreign words), named entity recognition (NER) and subclassification. The proposed methodology starts with cleaning the data and then extracting structural and contextual features from the text for further processing. A subset of these features is selected (based on validation) for training supervised classifiers, separately for language identification and NER. Finally, the classifiers are applied hierarchically to annotate the entire text. The detected named entities are further subclassified by a novel unsupervised technique based on query refinement and keyword-based scoring. On the testing dataset of the shared task, the proposed approach showed promising results with a weighted F-measure of 0.8082. However, it is worth noting that the classifiers have been sub-optimal at discriminating between certain linguistically similar languages (e.g., Hindi and Gujarati pairs). The proposed approach is flexible and robust enough to handle additional languages for identification as well as anomalous foreign or extraneous words. The implementation of the approach has also been shared for future research usage.

Keywords
Language Identification; NER; Information Retrieval; Wikipedia; Query Refinement; Ensembling

1. INTRODUCTION
The ubiquitous use of indigenous languages in social media and search engines has led to an increasing need for techniques to tackle mixed script information retrieval. In such an environment, language and named entity identification has become both a necessary and challenging task. At this juncture, the shared task on mixed script information retrieval at the FIRE workshop serves as an excellent source of insights into the solution of such a problem. This paper addresses language identification (LI) and Named Entity Recognition (NER) for text in social media. Such text may involve, for example, switching of language within a sentence (code switching), or even words composed of mixed languages (code mixing). In addition, people use phonetic typing and irregular spellings in social media ('2nyt' in place of 'tonight' or 'y' in place of 'why'). All these challenges make the correct identification of language in social media data a tough task.

The task required rigorous preprocessing of both the training (validation) and testing datasets. Section 2 describes, in detail, the approach followed for it. The rest of the paper is organized as follows. Section 3 describes all the features extracted. Sections 4, 5 and 6 describe the hierarchical classification approach implemented for tagging tokens into their respective categories. Sections 7 and 8 cover the Results and Conclusion. A web tool has been developed for the implemented approach and can be accessed at https://mixscian.herokuapp.com. The source code has been shared at https://github.com/saatvikshah1994/hline for future research purposes.

2. PREPROCESSING
During training, the following preprocessing steps were performed (a minimal code sketch of steps 2-4 follows this section):

1. The data was prepared by skipping all the X's present in the input file, i.e., punctuation, numerals, emoticons, mentions (words starting with @), hashtags (words starting with #) and acronyms, directly with respect to the given annotation file.
2. A Unicode-based cleanup of words was then performed, mapping Unicode characters to their closest matching ASCII representatives (e.g., "Zine™" to "ZineTM").
3. After that, each utterance was formatted to remove any punctuation marks appearing in between letters.
4. Finally, erroneous utterances (with an unequal number of words and labels) were discarded.

During testing, the process started with the Unicode-based cleanup, followed by removal of punctuation or any other invalid characters in between letters from tokens.
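The following is a minimal, illustrative sketch of steps 2-4 above, not the authors' released implementation. It uses only the Python standard library; the helper names (`clean_token`, `strip_inner_punctuation`, `preprocess_utterance`) are hypothetical.

```python
import re
import unicodedata

def clean_token(token):
    """Unicode-based cleanup: map characters to their closest ASCII
    representatives, e.g. 'Zine\u2122' -> 'ZineTM' (step 2)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def strip_inner_punctuation(token):
    """Remove punctuation marks appearing in between letters (step 3)."""
    return re.sub(r"(?<=\w)[^\w\s]+(?=\w)", "", token)

def preprocess_utterance(tokens, labels):
    """Discard erroneous utterances with unequal numbers of words and
    labels (step 4); clean the tokens of the remaining utterances."""
    if len(tokens) != len(labels):
        return None  # discarded as erroneous
    cleaned = [strip_inner_punctuation(clean_token(t)) for t in tokens]
    return cleaned, labels
```

The NFKD-then-ASCII idiom reproduces the paper's "Zine™" to "ZineTM" example exactly, since the compatibility decomposition of the trademark sign is the letter pair "TM".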
[Figure 1: Hierarchical Classification Workflow]

3. FEATURE EXTRACTION
The proposed technique uses a comprehensive set of features [1] for both language identification and NER. These are explained briefly below (a sketch of the normalization features, items 7 and 8, follows this section):

1. Word Context: features extracted from the token (word) being labeled and its surrounding context. Considering w0 as the current token, a subset of all features was taken for up to 2 context words on both the left and the right (w-2, w-1, w0, w+1, w+2).
2. Prefix and Suffix: prefixes and suffixes, defined as variable-length character sequences (here, of lengths 3 and 2), are stripped from each token and used as features for the classifier.
3. Stemming/Lemmatization: a step applying stemming, lemmatization or both to input tokens before extracting features was added to the feature extraction pipeline and tested.
4. Character-level n-grams: a character n-gram is a contiguous sequence of n characters extracted from a given word. Character n-grams of length two (bigram), three (trigram), four (quad-gram) and five (penta-gram) are extracted and used as features.
5. POS tags: POS tagging (or grammatical tagging) marks the part of speech of a word in the corpus based on both its definition and its context [2].
6. Relative Position: the relative position of the token within its utterance was added as a feature.
7. Word Normalization: word normalization captures similarity between words with common structure; e.g., "HahaHHa123" is normalized to "AaaaAAa000", with "A" denoting a capital letter, "a" a lowercase letter and "0" a numerical character.
8. Compressed Word Normalization: the output of Word Normalization is compressed to remove repetitions of identical adjacent characters; e.g., "AaaaAAa000" becomes "AaAa0".
9. Composition features: a feature was defined whose value depended on the letter-case and character composition of the word. The following four values were possible:
• AllCaps (the word is made up of only capital letters),
• AllSmall (the word is made up of only lowercase letters),
• WordDigit (the word contains both digits and letters),
• InitCaps (the first character is uppercase).

These features, having variable length for every token under consideration, had to be further processed into a Bag-of-Words form. A term frequency matrix was computed for them, resulting in sparse matrices which were finally passed on to the classifiers.
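As an illustration of items 7 and 8 above, here is a minimal sketch of the two normalization features; the function names are hypothetical, not taken from the released code.

```python
import re

def word_normalize(token):
    """Map each character to a class symbol: 'A' for uppercase,
    'a' for lowercase, '0' for digits (other characters kept as-is).
    E.g. 'HahaHHa123' -> 'AaaaAAa000'."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

def compress_normalized(norm):
    """Collapse runs of the same class symbol.
    E.g. 'AaaaAAa000' -> 'AaAa0'."""
    return re.sub(r"(.)\1+", r"\1", norm)

assert word_normalize("HahaHHa123") == "AaaaAAa000"
assert compress_normalized("AaaaAAa000") == "AaAa0"
```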
4. PUNCTUATION (X) RECOGNITION
This section describes the methodology used for tagging X's, i.e., tokens which are punctuation marks, numerals, mentions, hashtags, acronyms or emoticons. The process begins with a rule-based approach for isolating the hashtags, mentions and numerals, by checking for tokens that start with '@' or '#', or tokens composed entirely of digits. The tokens are then checked for web URLs, email addresses and punctuation. A regular-expression-based classification was implemented to search for tokens starting with 'http://' or 'https://', or any token of the format 'abc@pqr.xyz'. Also, any token composed only of punctuation marks is isolated and tagged as X. Another regular expression check is applied to catch any emoticon such as ':-)', ';)', ':}', etc. A dictionary-based approach is then applied to check whether a token exists in a list of popular acronyms; if there is a match, the corresponding token is also classified as an X. Once these X's have been filtered out, a cleaning operation was performed on the remaining tokens. Cleaning is defined as converting Unicode characters to the nearest matching ASCII characters, for example converting "Zine™" to "ZineTM". After this preprocessing, the 'cleaned' tokens are sent for Named Entity Recognition. A rule-based sketch of this stage is given below.
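A minimal rule-based sketch of the X-recognition stage follows. The exact regular expressions and the acronym dictionary are assumptions for illustration; the paper does not publish its patterns.

```python
import re
import string

# Assumed patterns; the paper's actual expressions are not published.
URL_RE      = re.compile(r"^https?://\S+$")
EMAIL_RE    = re.compile(r"^\S+@\S+\.\S+$")          # 'abc@pqr.xyz' format
EMOTICON_RE = re.compile(r"^[:;=8][\-o*']?[)\](\[dDpP/}{@|\\]$")
ACRONYMS    = {"lol", "omg", "brb", "idk"}           # illustrative subset

def is_x(token):
    """Return True if the token should be tagged as X."""
    if token.startswith(("@", "#")):                  # mention / hashtag
        return True
    if token.isdigit():                               # numeral
        return True
    if URL_RE.match(token) or EMAIL_RE.match(token):  # URL / email address
        return True
    if all(ch in string.punctuation for ch in token): # pure punctuation
        return True
    if EMOTICON_RE.match(token):                      # emoticon like ':-)'
        return True
    return token.lower() in ACRONYMS                  # known acronym
```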
5. NAMED ENTITY RECOGNITION
This task focuses on tagging those tokens which are named entities, i.e., names of persons, locations, organizations, etc. in the given text. Each named entity can be:

1. classified as {NE}, OR
2. sub-classified as {NE_L, NE_P, NE_O, NE_PA, NE_LA, NE_OA, NE_Ls, NE_X, NE_XA}

Methodology: the tagging of named entities has been divided into two steps: (1) a supervised approach for binary classification and (2) an unsupervised approach for sub-classification.

5.1 Supervised Approach
The supervised approach includes training classifiers [3], namely a linear-kernel SVM (using liblinear), Logistic Regression and a Random Forest model, on the extracted features explained in Section 3. The input data for training included this year's training data along with the dataset of FIRE 2014. All of these were implemented using transform pipelines containing a Clean transform (to clean the data) and a Feature Extractor. Finally, a term frequency matrix was used to obtain numeric features. Based on individual feature cross-validation, the most important features that contributed to identifying the named entities were:

• Word (and Compressed) Normalization
• Local Context (with a smaller window)
• Relative Position
• Prefix/Suffix
• POS Tags
• Composition Features

An ensemble model for NER was developed which performs a logical OR on the predicted results of the individual classifiers to decide the final tag for a particular token. One (1) denotes that a token is a named entity and zero (0) a non-named entity:

Tag_out = 0 OR (Tag_classifier_1) OR (Tag_classifier_2) OR ... OR (Tag_classifier_n)    (1)

where Tag_out is the predicted result for each token, Tag_classifier_k is the tag predicted by classifier k, and n is the number of classifiers used. A sketch of this voting rule is given below.
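A minimal sketch of the logical-OR ensemble of equation (1), using scikit-learn-style classifiers. The three estimator choices mirror the paper, but the hyperparameters, class name and training interface are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

class OrEnsembleNER:
    """Tag a token as NE (1) if ANY base classifier predicts 1,
    per equation (1); otherwise tag it as non-NE (0)."""

    def __init__(self):
        self.models = [LinearSVC(),              # linear-kernel SVM (liblinear)
                       LogisticRegression(),
                       RandomForestClassifier()]

    def fit(self, X, y):
        for model in self.models:
            model.fit(X, y)
        return self

    def predict(self, X):
        # Stack per-classifier 0/1 predictions column-wise, then take the
        # row-wise maximum, which is exactly a logical OR over binary tags.
        preds = np.column_stack([m.predict(X) for m in self.models])
        return preds.max(axis=1)
```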
5.2 Unsupervised Approach
An unsupervised approach based on parsing Wikipedia was employed to find out whether a predicted named entity is a person, location or organization [6]. The Wikipedia page of a named entity provides information comprising the title of the page, the summary, the section headings, the infobox contents, the categories section, etc., indicating whether it is a person, location or organization. The prediction is done using three reference sets, each containing keywords that are commonly used in the context of a person, location or organization. For a person, keywords can be 'born', 'Education', 'President', etc.; for a location, they can be 'place', 'nation', etc.; for an organization, they can be 'party', 'company', etc. While parsing the Wikipedia page of 'Narendra Modi', we will surely come across the word 'President', declaring 'Narendra Modi' as 'NE_P', i.e., a person.

However, it was found that for some named entities like AAP, Mark, etc., Wikipedia displays a disambiguation page, whereas some named entities like Sonia, Pradep, etc. did not even have a Wikipedia page. Thus, a systematic approach overcoming these shortcomings of plain Wikipedia parsing was formulated, as described below.

Assuming the named entity to be a single word, let the named entity that is to be sub-classified be denoted by the variable token. The following procedure is applied (a condensed sketch of steps 2-5 appears at the end of this section):

1. First, any punctuation marks, numerals or special characters are removed from token.
2. token is searched using the Wikipedia API for Python, which returns an object containing the information about token extracted from its Wikipedia page.
3. Using this object, all the above-mentioned information is concatenated to form a single string denoted wiki-extract. For convenience, the information is first converted to lowercase, and any punctuation marks, numerals, Unicode characters, etc. are removed. Each word in wiki-extract is lemmatized and, to counter the dependence on internet access and decrease processing time, the final wiki-extract is cached in a file against the corresponding token.
4. The reference-set keywords are searched for in wiki-extract. The number of times a keyword is spotted in wiki-extract is the score of the subclass ('NE_P', 'NE_L' or 'NE_O') to which the keyword belongs.
5. After all keyword searches are completed, each subclass has a final score. If the difference between the maximum subclass score and the second maximum is greater than or equal to a confidence threshold (5 in this paper), the token is annotated with the subclass of maximum score; otherwise it is annotated as 'en'. If the final scores of all subclasses are zero, it is likewise marked as 'en'.

If, in step 2, Wikipedia raises a disambiguation error, the token is searched on Bing using its Python API, 'py_bing_search'. As long as the data under process belongs to social networking sites covering popular topics like politics, movies, etc., the Bing results help in finding the most popular/probable meaning of the token in many ambiguous cases. The steps followed are:

i. Using 'py_bing_search', token appended with the word 'wikipedia' (e.g. 'AAP wikipedia') is searched on Bing so that it returns URLs of Wikipedia pages at top priority. Only the first three URLs are considered.
ii. Only those URLs that point to a Wikipedia page other than a disambiguation page are considered.
iii. As soon as the desired URL is obtained, the proper term that would not cause a disambiguation error and could replace token is extracted from that URL. For example, if the token was 'AAP', Bing will return the URL 'https://en.wikipedia.org/wiki/Aam_Aadmi_Party', and the term 'Aam Aadmi Party' extracted from this URL will replace token.
iv. The rest of the steps, searching the new token on Wikipedia, are the same as steps 2 to 5 above.

If, among the three URLs, none satisfies the condition of being a non-disambiguation Wikipedia page, then under the disambiguation error there is an option to fetch the alternatives offered by the Wikipedia disambiguation page. Only the first five options are considered in this paper. These five options are concatenated to form wiki-extract, and the rest of the procedure is the same as steps 3 to 5, keeping the confidence threshold at zero.

If, in step 2, Wikipedia raises a page error, then Bing search is used again to replace token with a proper term that would not cause the page error; the steps followed are the same as steps i to iv. Again, if among the three URLs obtained from 'py_bing_search' none satisfies the condition of being a non-disambiguation Wikipedia page, the token is annotated as 'en'.

Apart from the above algorithm, it is constantly checked whether the term succeeding token is also a named entity (provided token is not positioned last in the current utterance); if so, the two are treated as a single entity and searched on Wikipedia together. For example, suppose the tokens are 'White' and 'House', both marked as named entities. If we search 'White' on Wikipedia we will not get the desired results, but if we search 'White House' we may well get it as a location, and thus both 'White' and 'House' are annotated as 'NE_L'. If the tokens are misspelled or ambiguous, Wikipedia may raise a page or disambiguation error; in that case, the combined token is first searched on Bing to find a term that could replace it. If this fails to find a non-disambiguation Wikipedia page, the process is reverted and the two tokens are searched individually again, following the procedure discussed earlier.
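The condensed sketch of steps 2-5 referenced above follows. It assumes the `wikipedia` package (the "Wikipedia API for Python" named in step 2) and uses illustrative reference-set keywords; the function name and the keyword lists are assumptions, and the lemmatization and caching of step 3 are omitted for brevity.

```python
import re
import wikipedia  # the 'Wikipedia API for Python' used in step 2

# Illustrative reference sets; the paper's full keyword lists are larger.
REFERENCE_SETS = {
    "NE_P": ["born", "education", "president"],
    "NE_L": ["place", "nation"],
    "NE_O": ["party", "company"],
}

def subclassify(token, threshold=5):
    """Steps 2-5: build wiki-extract, score each subclass by keyword
    counts, and apply the confidence threshold (5 in the paper)."""
    page = wikipedia.page(token)  # may raise Disambiguation/Page errors
    extract = " ".join([page.title, page.summary, " ".join(page.categories)])
    extract = re.sub(r"[^a-z\s]", " ", extract.lower())  # lowercase, strip rest
    words = extract.split()

    scores = {sub: sum(words.count(kw) for kw in kws)
              for sub, kws in REFERENCE_SETS.items()}

    best, second = sorted(scores.values(), reverse=True)[:2]
    if best == 0 or best - second < threshold:
        return "en"  # all-zero scores or insufficient confidence
    return max(scores, key=scores.get)
```

The Bing-based fallbacks for disambiguation and page errors (steps i to iv) are not shown; in the paper they recover a replacement term from a Wikipedia URL and then re-run these same steps.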
6. LANGUAGE IDENTIFICATION
This section addresses the problem of identifying the language of origin of a query term in a code-mixed query [4]. For language identification, a linear-kernel SVM (using liblinear) classifier was used, with the important features being:

• Word n-grams (bi-grams, tri-grams, quad-grams and penta-grams)
• Local Knowledge
• Part-of-Speech Tags
• Composition Features

Parameter optimization was performed using cross-validation-based tuning. The supervised approach [5] was implemented with the classifier trained on the current year's training dataset and the previous year's dataset. A transform pipeline similar to the one used for NER was implemented for training the classifiers. A multilevel classification was performed, with the possible tags for each token being:

1. one of the languages in L = {(en), (hi), (bn), (gu), (ka), (ml), (mr), (ta), (te)};
2. a token made of two languages, {MIX}, and its subcategory {MIX_Lr_Ls}, where r and s are the languages of the root and the suffix respectively;
3. none of the above, {O}.

For the purpose of training, the datasets were cleaned, and tokens annotated as NE (or its subcategories) or as X were removed. The trained classifier was then applied on the testing set to obtain a prediction for each token.

7. RESULTS
In this section, results are reported for the proposed experiments. To investigate and find the most effective of the methods described in the preceding sections, results were initially obtained on the development (training) set, and the final model was selected using 4-fold cross-validation. Thereafter, a run was submitted for Subtask-1. The overall performance was measured in terms of Utterance, Token and NE accuracies, along with precision, recall and F-measure scores for individual languages and the sub-divided NEs. The Average F-measure and Weighted F-measure scores obtained were 0.654 and 0.808 respectively. In comparison, the Mean Average F-measure and Weighted F-measure scores were 0.5395 and 0.6990, and the Max Average F-measure and Weighted F-measure scores were 0.6917 and 0.8299. Per-label results are given in Table 1.

       Bn     En     Gu     Hi     Kn     Ml     Mr     Ta     Te     X      NE     NE_L   NE_P
P      0.878  0.958  0.097  0.817  0.575  0.394  0.705  0.937  0.431  0.961  0.368  0.722  0.2121
R      0.966  0.848  0.5    0.74   0.829  0.752  0.79   0.708  0.687  0.966  0.528  0.124  0.25
F-S    0.838  0.9    0.163  0.776  0.679  0.517  0.745  0.806  0.529  0.964  0.433  0.214  0.229

Table 1: Strict Precision (P), Recall (R) and F-scores (F-S) for languages and NEs.

8. CONCLUSION
In this paper, a system was described for Subtask-1, i.e., Query Word Labelling, in the FIRE Shared Task 2015 on Mixed Script Information Retrieval. The validation procedure indicated that the context of a token, its POS tag, its structure and its normalization play key roles in both NER and language identification. It is interesting to note that the performance on linguistically dissimilar languages such as English and Bengali has been top notch. On the other hand, similar language pairs such as Hindi and Marathi have invariably led to the classifier getting confused. The use of search engines to refine query terms, followed by the use of publicly available encyclopedias such as Wikipedia for subclassifying named entities, is a newly proposed technique that has given promising results.

9. REFERENCES
[1] Gupta, D. K., Kumar, S., & Ekbal, A. Machine Learning Approach for Language Identification & Transliteration: Shared Task Report of IITP-TS.
[2] Abhinaya, N., John, N., Anand Kumar, M., & Soman, K. P. (2014). Amrita @ FIRE-2014: Named Entity Recognition for Indian Languages.
[3] Dubey, S., Goel, B., Prabhakar, D. K., & Pal, S. ISM @ FIRE-2014: Named Entity Recognition Indian Languages.
[4] King, B., & Abney, S. P. (2013, June). Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In HLT-NAACL (pp. 1110-1119).
[5] Das, A., & Gambäck, B. (2014). Code-Mixing in Social Media Text: The Last Language Identification Frontier? Traitement Automatique des Langues (TAL): Special Issue on Social Networks and NLP, TAL Volume 54 no 3/2013, Pages 41-64.
[6] Nothman, J., Ringland, N., Radford, W., Murphy, T., & Curran, J. R. (2013). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194, 151-175.