=Paper=
{{Paper
|id=Vol-2036/T1-2
|storemode=property
|title=AmritaNLP@PAN-RusProfiling : Author Profiling using Machine Learning Techniques
|pdfUrl=https://ceur-ws.org/Vol-2036/T1-2.pdf
|volume=Vol-2036
|authors=Vivek Vinayan,Naveen J R,Harikrishnan NB,Anand Kumar M,Soman KP
|dblpUrl=https://dblp.org/rec/conf/fire/VinayanRBMP17
}}
==AmritaNLP@PAN-RusProfiling : Author Profiling using Machine Learning Techniques==
Vivek Vinayan, Naveen J R, Harikrishnan NB, Anand Kumar M and Soman KP
Center for Computational Engineering and Networking, Amrita University, Coimbatore, India
vivekvinayan82@gmail.com, naveenaksharam@gmail.com, harikrishnannb07@gmail.com, m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
This paper describes our work on the "Gender Identification in Russian texts (RusProfiling)" shared task, hosted by PAN in conjunction with FIRE 2017. The task is to predict an author's gender from a Russian-language Twitter corpus.
We give a brief introduction to the task, describe the dataset provided by the competition organizers, discuss various feature selection methods, present the experimental analysis we followed for feature representation, and compare the outcomes of the different classifiers we used for validation.
We submitted a total of 3 models, each with a slightly different pre-processing technique chosen according to the content of the test corpus, together with their predictions for each test dataset. As the test corpora were sourced from several different platforms, it was challenging to stick to a single representation.
As per the global ranking published for the shared task [6], our team secured 2nd position overall (concatenating all datasets), and our 3rd submission performed the best among our 3 submitted models on the overall test corpus.
Under extended work we briefly discuss how hyper-parameter tuning of certain attributes raised our validation accuracy by 6% over the baseline.

KEYWORDS
Author Profiling, Russian Language, Text Classification, Semi-supervised Classifiers

ACM Reference Format:
Vivek Vinayan, Naveen J R, Harikrishnan NB, Anand Kumar M and Soman KP. 2017. AmritaNLP@PAN-RusProfiling: Author Profiling using Machine Learning Techniques.

1 INTRODUCTION
The Internet is a vast platform where anyone can access myriads of information, from online news articles to various social media platforms, from personal blogs to personalized websites, all of it literally at our fingertips; in the present age, life is becoming unimaginable without it. With the availability of all these resources, people write and share information over the internet more avidly than ever before, and it also provides a certain degree of anonymity while doing so. Access to such multitudinous information brings a certain set of problems, such as theft of identity or content and plagiarism, which we try to address with tasks such as "Author Profiling".
In the "Author Profiling" shared task [9], we examine the style of an individual author and distinguish between classes of authors by studying their sociolect aspects. More broadly, this helps predict an author's demographics, personality, education and social networks by classifying texts into classes based on the stylistic choices of the author.
With this paper on the RusProfiling shared task, we focus on cross-genre gender identification in Russian texts [6], one of the trending emerging tasks in the NLP domain under "Author Profiling" [7, 8].
In this task the training corpus comes from Twitter, while the test corpus is drawn from multiple social media platforms: Twitter, Facebook, offline texts (texts describing images, or letters to a friend), and online reviews of products and services. The focus of the task is gender profiling in social media; the main interest is in everyday language and in how basic social and personality traits are reflected in writing [3, 5].
The main challenge in this task is the language itself, as it is not native to us. We therefore applied certain pre-processing methods and built a baseline representation, on top of which we implemented classical machine learning algorithms for this text classification task.
2 CORPUS
The training corpus was sourced mainly from the social media website Twitter, and each document was annotated with the author's gender, "male" or "female". The training corpus is a collection of 600 data files in XML format, consisting of exactly half female and half male documents; the file names are mapped to their associated gender labels in a separate plain-text file called "truth".

Table 1: Training Dataset Statistics
Total number of documents           600
Total number of male documents      300
Total number of female documents    300

A cursory analysis of the training corpus revealed that each training file contains a combination of different tags and hyperlinks. Further, the documents varied widely in content-word count, from one document with no text at all to others with over 3000 words in a single document. A few files mixed Russian and English, while a few others were entirely in English.
The test corpus is presented in 5 folders, one per source category. Each set contains a different number of files, varying from 96 to 776 per category. On further inspection, the text format in every folder except the 3rd differs from the training corpus; the folders contain offline texts, Facebook, Twitter, product and online reviews, and a gender imitation corpus, in order of folder number, as shown in Tables 1-2.

Table 2: Testing Dataset Statistics
DS1 - Offline Texts (picture description etc.)    370
DS2 - Facebook                                    228
DS3 - Twitter                                     400
DS4 - Online Reviews                              776
DS5 - Gender Imitation Corpus                      94
Total number of documents                        1868

We also collected statistics on the complete vocabulary size obtained from a grid search over the attributes n-gram range and min_df count. The vocabulary size for each combination, on the training dataset and on each test set, is tabulated in Table 3.

Table 3: Vocabulary Size based on min_df and n-gram range
(min_df, n-gram)  Training   DS1       DS2      DS3      DS4      DS5
(1,1)             183119     96544     52098    16077    9184     6797
(1,2)             852380     390683    234284   80991    39732    25139
(1,3)             1698948    746445    451658   164853   75790    46435
(1,4)             2583433    1114714   672974   252602   111942   68167
(2,1)             45223      22482     17278    5989     3438     2151
(2,2)             91646      38869     29643    13529    6528     3562
(2,3)             106192     43324     31965    16190    7291     3898
(2,4)             109234     44206     32232    17074    7565     3987
(3,1)             28365      13188     9973     3801     2190     1233
(3,2)             106192     19702     15086    7202     3381     1767
(3,3)             52940      20923     15708    8098     3526     1864
(3,4)             53646      21065     15745    8381     3545     1886
(4,1)             20677      9228      6910     2805     1594     821
(4,2)             32403      12953     9931     4903     2305     1101
(4,3)             34876      13551     10226    5380     2376     1139
(4,4)             35184      13604     10237    5512     2383     1145
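For concreteness, a corpus in the format described above can be loaded along the following lines. This is a minimal sketch only: the truth-file delimiter (':::'), the XML layout, and the function and path names are illustrative assumptions, since the exact organizer-provided formats are not reproduced here.

```python
import os
import xml.etree.ElementTree as ET

def load_corpus(corpus_dir, truth_path):
    """Read the XML training files and their gender labels (sketch)."""
    # Assumed truth format: one "<file_id>:::<gender>" pair per line;
    # adjust the separator to whatever the organizers actually used.
    labels = {}
    with open(truth_path, encoding="utf-8") as fh:
        for line in fh:
            file_id, gender = line.strip().split(":::")
            labels[file_id] = gender

    texts, genders = [], []
    for name in sorted(os.listdir(corpus_dir)):
        if not name.endswith(".xml"):
            continue
        file_id = name[: -len(".xml")]
        tree = ET.parse(os.path.join(corpus_dir, name))
        # Concatenate all text nodes; some documents are empty,
        # others exceed 3000 words (see Section 2).
        text = " ".join(t.strip() for t in tree.getroot().itertext() if t.strip())
        texts.append(text)
        genders.append(labels[file_id])
    return texts, genders
```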
3 METHODOLOGY
Figure 1 gives a rudimentary picture of the architecture we implemented for our 3 submissions. In all of these models we mainly focused on data pre-processing methods to incorporate various features, building on each one in turn to improve the feature representation. We started from a simple count-based model; the methods are discussed next.

[Figure 1: Architecture of our model for the shared task]

3.1 Feature Selection
Feature selection was a process in which we started by building a baseline model [1] and improved on its accuracy with a step-by-step empirical procedure of combining and modifying the existing feature representation [10]. The steps are listed below; a code sketch of the normalization follows the list.

- Count Based Matrix: Our 1st approach was to form simple count-based Term Document (TD) and Term Frequency Inverse Document Frequency (TFIDF) matrices from the dataset, which became the baseline for our accuracy; we then proceeded to add general features to this representation.

- Feature Extraction: With knowledge of the social media network Twitter, we narrowed our search for features down to tags: '@', which is mainly used to address people or groups, and the hash tag '#', which relates to the context or image of the adjoining content. Moving on, we found that URLs linking to various internet sources were used widely across most of the dataset, so we incorporated these as a feature in the earlier representation, which showed a slight improvement on all of the classification algorithms, as captured in Figures 3-4.

- Data Normalization: On further analysis we found that individual URLs in themselves seemed fruitless as features, so, considering only the presence of a hyperlink, we normalized them across the dataset and used the count of URLs and of tags as features to represent a document. This increased the accuracy a little more. It further led us to normalize various emoticons, each replaced by a keyword, and various punctuation marks such as the exclamation mark '!', period '.' and hyphen '-': any mark occurring multiple times or in continuous repetition was converted to a single instance.

- Word Average: As we were not familiar with the language, we computed the average word length per document, i.e. the total number of characters in the document divided by the total number of feature instances in it, and appended it as an independent feature. This accommodates the fact that the average vocabulary word length used by each gender can also serve as a discriminative feature between the 2 classes.
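To make the normalization steps above concrete, here is a minimal sketch of the kind of pre-processing described; the exact regular expressions and the emoticon keyword are our illustrative assumptions, not the paper's implementation. The TFIDF representation is built with scikit-learn's TfidfVectorizer, whose min_df and ngram_range parameters correspond to the attributes tabulated in Table 3.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize(text):
    """Apply the normalization steps of Section 3.1 (illustrative)."""
    text = text.lower()
    # Collapse every URL to the single keyword 'https'.
    text = re.sub(r"https?://\S+", "https", text)
    # Keep '@' and '#' as separate tokens so their counts survive.
    text = re.sub(r"[@#]", lambda m: " " + m.group(0) + " ", text)
    # Replace a few emoticons with a keyword; this list is only a
    # sample, not the one used in the paper.
    text = re.sub(r"[:;]-?[)(D]", " emoticon ", text)
    # Reduce runs of '!', '.' and '-' to a single instance.
    text = re.sub(r"([!.\-])\1+", r"\1", text)
    return text

def avg_word_length(text):
    """Characters per feature instance: the 'Word Average' feature."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

# TFIDF matrix over the normalized corpus; (1, 2) and 3 are sample
# values for the attributes explored in Table 3.
vectorizer = TfidfVectorizer(preprocessor=normalize,
                             ngram_range=(1, 2), min_df=3)
```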
4 EXPERIMENT AND DISCUSSIONS
As part of the experimental analysis we manually went over a few random training documents, chosen by file size, to get a glimpse of the overall variation in the data, and then ran snippets over these training data to gauge how much each of the parameters considered above improved accuracy, so as to identify the better feature representation for classification. After going through these transitions, the selected features were extracted and applied as pre-processing over the entire training corpus. The features were added one set at a time to show the increase or decrease in cross-validation accuracy under different classical ML classifiers, namely Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, Decision Tree (DT), Adaboost and Random Forest (RF) [2]; the results are displayed in Table 4.

Table 4: Cross-validation Result with Different Classifiers (accuracy, %)
SN  Matrix  LR     SVM (linear)  DT     Adaboost  RF
1   TD      63.33  79.66         74.00  83.00     82.66
1   TFIDF   70.33  72.50         70.00  83.16     81.83
2   TD      66.70  81.83         75.00  85.16     84.16
2   TFIDF   72.16  75.83         75.33  80.83     81.66
3   TD      61.70  81.83         74.83  83.99     84.99
3   TFIDF   72.80  78.00         68.16  82.49     80.83
4   TD      66.70  81.00         74.10  85.66     82.99
4   TFIDF   74.00  78.00         68.00  81.66     82.83
5   TD      70.00  79.83         74.49  85.33     83.66
5   TFIDF   74.00  77.50         67.16  82.83     81.66

[Figure 2: Pre-processing example, showing a raw Russian training tweet (with '@' mentions, URLs, emoticons and repeated punctuation) and the same text after normalization]
[Figure 3: TD classifier accuracy]
[Figure 4: TFIDF classifier accuracy]

The following are the features we considered, one step at a time, each added consecutively to the previous set (a sketch of the corresponding cross-validation follows the list):
(1) A simple count-based matrix is taken to achieve the baseline accuracy. We considered both TD and TFIDF matrix representations, from which we set a baseline accuracy of 80.5% (we randomly initialized a few attributes, n-gram range = 2 and min_df = 3, and used a linear SVM classifier to obtain the baseline).
(2) Counts of 'http' and 'https' are taken and converted to the single keyword 'https'. This feature captures URL usage across the 2 classes, distinguishing which gender tends to use more hyperlinks within their tweets.
(3) The count of '#' tags was further added to the previous representation.
(4) Emoticons were replaced with a keyword.
(5) The average word length per document, i.e. the count of characters divided by the number of feature instances, was taken; given the unfamiliar language, we chose this as a preferred method.
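The cross-validation behind Table 4 can be reproduced in outline as follows. This is a minimal sketch using scikit-learn's stock implementations of the five classifiers with library-default hyper-parameters, which are not necessarily those used in the paper; `texts`, `genders` and `normalize` are the sketches introduced earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "LR": LogisticRegression(),
    "SVM linear": LinearSVC(),
    "DT": DecisionTreeClassifier(),
    "Adaboost": AdaBoostClassifier(),
    "RF": RandomForestClassifier(),
}

# Evaluate both feature representations, TD (raw counts) and TFIDF,
# with 5-fold cross-validation, mirroring the layout of Table 4.
for name, matrix in [("TD", CountVectorizer(preprocessor=normalize)),
                     ("TFIDF", TfidfVectorizer(preprocessor=normalize))]:
    X = matrix.fit_transform(texts)
    for clf_name, clf in classifiers.items():
        scores = cross_val_score(clf, X, genders, cv=5)
        print(name, clf_name, round(100 * scores.mean(), 2))
```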
5 FEATURE REPRESENTATION MODELS
We submitted a total of 3 models/runs; for each individual run the following pre-processing was followed:

- Submission 1: We considered feature representations 2, 3, 4 and 5 above, plus the normalization of '@' followed by content tags to a simple keyword (splitting tags from their context so as to preserve the word content did not, in particular, show much difference in validation accuracy), and used an SVM classifier for classification. Predictions for the test corpora were taken from the model learned on the training corpus.

- Submission 2: The same feature representation as the 1st run was considered, but with a different classifier: we trained an Adaboost-based model and took its predictions for the test corpora.

- Submission 3: In this run we mostly considered the other test datasets 1, 2, 4 and 5, whose content is longer and in paragraph form rather than the shorter tweet form, with little to no use of tags or hyperlinks. To normalize for this we disregarded the tags used above and reduced any extended repetition of punctuation to a single instance (e.g. '...' is shortened to a single '.'). A sample of this is shown in Figure 2.

6 RESULTS
As per the global ranking published by the shared task organizers [6], our team secured 2nd position overall (concatenating all datasets). From the rankings, our 3rd submission performed the best of our 3 submissions, beating our previous 2 by margins of 1% and 2% respectively, whereas we trailed the leading team by a margin of 6%. This is in line with the choices described for submission 3, and we also obtained better validation accuracy with the submission 3 model on datasets 1, 2, 4 and 5.
Individually, submission 3 gained the 2nd best accuracy on the "offline texts (picture descriptions, letter to a friend etc.) from RusProfiling corpus", while submission 1 gained our team 2nd place on the "gender imitation corpus" and 3rd on the "product and service online reviews corpus".

7 EXTENDED WORK
In our earlier experiments we randomly initialized our attributes, max feature length, n-gram and min_df, to 10000, 2 and 3 respectively. To increase validation accuracy we performed a grid search over the hyper-parameters word count, n-gram and min_df with the TFIDF model, considering the following range of values for each (a grid-search sketch follows below):

Word count: 10000 - 50000
n-gram range: 2 - 6
min_df: 1 - 4

After applying the grid search we pushed the baseline accuracy to 82.5%, initializing max_features to 10,000, n-gram to 2 and min_df to 1 and applying a linear SVM classifier. We further pushed the validation accuracy to 86.49% by applying an Adaboost classifier.
Overall we found that the accuracy of the TD feature representation model decreased as all of the attributes increased, while the accuracy of the TFIDF feature representation model increased but saturated once the n-gram value exceeded 6 and the min_df value exceeded 4. The same is shown in Figure 5, where the best attribute and feature combinations were taken for each TD and TFIDF representation.

[Figure 5: Hyper-parameter tuning for feature representation]
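A grid search of this kind can be expressed with scikit-learn's Pipeline and GridSearchCV. The sketch below uses the ranges quoted above; the step sizes within each range, and the interpretation of "n-gram range 2-6" as (1, n) spans, are our assumptions, as the paper does not state them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TFIDF features feeding a linear SVM, as in the baseline setup.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalize)),
    ("svm", LinearSVC()),
])

# Ranges from Section 7; step sizes are assumptions.
param_grid = {
    "tfidf__max_features": [10000, 20000, 30000, 40000, 50000],
    "tfidf__ngram_range": [(1, n) for n in range(2, 7)],
    "tfidf__min_df": [1, 2, 3, 4],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(texts, genders)
print(search.best_params_, round(100 * search.best_score_, 2))
```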
8 CONCLUSION & FUTURE WORK
The challenge we faced in this shared task was that we were working on a language corpus non-native to us; we therefore focused mainly on pre-processing and normalizing the data corpus to obtain an improved feature representation. We built up from a basic count representation, incorporated simple modifications while iterating the feature representation, and observed the accuracy changes associated with those features. Based on the experimental analysis, and on the further discussion of attribute optimization in the extended work, we infer that the baseline can be raised further, which could improve the predictions and yield a better gender identification model.
As a future study we are considering various embedded representations for the Russian corpus and deep learning techniques for categorizing author gender [11]. As these methods require a larger number of training instances, we are considering including an additional corpus provided by PAN [4] for this task, as well as certain portions of the labelled test dataset, chosen according to the variety of sources they are taken from.

REFERENCES
[1] H.B. Barathi Ganesh, M. Anand Kumar, and K.P. Soman. 2016. Statistical semantics in context space: Amrita_CEN@Author Profiling. CEUR Workshop Proceedings 1609 (2016), 881-889.
[2] H.B. Barathi Ganesh, U. Reshma, and M. Anand Kumar. 2015. Author identification based on word distribution in word space. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2015), 1519-1523. https://doi.org/10.1109/ICACCI.2015.7275828
[3] Fabio Celli, Bruno Lepri, Joan-Isaac Biel, Daniel Gatica-Perez, Giuseppe Riccardi, and Fabio Pianesi. 2014. The Workshop on Computational Personality Recognition. (2014).
[4] Tatiana Litvinova, Olga Litvinova, Olga Zagorovskaya, Pavel Seredin, Aleksandr Sboev, and Olga Romanchenko. 2016. "RusPersonality": A Russian corpus for authorship profiling and deception detection. In Intelligence, Social Media and Web (ISMW FRUCT), 2016 International FRUCT Conference on. IEEE, 1-7.
[5] Tatiana Litvinova and Olga Litvinova. 2016. Authorship Profiling in Russian-Language Texts. In Proceedings of the 13th International Conference on Statistical Analysis of Textual Data (JADT 2016), University Nice Sophia Antipolis, Nice, 793-798.
[6] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, and Olga Litvinova. 2017. Overview of the RusProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings. CEUR-WS.org.
[7] Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. 2013. Overview of the author profiling task at PAN 2013. Notebook Papers of CLEF (2013), 23-26.
[8] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. Working Notes Papers of the CLEF (2016).
[9] Francisco Manuel Rangel Pardo, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd Author Profiling Task at PAN 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, 1-8.
[10] Aleksandr Sboev, Tatiana Litvinova, Dmitry Gudovskikh, Roman Rybka, and Ivan Moloshnikov. 2016. Machine Learning Models of Text Categorization by Author Gender Using Topic-independent Features. Procedia Computer Science 101 (2016), 135-142.
[11] Aleksandr Sboev, Tatiana Litvinova, Irina Voronina, Dmitry Gudovskikh, and Roman Rybka. 2016. Deep Learning Network Models to Categorize Texts According to Author's Gender and to Identify Text Sentiment. In Computational Science and Computational Intelligence (CSCI), 2016 International Conference on. IEEE, 1101-1106.