=Paper=
{{Paper
|id=Vol-2036/T1-2
|storemode=property
|title=AmritaNLP@PAN-RusProfiling : Author Profiling using Machine Learning Techniques
|pdfUrl=https://ceur-ws.org/Vol-2036/T1-2.pdf
|volume=Vol-2036
|authors=Vivek Vinayan,Naveen J R,Harikrishnan NB,Anand Kumar M,Soman KP
|dblpUrl=https://dblp.org/rec/conf/fire/VinayanRBMP17
}}
==AmritaNLP@PAN-RusProfiling : Author Profiling using Machine Learning Techniques==
Vivek Vinayan, Naveen J R, Harikrishnan NB, Anand Kumar M and Soman KP
Center for Computational Engineering and Networking, Amrita University, Coimbatore, India
vivekvinayan82@gmail.com, naveenaksharam@gmail.com, harikrishnannb07@gmail.com, m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
This paper describes our work on the "Gender Identification in Russian texts (RusProfiling)" shared task, hosted by PAN in conjunction with FIRE 2017. The task is to predict an author's gender from a Russian-language Twitter corpus.
We give a brief introduction to the task, describe the dataset provided by the competition organizers, discuss various feature selection methods, present the experimental analysis we followed for feature representation, and compare the outcomes of the different classifiers we used for validation.
We submitted a total of 3 models, each with a slightly different pre-processing technique chosen according to the content of the test corpus, together with their predictions for each test dataset. As the test corpora were sourced from several different platforms, it was challenging to stick to a single representation.
As per the global ranking published for the shared task [6], our team secured 2nd position overall (concatenating all datasets), and our 3rd submission performed the best among our 3 submitted models on the overall test corpus.
Under extended work we briefly discuss how hyper-parameter tuning of certain attributes raised our validation accuracy by 6% over the baseline.

KEYWORDS
Author Profiling, Russian Language, Text Classification, Semi-supervised Classifiers

ACM Reference Format:
Vivek Vinayan, Naveen J R, Harikrishnan NB, Anand Kumar M and Soman KP. 2017. AmritaNLP@PAN-RusProfiling: Author Profiling using Machine Learning Techniques.

1 INTRODUCTION
The Internet is a vast platform where anyone can access myriads of information, from online news articles to various social media platforms, from personal blogs to personalized websites, all of it literally at our fingertips; in the present age, life is becoming unimaginable without it. With the availability of all these resources, people write and share information over the internet more avidly than ever before, and it also provides a certain degree of anonymity while doing so. Access to such multitudinous information brings a certain set of problems, such as theft of identity or content and plagiarism, which we try to address with tasks such as "Author Profiling".
In the "Author Profiling" shared task [9], we examine the style of an individual author and distinguish between classes of authors by studying their sociolect aspects. More broadly, this helps predict an author's demographics, personality, education and social networks by classifying texts into classes based on the stylistic choices of the author.
With this paper on the RusProfiling shared task, we focus on cross-genre gender identification in Russian texts [6], one of the trending emerging tasks in the NLP domain under "Author Profiling" [7, 8].
In this task the training corpus comes from Twitter, while the test corpus is drawn from multiple social media platforms: Twitter, Facebook, offline texts (texts describing images, or letters to a friend), and online reviews of products and services. The focus of the task is gender profiling in social media; the main interest is in everyday language and in how basic social and personality traits are reflected in writing [3, 5].
The main challenge in this task is the language itself, as it is not native to us. We therefore applied certain pre-processing methods and built a baseline representation, on top of which we implemented classical machine learning algorithms for this text classification task.
2 CORPUS
The training corpus was sourced mainly from the social media website Twitter, and each document was annotated with the author's gender, "male" or "female". The training corpus is a collection of 600 data files in XML format, consisting of exactly half female and half male documents; the file names are mapped to their associated gender labels in a separate plain-text file called "truth".

Table 1: Training Dataset Statistics
Total number of documents           600
Total number of male documents      300
Total number of female documents    300

A cursory analysis of the training corpus revealed that each training file contains a combination of different tags and hyperlinks. Further, the documents varied widely in content-word count, from one document with no text at all to others with over 3000 words in a single document. A few files mixed Russian and English, while a few others were entirely in English.
The test corpus is presented in 5 folders, one per source category. Each set contains a different number of files, varying from 96 to 776 per category. On further inspection, the text format in every folder except the 3rd differs from the training corpus; the folders contain offline texts, Facebook, Twitter, product and online reviews, and a gender imitation corpus, in order of folder number, as shown in Tables 1-2.

Table 2: Testing Dataset Statistics
DS1 - Offline Texts (picture description etc.)    370
DS2 - Facebook                                    228
DS3 - Twitter                                     400
DS4 - Online Reviews                              776
DS5 - Gender Imitation Corpus                      94
Total number of documents                        1868

We also collected statistics on the complete vocabulary size obtained from a grid search over the attributes n-gram range and min_df count. The vocabulary size for each combination, on the training dataset and on each test set, is tabulated in Table 3.

Table 3: Vocabulary Size based on min_df and n-gram range
(min_df, n-gram)  Training   DS1       DS2      DS3      DS4      DS5
(1,1)             183119     96544     52098    16077    9184     6797
(1,2)             852380     390683    234284   80991    39732    25139
(1,3)             1698948    746445    451658   164853   75790    46435
(1,4)             2583433    1114714   672974   252602   111942   68167
(2,1)             45223      22482     17278    5989     3438     2151
(2,2)             91646      38869     29643    13529    6528     3562
(2,3)             106192     43324     31965    16190    7291     3898
(2,4)             109234     44206     32232    17074    7565     3987
(3,1)             28365      13188     9973     3801     2190     1233
(3,2)             106192     19702     15086    7202     3381     1767
(3,3)             52940      20923     15708    8098     3526     1864
(3,4)             53646      21065     15745    8381     3545     1886
(4,1)             20677      9228      6910     2805     1594     821
(4,2)             32403      12953     9931     4903     2305     1101
(4,3)             34876      13551     10226    5380     2376     1139
(4,4)             35184      13604     10237    5512     2383     1145
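For concreteness, a corpus in the format described above can be loaded along the following lines. This is a minimal sketch only: the truth-file delimiter (':::'), the XML layout, and the function and path names are illustrative assumptions, since the exact organizer-provided formats are not reproduced here.

```python
import os
import xml.etree.ElementTree as ET

def load_corpus(corpus_dir, truth_path):
    """Read the XML training files and their gender labels (sketch)."""
    # Assumed truth format: one "<file_id>:::<gender>" pair per line;
    # adjust the separator to whatever the organizers actually used.
    labels = {}
    with open(truth_path, encoding="utf-8") as fh:
        for line in fh:
            file_id, gender = line.strip().split(":::")
            labels[file_id] = gender

    texts, genders = [], []
    for name in sorted(os.listdir(corpus_dir)):
        if not name.endswith(".xml"):
            continue
        file_id = name[: -len(".xml")]
        tree = ET.parse(os.path.join(corpus_dir, name))
        # Concatenate all text nodes; some documents are empty,
        # others exceed 3000 words (see Section 2).
        text = " ".join(t.strip() for t in tree.getroot().itertext() if t.strip())
        texts.append(text)
        genders.append(labels[file_id])
    return texts, genders
```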
3 METHODOLOGY
Figure 1 gives a rudimentary picture of the architecture we implemented for our 3 submissions. In all of these models we mainly focused on data pre-processing methods to incorporate various features, building on each one in turn to improve the feature representation. We started from a simple count-based model; the methods are discussed next.

[Figure 1: Architecture of our model for the shared task]

3.1 Feature Selection
Feature selection was a process in which we started by building a baseline model [1] and improved on its accuracy with a step-by-step empirical procedure of combining and modifying the existing feature representation [10]. The steps are listed below; a code sketch of the normalization follows the list.

- Count Based Matrix: Our 1st approach was to form simple count-based Term Document (TD) and Term Frequency Inverse Document Frequency (TFIDF) matrices from the dataset, which became the baseline for our accuracy; we then proceeded to add general features to this representation.

- Feature Extraction: With knowledge of the social media network Twitter, we narrowed our search for features down to tags: '@', which is mainly used to address people or groups, and the hash tag '#', which relates to the context or image of the adjoining content. Moving on, we found that URLs linking to various internet sources were used widely across most of the dataset, so we incorporated these as a feature in the earlier representation, which showed a slight improvement on all of the classification algorithms, as captured in Figures 3-4.

- Data Normalization: On further analysis we found that individual URLs in themselves seemed fruitless as features, so, considering only the presence of a hyperlink, we normalized them across the dataset and used the count of URLs and of tags as features to represent a document. This increased the accuracy a little more. It further led us to normalize various emoticons, each replaced by a keyword, and various punctuation marks such as the exclamation mark '!', period '.' and hyphen '-': any mark occurring multiple times or in continuous repetition was converted to a single instance.

- Word Average: As we were not familiar with the language, we computed the average word length per document, i.e. the total number of characters in the document divided by the total number of feature instances in it, and appended it as an independent feature. This accommodates the fact that the average vocabulary word length used by each gender can also serve as a discriminative feature between the 2 classes.
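To make the normalization steps above concrete, here is a minimal sketch of the kind of pre-processing described; the exact regular expressions and the emoticon keyword are our illustrative assumptions, not the paper's implementation. The TFIDF representation is built with scikit-learn's TfidfVectorizer, whose min_df and ngram_range parameters correspond to the attributes tabulated in Table 3.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize(text):
    """Apply the normalization steps of Section 3.1 (illustrative)."""
    text = text.lower()
    # Collapse every URL to the single keyword 'https'.
    text = re.sub(r"https?://\S+", "https", text)
    # Keep '@' and '#' as separate tokens so their counts survive.
    text = re.sub(r"[@#]", lambda m: " " + m.group(0) + " ", text)
    # Replace a few emoticons with a keyword; this list is only a
    # sample, not the one used in the paper.
    text = re.sub(r"[:;]-?[)(D]", " emoticon ", text)
    # Reduce runs of '!', '.' and '-' to a single instance.
    text = re.sub(r"([!.\-])\1+", r"\1", text)
    return text

def avg_word_length(text):
    """Characters per feature instance: the 'Word Average' feature."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

# TFIDF matrix over the normalized corpus; (1, 2) and 3 are sample
# values for the attributes explored in Table 3.
vectorizer = TfidfVectorizer(preprocessor=normalize,
                             ngram_range=(1, 2), min_df=3)
```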
4 EXPERIMENT AND DISCUSSIONS
As part of the experimental analysis we manually went over a few random training documents, chosen by file size, to get a glimpse of the overall variation in the data, and then ran snippets over these training data to gauge how much each of the parameters considered above improved accuracy, so as to identify the better feature representation for classification. After going through these transitions, the selected features were extracted and applied as pre-processing over the entire training corpus. The features were added one set at a time to show the increase or decrease in cross-validation accuracy under different classical ML classifiers, namely Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, Decision Tree (DT), Adaboost and Random Forest (RF) [2]; the results are displayed in Table 4.

Table 4: Cross-validation Result with Different Classifiers (accuracy, %)
SN  Matrix  LR     SVM (linear)  DT     Adaboost  RF
1   TD      63.33  79.66         74.00  83.00     82.66
1   TFIDF   70.33  72.50         70.00  83.16     81.83
2   TD      66.70  81.83         75.00  85.16     84.16
2   TFIDF   72.16  75.83         75.33  80.83     81.66
3   TD      61.70  81.83         74.83  83.99     84.99
3   TFIDF   72.80  78.00         68.16  82.49     80.83
4   TD      66.70  81.00         74.10  85.66     82.99
4   TFIDF   74.00  78.00         68.00  81.66     82.83
5   TD      70.00  79.83         74.49  85.33     83.66
5   TFIDF   74.00  77.50         67.16  82.83     81.66

[Figure 2: Pre-processing example, showing a raw Russian training tweet (with '@' mentions, URLs, emoticons and repeated punctuation) and the same text after normalization]
[Figure 3: TD classifier accuracy]
[Figure 4: TFIDF classifier accuracy]

The following are the features we considered, one step at a time, each added consecutively to the previous set (a sketch of the corresponding cross-validation follows the list):
(1) A simple count-based matrix is taken to achieve the baseline accuracy. We considered both TD and TFIDF matrix representations, from which we set a baseline accuracy of 80.5% (we randomly initialized a few attributes, n-gram range = 2 and min_df = 3, and used a linear SVM classifier to obtain the baseline).
(2) Counts of 'http' and 'https' are taken and converted to the single keyword 'https'. This feature captures URL usage across the 2 classes, distinguishing which gender tends to use more hyperlinks within their tweets.
(3) The count of '#' tags was further added to the previous representation.
(4) Emoticons were replaced with a keyword.
(5) The average word length per document, i.e. the count of characters divided by the number of feature instances, was taken; given the unfamiliar language, we chose this as a preferred method.
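The cross-validation behind Table 4 can be reproduced in outline as follows. This is a minimal sketch using scikit-learn's stock implementations of the five classifiers with library-default hyper-parameters, which are not necessarily those used in the paper; `texts`, `genders` and `normalize` are the sketches introduced earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "LR": LogisticRegression(),
    "SVM linear": LinearSVC(),
    "DT": DecisionTreeClassifier(),
    "Adaboost": AdaBoostClassifier(),
    "RF": RandomForestClassifier(),
}

# Evaluate both feature representations, TD (raw counts) and TFIDF,
# with 5-fold cross-validation, mirroring the layout of Table 4.
for name, matrix in [("TD", CountVectorizer(preprocessor=normalize)),
                     ("TFIDF", TfidfVectorizer(preprocessor=normalize))]:
    X = matrix.fit_transform(texts)
    for clf_name, clf in classifiers.items():
        scores = cross_val_score(clf, X, genders, cv=5)
        print(name, clf_name, round(100 * scores.mean(), 2))
```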
5 FEATURE REPRESENTATION MODELS
We submitted a total of 3 models/runs; for each individual run the following pre-processing was followed:

- Submission 1: We considered feature representations 2, 3, 4 and 5 above, plus the normalization of '@' followed by content tags to a simple keyword (splitting tags from their context so as to preserve the word content did not, in particular, show much difference in validation accuracy), and used an SVM classifier for classification. Predictions for the test corpora were taken from the model learned on the training corpus.

- Submission 2: The same feature representation as the 1st run was considered, but with a different classifier: we trained an Adaboost-based model and took its predictions for the test corpora.

- Submission 3: In this run we mostly considered the other test datasets 1, 2, 4 and 5, whose content is longer and in paragraph form rather than the shorter tweet form, with little to no use of tags or hyperlinks. To normalize for this we disregarded the tags used above and reduced any extended repetition of punctuation to a single instance (e.g. '...' is shortened to a single '.'). A sample of this is shown in Figure 2.

6 RESULTS
As per the global ranking published by the shared task organizers [6], our team secured 2nd position overall (concatenating all datasets). From the rankings, our 3rd submission performed the best of our 3 submissions, beating our previous 2 by margins of 1% and 2% respectively, whereas we trailed the leading team by a margin of 6%. This is in line with the choices described for submission 3, and we also obtained better validation accuracy with the submission 3 model on datasets 1, 2, 4 and 5.
Individually, submission 3 gained the 2nd best accuracy on the "offline texts (picture descriptions, letter to a friend etc.) from RusProfiling corpus", while submission 1 gained our team 2nd place on the "gender imitation corpus" and 3rd on the "product and service online reviews corpus".

7 EXTENDED WORK
In our earlier experiments we randomly initialized our attributes, max feature length, n-gram and min_df, to 10000, 2 and 3 respectively. To increase validation accuracy we performed a grid search over the hyper-parameters word count, n-gram and min_df with the TFIDF model, considering the following range of values for each (a grid-search sketch follows below):

Word count: 10000 - 50000
n-gram range: 2 - 6
min_df: 1 - 4

After applying the grid search we pushed the baseline accuracy to 82.5%, initializing max_features to 10,000, n-gram to 2 and min_df to 1 and applying a linear SVM classifier. We further pushed the validation accuracy to 86.49% by applying an Adaboost classifier.
Overall we found that the accuracy of the TD feature representation model decreased as all of the attributes increased, while the accuracy of the TFIDF feature representation model increased but saturated once the n-gram value exceeded 6 and the min_df value exceeded 4. The same is shown in Figure 5, where the best attribute and feature combinations were taken for each TD and TFIDF representation.

[Figure 5: Hyper-parameter tuning for feature representation]
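A grid search of this kind can be expressed with scikit-learn's Pipeline and GridSearchCV. The sketch below uses the ranges quoted above; the step sizes within each range, and the interpretation of "n-gram range 2-6" as (1, n) spans, are our assumptions, as the paper does not state them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TFIDF features feeding a linear SVM, as in the baseline setup.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalize)),
    ("svm", LinearSVC()),
])

# Ranges from Section 7; step sizes are assumptions.
param_grid = {
    "tfidf__max_features": [10000, 20000, 30000, 40000, 50000],
    "tfidf__ngram_range": [(1, n) for n in range(2, 7)],
    "tfidf__min_df": [1, 2, 3, 4],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(texts, genders)
print(search.best_params_, round(100 * search.best_score_, 2))
```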
8 CONCLUSION & FUTURE WORK
The challenge we faced in this shared task was that we were working on a language corpus non-native to us; we therefore focused mainly on pre-processing and normalizing the data corpus to obtain an improved feature representation. We built up from a basic count representation, incorporated simple modifications while iterating the feature representation, and observed the accuracy changes associated with those features. Based on the experimental analysis, and on the further discussion of attribute optimization in the extended work, we infer that the baseline can be raised further, which could improve the predictions and yield a better gender identification model.
As a future study we are considering various embedded representations for the Russian corpus and deep learning techniques for categorizing author gender [11]. As these methods require a larger number of training instances, we are considering including an additional corpus provided by PAN [4] for this task, as well as certain portions of the labelled test dataset, chosen according to the variety of sources they are taken from.

REFERENCES
[1] H.B. Barathi Ganesh, M. Anand Kumar, and K.P. Soman. 2016. Statistical semantics in context space: Amrita_CEN@Author Profiling. CEUR Workshop Proceedings 1609 (2016), 881-889.
[2] H.B. Barathi Ganesh, U. Reshma, and M. Anand Kumar. 2015. Author identification based on word distribution in word space. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2015), 1519-1523. https://doi.org/10.1109/ICACCI.2015.7275828
[3] Fabio Celli, Bruno Lepri, Joan-Isaac Biel, Daniel Gatica-Perez, Giuseppe Riccardi, and Fabio Pianesi. 2014. The Workshop on Computational Personality Recognition. (2014).
[4] Tatiana Litvinova, Olga Litvinova, Olga Zagorovskaya, Pavel Seredin, Aleksandr Sboev, and Olga Romanchenko. 2016. "RusPersonality": A Russian corpus for authorship profiling and deception detection. In Intelligence, Social Media and Web (ISMW FRUCT), 2016 International FRUCT Conference on. IEEE, 1-7.
[5] Tatiana Litvinova and Olga Litvinova. 2016. Authorship Profiling in Russian-Language Texts. In Proceedings of the 13th International Conference on Statistical Analysis of Textual Data (JADT 2016), University Nice Sophia Antipolis, Nice, 793-798.
[6] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, and Olga Litvinova. 2017. Overview of the RusProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings. CEUR-WS.org.
[7] Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. 2013. Overview of the author profiling task at PAN 2013. Notebook Papers of CLEF (2013), 23-26.
[8] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. Working Notes Papers of the CLEF (2016).
[9] Francisco Manuel Rangel Pardo, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd Author Profiling Task at PAN 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, 1-8.
[10] Aleksandr Sboev, Tatiana Litvinova, Dmitry Gudovskikh, Roman Rybka, and Ivan Moloshnikov. 2016. Machine Learning Models of Text Categorization by Author Gender Using Topic-independent Features. Procedia Computer Science 101 (2016), 135-142.
[11] Aleksandr Sboev, Tatiana Litvinova, Irina Voronina, Dmitry Gudovskikh, and Roman Rybka. 2016. Deep Learning Network Models to Categorize Texts According to Author's Gender and to Identify Text Sentiment. In Computational Science and Computational Intelligence (CSCI), 2016 International Conference on. IEEE, 1101-1106.