Representation of Target Classes for Text Classification
AMRITA_CEN_NLP@RusProfiling PAN 2017

Barathi Ganesh HB, Reshma U, Anand Kumar M and Soman KP
Center for Computational Engineering and Networking, Amrita University, Coimbatore, India
barathiganesh.hb@gmail.com, reshma.anata@gmail.com, m_anandkumar@cb.amrita.edu, soman_kp@amrita.edu

ABSTRACT
This working note describes the system we used while participating in the RusProfiling PAN 2017 shared task. The objective of the task is to identify the gender trait of an author from the author's text written in the Russian language. Treating this as a binary text classification problem, we experimented with a representation scheme for the target classes (called class vectors), computed from the texts belonging to the corresponding target classes. These class vectors are derived from the traditional representation methods available in Vector Space Models and Vector Space Models of Semantics. Following the representation, a Support Vector Machine with a linear kernel is used to perform the final classification. For this task, a genre-independent corpus was provided by the RusProfiling PAN 2017 shared task organizers. The proposed model attains almost equal performance across all the genres available in the test corpus.

CCS CONCEPTS
• Computing methodologies → Natural language processing; Feature selection;

KEYWORDS
Author Profiling, Class Representation, Russian Language, Text Representation, Text Classification, Vector Space Models, Vector Space Models of Semantics

1 INTRODUCTION
Prediction of an author's traits (gender, age, native language and personality traits) from their texts is known as author profiling, and its applications in targeted internet advertising, forensic science and consumer behaviour analysis induce researchers^1,2,3 and industry^4 to develop reliable author profiling systems. The growth of digital text shared in social media (Facebook, Twitter) provides researchers with the corpus required to develop author profiling systems, and the related shared tasks help to build state-of-the-art systems [4-6].

Unlike other text classification problems, which identify the context, here identifying the style the author uses to share the content is more relevant than identifying the context itself [1, 2]. A general text classification system stacks text representation followed by feature learning and classification. Text representation is a pivotal task whose quality is directly proportional to the performance of the system.

Considering the points discussed above, we experimented with a representation scheme for the target classes, termed in this paper class vectors. After representing the texts using the methods available in Vector Space Models (Document-Term Matrix and Term Frequency-Inverse Document Frequency) and Vector Space Models of Semantics (Document-Term Matrix with Singular Value Decomposition and Term Frequency-Inverse Document Frequency Matrix with Singular Value Decomposition), we summed the text vectors of each target class to form the class vectors. The variation between the class vectors and the text vectors is then computed through distance and correlation measures. These measures are taken as the features and fed to a Support Vector Machine with a linear kernel to make the final prediction. The experimented model is given in Figure 1.

[Figure 1: Experimented Model]

The objective of the task is detailed in Section 2; statistics about the corpus are given in Section 3; the underlying components for computing class vectors, feature learning and the classification method are explained in Section 4; cross-validation reports, results reported by the shared task organizers and observations about the results are detailed in Section 5.

1 http://pan.webis.de/clef17/pan17-web/author-profiling.html
2 http://nlp.amrita.edu:8080/INLI/Test.html
3 https://sites.google.com/site/nlisharedtask/home
4 https://personality-insights-livedemo.mybluemix.net/

2 TASK DESCRIPTION
Given the training corpus, which consists of authors' texts tagged with the authors' gender, the objective is to predict the gender for the authors' texts available in the test corpus^5. In the equations below, tag_i ∈ {male, female} and n represents the total number of authors' texts in the corpus.

    train_corpus = {(text_1, tag_1), (text_2, tag_2), ..., (text_n, tag_n)}    (1)

    test_corpus = {text_1, text_2, ..., text_n}    (2)

5 http://en.rusprofilinglab.ru/rusprofiling-at-pan/

3 CORPUS STATISTICS
For this task, the corpus has been provided by the RusProfiling PAN 2017 shared task organizers [3]. The number of authors' texts (n) in the corpus is given in Table 1. In the given corpus, tweets are taken as the authors' texts for training (Train); offline texts (picture descriptions, a letter to a friend, etc.) from the RusPersonality corpus are taken for Test1; comments from Facebook are taken for Test2; tweets from Twitter are taken for Test3; texts from online product and service reviews are taken for Test4; and texts from the gender imitation corpus (women imitating men and the other way around) are taken for Test5.

    Corpus      Tweets   Offline   Facebook   Tweets   Reviews   Imitation
    # authors   600      370       228        400      776       94

    Table 1: Corpus Statistics

4 REPRESENTATION
Text representation is the task of transforming unstructured texts into an equivalent numerical representation. Once the texts are represented, further mathematical computation can be applied to them.

4.1 Author Representation
4.1.1 Vector Space Models. The Document-Term Matrix (DTM) is a basic representation method, which counts the occurrences of the unique words present in each document [7]. A reweighting scheme is introduced on top of the DTM to handle uninformative words through the inverse document frequency, yielding the Term Frequency-Inverse Document Frequency Matrix (TF-IDF) [7]. Both methods belong to Vector Space Models (VSM) and are represented as follows,

    D^{dtm}_{n×m} = dtm(train_corpus | test_corpus)    (3)

    D^{tfidf}_{n×m} = tfidf(train_corpus | test_corpus)    (4)

In the above equations, n represents the number of authors' texts and m represents the number of unique words in the vocabulary. Unique words from the train and test corpus are used to build the common vocabulary.

4.1.2 Vector Space Models of Semantics. Singular Value Decomposition (SVD) is applied to the matrix from the VSM to get a semantic representation of the authors' texts. Applying matrix factorization on top of the matrix from the VSM is known as Vector Space Models of Semantics (VSMs) [7]. This is represented as,

    U^{dtm}_{n×n} Σ_{n×m} (V^{dtm}_{m×m})^T = svd(D^{dtm}_{n×m})    (5)

    U^{tfidf}_{n×n} Σ_{n×m} (V^{tfidf}_{m×m})^T = svd(D^{tfidf}_{n×m})    (6)

In the above equations, U represents the basis vector representation of the authors' texts and is used while computing the class vectors, Σ represents the singular values (significance of topics) in descending order, and V represents the basis vector representation of the words in the vocabulary. In this work, we have not performed dimensionality reduction.

4.2 Target Class Representation
After computing D^{dtm}_{n×m}, D^{tfidf}_{n×m}, U^{dtm}_{n×n} and U^{tfidf}_{n×n}, the class vectors are computed by summing the respective vectors of the authors' texts belonging to the classes male and female. The class vectors from the VSM are computed as follows,

    C^{male}_{1×m} = Σ_{i=1..n} vsm_representation[i, :]  if tag_i = male    (7)

    C^{female}_{1×m} = Σ_{i=1..n} vsm_representation[i, :]  if tag_i = female    (8)

    vsm_representation ∈ {D^{dtm}_{n×m}, D^{tfidf}_{n×m}}    (9)

Similarly, the class vectors from the VSMs are computed as follows,

    C^{male}_{1×n} = Σ_{i=1..n} vsms_representation[i, :]  if tag_i = male    (10)

    C^{female}_{1×n} = Σ_{i=1..n} vsms_representation[i, :]  if tag_i = female    (11)

    vsms_representation ∈ {U^{dtm}_{n×n}, U^{tfidf}_{n×n}}    (12)

4.3 Feature Learning
The variation between the classes and the authors' texts is computed by measuring the distance and correlation between the class vectors (male and female) and the vector representations of the authors' texts. We considered correlation, cosine distance and Euclidean distance for measuring the variation. This is given as follows,

    F_{n×6} = feature_learn(representation, C^{male}, C^{female})    (13)

    representation ∈ {D^{dtm}_{n×m}, D^{tfidf}_{n×m}, U^{dtm}_{n×n}, U^{tfidf}_{n×n}}    (14)

    C^{male} ∈ {C^{male}_{1×m}, C^{male}_{1×n}}    (15)

    C^{female} ∈ {C^{female}_{1×m}, C^{female}_{1×n}}    (16)

5 EXPERIMENTS AND OBSERVATIONS
From the given corpus, the authors' texts are represented as matrices as described in Section 4.1. For the DTM and TF-IDF representations, the CountVectorizer^6 and TfidfVectorizer^7 modules from the scikit-learn Python library are used. SVD is applied to these matrices and the basis matrices (U) alone are kept for further processing; the SVD is performed using the numpy Python library^8.

As given in Section 4.2, the class vectors are computed from the authors' text vectors. As given in Section 4.3, the feature matrix is computed by measuring the variation between the authors' text vectors and the class vectors; the scipy Python library is used to compute the cosine distance, Euclidean distance and correlation measures. The computed feature matrix, along with the target classes, is fed to an SVM with a linear kernel (from the scikit-learn Python library^9) to perform the final classification. In order to observe the training performance, we computed the 10-fold cross-validation score, given in Table 2 (Train CV). This is given as follows,

    Accuracy = (correctly predicted short texts) / (total # of short texts)    (17)

    Train CV = (1/10) Σ_{i=1..10} Accuracy_i    (18)

Similar to the training corpus, the feature matrix is computed for the five test corpora given by the shared task organizers, and the predictions for the authors' texts are obtained using the model built during the training period. In total we submitted five runs:

• Run1: TF-IDF -> Class Vectors -> Feature Learning -> Classification
• Run2: DTM -> Class Vectors -> Feature Learning -> Classification
• Run3: TF-IDF -> SVD -> Class Vectors -> Feature Learning -> Classification
• Run4: DTM -> SVD -> Class Vectors -> Feature Learning -> Classification
• Run5: DTM -> Classification

The results reported by the shared task organizers for the submitted five runs are given in Table 2. Out of the five runs, Run2 performed best and attained 51.45% accuracy on the concatenated test corpus. Run5 attained 47.06% accuracy on the concatenated test corpus, which shows that the class embedding enhanced the accuracy by about 4%.

    Corpus/Run   Train CV   Offline   Facebook   Tweets   Reviews   Imitation
    Run1         0.69       0.45      0.49       0.45     0.50      0.45
    Run2         1.0        0.49      0.54       0.50     0.50      0.50
    Run3         1.0        0.47      0.51       0.46     0.50      0.40
    Run4         1.0        0.50      0.52       0.50     0.52      0.50
    Run5         0.66       0.50      0.49       0.50     0.50      0.50

    Table 2: Results

6 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
7 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
8 https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.svd.html
9 http://scikit-learn.org/stable/modules/svm.html

6 CONCLUSION
The given train corpus and test corpus are represented as authors' text vectors using the methods available in Vector Space Models and Vector Space Models of Semantics. Class vectors are computed from the authors' text vectors, and the variation between the class vectors and the authors' text vectors is taken as the features for the Support Vector Machine based classification. The preliminary results showed that class-vector based classification improves the accuracy by nearly 4% on the final concatenated test corpus. There is a large performance gap between the cross-validation score and the score on the test corpus; hence our future work will focus on reducing this margin and on computing the class vectors through distributed representation methods.

REFERENCES
[1] Barathi Ganesh H. B., M. Anand Kumar, and K. P. Soman. 2016. Statistical Semantics in Context Space: Amrita_CEN@Author Profiling. (2016), 881–889.
[2] Barathi Ganesh H. B., M. Anand Kumar, and K. P. Soman. 2017. Vector Space Model as Cognitive Space for Text Classification. CoRR abs/1708.06068 (2017). http://arxiv.org/abs/1708.06068
[3] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, and Olga Litvinova. [n. d.]. Overview of the RUSProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings.
[4] Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A Report on the 2017 Native Language Identification Shared Task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 62–75.
[5] Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. Working Notes Papers of the CLEF (2017).
[6] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations. Working Notes Papers of the CLEF (2016).
[7] Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37 (2010), 141–188.
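The class-vector construction of Section 4.2 (equations (7)-(12)) is a row-wise sum of the text vectors belonging to each gender class. A minimal NumPy sketch, assuming a dense representation matrix; the function name `class_vectors` and the toy inputs are ours, not the authors':

```python
import numpy as np

def class_vectors(X, tags):
    """Sum the author-text vectors of each target class.

    X    : (n, d) matrix of text vectors (rows of D^dtm, D^tfidf or U).
    tags : length-n sequence of 'male' / 'female' labels.
    Returns the two 1 x d class vectors (male, female).
    """
    X = np.asarray(X, dtype=float)
    tags = np.asarray(tags)
    c_male = X[tags == 'male'].sum(axis=0)      # Eq. (7) / (10)
    c_female = X[tags == 'female'].sum(axis=0)  # Eq. (8) / (11)
    return c_male, c_female
```

For example, `class_vectors([[1, 2], [3, 4], [5, 6]], ['male', 'female', 'male'])` sums the first and third rows into the male class vector and keeps the second row as the female class vector.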
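The representation step (equations (3)-(6)) can be sketched with the tools the paper names in Section 5: scikit-learn's CountVectorizer/TfidfVectorizer for the VSM matrices and numpy's SVD for the semantic variant, keeping only the basis matrix U. The toy corpus below is a stand-in of ours, not the Russian task data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in corpus; the actual task uses the Russian train and test texts.
texts = ["she writes short posts",
         "he writes long reviews",
         "short posts and long reviews"]

dtm = CountVectorizer().fit_transform(texts).toarray()    # Eq. (3): D^dtm, n x m counts
tfidf = TfidfVectorizer().fit_transform(texts).toarray()  # Eq. (4): D^tfidf, n x m weights

# Eqs. (5)-(6): SVD of the VSM matrix; only U is kept, with no dimensionality reduction.
U_dtm, sigma, Vt = np.linalg.svd(dtm)  # full U is n x n, as in the paper
```

With the default `full_matrices=True`, `np.linalg.svd` returns the full n x n orthogonal basis U that Section 4.2 consumes.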
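The feature-learning step of Section 4.3 (equation (13)) measures three kinds of variation against each of the two class vectors, giving six features per text. A sketch using the scipy distance functions the paper mentions; the function name `feature_learn` mirrors equation (13), while the column ordering and toy inputs are our own assumption:

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine, euclidean

def feature_learn(X, c_male, c_female):
    """Eq. (13): build the n x 6 feature matrix.

    For each text vector, compute correlation distance, cosine distance
    and Euclidean distance to the male and female class vectors.
    """
    measures = (correlation, cosine, euclidean)
    F = [[m(row, c) for c in (c_male, c_female) for m in measures]
         for row in np.asarray(X, dtype=float)]
    return np.array(F)
```

A text vector identical to a class vector yields zero correlation, cosine and Euclidean distance to that class, so the features directly encode which class a text is closest to.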
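The final classification and the Train CV score of equations (17)-(18) correspond to a linear-kernel SVM with 10-fold cross-validation in scikit-learn, which the paper names as its classifier. The random stand-in feature matrix and labels below are ours, for illustration only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F_train = rng.normal(size=(100, 6))       # stand-in for the n x 6 feature matrix
y_train = rng.integers(0, 2, size=100)    # stand-in gender labels

clf = SVC(kernel='linear')

# Eqs. (17)-(18): Train CV is the mean accuracy over 10 folds.
train_cv = cross_val_score(clf, F_train, y_train, cv=10, scoring='accuracy').mean()

clf.fit(F_train, y_train)                 # fit on the full training features
predictions = clf.predict(F_train)        # at test time, predict on F_test instead
```

`cross_val_score` with `cv=10` performs the 10-fold split and returns the per-fold accuracies, whose mean matches equation (18).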