Representation of Target Classes for Text Classification
AMRITA_CEN_NLP@RusProfiling PAN 2017

Barathi Ganesh HB, Reshma U, Anand Kumar M and Soman KP
Center for Computational Engineering and Networking, Amrita University, Coimbatore, India
barathiganesh.hb@gmail.com, reshma.anata@gmail.com, m_anandkumar@cb.amrita.edu, soman_kp@amrita.edu

ABSTRACT
This working note describes the system we used while participating in the RusProfiling PAN 2017 shared task. The objective of the task is to identify the gender trait of an author from the author's text written in the Russian language. Treating this as a binary text classification problem, we experimented with a representation scheme for the target classes (called class vectors), computed from the texts belonging to the corresponding target classes. These class vectors are derived from the traditional representation methods available in Vector Space Models and Vector Space Models of Semantics. Following the representation, a Support Vector Machine with a linear kernel is used to perform the final classification. For this task, a genre-independent corpus was provided by the RusProfiling PAN 2017 shared task organizers. The proposed model attains almost equal performance across all the genres available in the test corpus.

CCS CONCEPTS
• Computing methodologies → Natural language processing; Feature selection;

KEYWORDS
Author Profiling, Class Representation, Russian Language, Text Representation, Text Classification, Vector Space Models, Vector Space Models of Semantics

1 INTRODUCTION
Prediction of an author's traits (gender, age, native language and personality traits) from their texts is known as author profiling, and its applications in targeted internet advertising, forensic science and consumer behaviour analysis induce researchers^1,2,3 and industry^4 to develop reliable author profiling systems. The growth of digital text shared in social media (Facebook, Twitter) provides researchers with the corpus required to develop author profiling systems, and the related shared tasks help to build state-of-the-art systems [4-6].

Unlike other text classification problems, which identify the context, here identifying the style the author uses to share the content is more relevant than identifying the context itself [1, 2]. A general text classification system stacks text representation followed by feature learning and classification. Text representation is a pivotal task whose quality is directly proportional to the performance of the system.

Considering the points discussed above, we experimented with a representation scheme for the target classes, termed in this paper class vectors. After representing the texts using the methods available in Vector Space Models (Document-Term Matrix and Term Frequency-Inverse Document Frequency) and Vector Space Models of Semantics (Document-Term Matrix with Singular Value Decomposition and Term Frequency-Inverse Document Frequency Matrix with Singular Value Decomposition), we summed the text vectors of each target class to form the class vectors. The variation between the class vectors and the text vectors is then computed through distance and correlation measures. These measures are taken as the features and fed to a Support Vector Machine with a linear kernel to make the final prediction. The experimented model is given in Figure 1.

[Figure 1: Experimented Model]

The objective of the task is detailed in Section 2; statistics about the corpus are given in Section 3; the underlying components for computing class vectors, feature learning and the classification method are explained in Section 4; cross-validation reports, results reported by the shared task organizers and observations about the results are detailed in Section 5.

1 http://pan.webis.de/clef17/pan17-web/author-profiling.html
2 http://nlp.amrita.edu:8080/INLI/Test.html
3 https://sites.google.com/site/nlisharedtask/home
4 https://personality-insights-livedemo.mybluemix.net/

2 TASK DESCRIPTION
Given the training corpus, which consists of authors' texts tagged with the authors' gender, the objective is to predict the gender for the authors' texts available in the test corpus^5. In the equations below, tag_i ∈ {male, female} and n represents the total number of authors' texts in the corpus.

    train_corpus = {(text_1, tag_1), (text_2, tag_2), ..., (text_n, tag_n)}    (1)

    test_corpus = {text_1, text_2, ..., text_n}    (2)

5 http://en.rusprofilinglab.ru/rusprofiling-at-pan/

3 CORPUS STATISTICS
For this task, the corpus has been provided by the RusProfiling PAN 2017 shared task organizers [3]. The number of authors' texts (n) in the corpus is given in Table 1. In the given corpus, tweets are taken as the authors' texts for training (Train); offline texts (picture descriptions, a letter to a friend, etc.) from the RusPersonality corpus are taken for Test1; comments from Facebook are taken for Test2; tweets from Twitter are taken for Test3; texts from online product and service reviews are taken for Test4; and texts from the gender imitation corpus (women imitating men and the other way around) are taken for Test5.

    Corpus      Tweets   Offline   Facebook   Tweets   Reviews   Imitation
    # authors   600      370       228        400      776       94

    Table 1: Corpus Statistics

4 REPRESENTATION
Text representation is the task of transforming unstructured texts into an equivalent numerical representation. Once the texts are represented, further mathematical computation can be applied to them.

4.1 Author Representation
4.1.1 Vector Space Models. The Document-Term Matrix (DTM) is a basic representation method, which counts the occurrences of the unique words present in each document [7]. A reweighting scheme is introduced on top of the DTM to handle uninformative words through the inverse document frequency, yielding the Term Frequency-Inverse Document Frequency Matrix (TF-IDF) [7]. Both methods belong to Vector Space Models (VSM) and are represented as follows,

    D^{dtm}_{n×m} = dtm(train_corpus | test_corpus)    (3)

    D^{tfidf}_{n×m} = tfidf(train_corpus | test_corpus)    (4)

In the above equations, n represents the number of authors' texts and m represents the number of unique words in the vocabulary. Unique words from the train and test corpus are used to build the common vocabulary.

4.1.2 Vector Space Models of Semantics. Singular Value Decomposition (SVD) is applied to the matrix from the VSM to get a semantic representation of the authors' texts. Applying matrix factorization on top of the matrix from the VSM is known as Vector Space Models of Semantics (VSMs) [7]. This is represented as,

    U^{dtm}_{n×n} Σ_{n×m} (V^{dtm}_{m×m})^T = svd(D^{dtm}_{n×m})    (5)

    U^{tfidf}_{n×n} Σ_{n×m} (V^{tfidf}_{m×m})^T = svd(D^{tfidf}_{n×m})    (6)

In the above equations, U represents the basis vector representation of the authors' texts and is used while computing the class vectors, Σ represents the singular values (significance of topics) in descending order, and V represents the basis vector representation of the words in the vocabulary. In this work, we have not performed dimensionality reduction.

4.2 Target Class Representation
After computing D^{dtm}_{n×m}, D^{tfidf}_{n×m}, U^{dtm}_{n×n} and U^{tfidf}_{n×n}, the class vectors are computed by summing the respective vectors of the authors' texts belonging to the classes male and female. The class vectors from the VSM are computed as follows,

    C^{male}_{1×m} = Σ_{i=1..n} vsm_representation[i, :]  if tag_i = male    (7)

    C^{female}_{1×m} = Σ_{i=1..n} vsm_representation[i, :]  if tag_i = female    (8)

    vsm_representation ∈ {D^{dtm}_{n×m}, D^{tfidf}_{n×m}}    (9)

Similarly, the class vectors from the VSMs are computed as follows,

    C^{male}_{1×n} = Σ_{i=1..n} vsms_representation[i, :]  if tag_i = male    (10)

    C^{female}_{1×n} = Σ_{i=1..n} vsms_representation[i, :]  if tag_i = female    (11)

    vsms_representation ∈ {U^{dtm}_{n×n}, U^{tfidf}_{n×n}}    (12)

4.3 Feature Learning
The variation between the classes and the authors' texts is computed by measuring the distance and correlation between the class vectors (male and female) and the vector representations of the authors' texts. We considered correlation, cosine distance and Euclidean distance for measuring the variation. This is given as follows,

    F_{n×6} = feature_learn(representation, C^{male}, C^{female})    (13)

    representation ∈ {D^{dtm}_{n×m}, D^{tfidf}_{n×m}, U^{dtm}_{n×n}, U^{tfidf}_{n×n}}    (14)

    C^{male} ∈ {C^{male}_{1×m}, C^{male}_{1×n}}    (15)

    C^{female} ∈ {C^{female}_{1×m}, C^{female}_{1×n}}    (16)

5 EXPERIMENTS AND OBSERVATIONS
From the given corpus, the authors' texts are represented as matrices as described in Section 4.1. For the DTM and TF-IDF representations, the CountVectorizer^6 and TfidfVectorizer^7 modules from the scikit-learn Python library are used. SVD is applied to these matrices and the basis matrices (U) alone are kept for further processing; the SVD is performed using the numpy Python library^8.

As given in Section 4.2, the class vectors are computed from the authors' text vectors. As given in Section 4.3, the feature matrix is computed by measuring the variation between the authors' text vectors and the class vectors; the scipy Python library is used to compute the cosine distance, Euclidean distance and correlation measures. The computed feature matrix, along with the target classes, is fed to an SVM with a linear kernel (from the scikit-learn Python library^9) to perform the final classification. In order to observe the training performance, we computed the 10-fold cross-validation score, given in Table 2 (Train CV). This is given as follows,

    Accuracy = (correctly predicted short texts) / (total # of short texts)    (17)

    Train CV = (1/10) Σ_{i=1..10} Accuracy_i    (18)

Similar to the training corpus, the feature matrix is computed for the five test corpora given by the shared task organizers, and the predictions for the authors' texts are obtained using the model built during the training period. In total we submitted five runs:

• Run1: TF-IDF -> Class Vectors -> Feature Learning -> Classification
• Run2: DTM -> Class Vectors -> Feature Learning -> Classification
• Run3: TF-IDF -> SVD -> Class Vectors -> Feature Learning -> Classification
• Run4: DTM -> SVD -> Class Vectors -> Feature Learning -> Classification
• Run5: DTM -> Classification

The results reported by the shared task organizers for the submitted five runs are given in Table 2. Out of the five runs, Run2 performed best and attained 51.45% accuracy on the concatenated test corpus. Run5 attained 47.06% accuracy on the concatenated test corpus, which shows that the class embedding enhanced the accuracy by about 4%.

    Corpus/Run   Train CV   Offline   Facebook   Tweets   Reviews   Imitation
    Run1         0.69       0.45      0.49       0.45     0.50      0.45
    Run2         1.0        0.49      0.54       0.50     0.50      0.50
    Run3         1.0        0.47      0.51       0.46     0.50      0.40
    Run4         1.0        0.50      0.52       0.50     0.52      0.50
    Run5         0.66       0.50      0.49       0.50     0.50      0.50

    Table 2: Results

6 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
7 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
8 https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.svd.html
9 http://scikit-learn.org/stable/modules/svm.html

6 CONCLUSION
The given train corpus and test corpus are represented as authors' text vectors using the methods available in Vector Space Models and Vector Space Models of Semantics. Class vectors are computed from the authors' text vectors, and the variation between the class vectors and the authors' text vectors is taken as the features for the Support Vector Machine based classification. The preliminary results showed that class-vector based classification improves the accuracy by nearly 4% on the final concatenated test corpus. There is a large performance gap between the cross-validation score and the score on the test corpus; hence our future work will focus on reducing this margin and on computing the class vectors through distributed representation methods.

REFERENCES
[1] Barathi Ganesh H. B., M. Anand Kumar, and K. P. Soman. 2016. Statistical Semantics in Context Space: Amrita_CEN@Author Profiling. (2016), 881–889.
[2] Barathi Ganesh H. B., M. Anand Kumar, and K. P. Soman. 2017. Vector Space Model as Cognitive Space for Text Classification. CoRR abs/1708.06068 (2017). http://arxiv.org/abs/1708.06068
[3] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, and Olga Litvinova. [n. d.]. Overview of the RUSProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings.
[4] Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A Report on the 2017 Native Language Identification Shared Task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 62–75.
[5] Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. Working Notes Papers of the CLEF (2017).
[6] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations. Working Notes Papers of the CLEF (2016).
[7] Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37 (2010), 141–188.
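The class-vector construction of Section 4.2 (equations (7)-(12)) is a row-wise sum of the text vectors belonging to each gender class. A minimal NumPy sketch, assuming a dense representation matrix; the function name `class_vectors` and the toy inputs are ours, not the authors':

```python
import numpy as np

def class_vectors(X, tags):
    """Sum the author-text vectors of each target class.

    X    : (n, d) matrix of text vectors (rows of D^dtm, D^tfidf or U).
    tags : length-n sequence of 'male' / 'female' labels.
    Returns the two 1 x d class vectors (male, female).
    """
    X = np.asarray(X, dtype=float)
    tags = np.asarray(tags)
    c_male = X[tags == 'male'].sum(axis=0)      # Eq. (7) / (10)
    c_female = X[tags == 'female'].sum(axis=0)  # Eq. (8) / (11)
    return c_male, c_female
```

For example, `class_vectors([[1, 2], [3, 4], [5, 6]], ['male', 'female', 'male'])` sums the first and third rows into the male class vector and keeps the second row as the female class vector.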
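The representation step (equations (3)-(6)) can be sketched with the tools the paper names in Section 5: scikit-learn's CountVectorizer/TfidfVectorizer for the VSM matrices and numpy's SVD for the semantic variant, keeping only the basis matrix U. The toy corpus below is a stand-in of ours, not the Russian task data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in corpus; the actual task uses the Russian train and test texts.
texts = ["she writes short posts",
         "he writes long reviews",
         "short posts and long reviews"]

dtm = CountVectorizer().fit_transform(texts).toarray()    # Eq. (3): D^dtm, n x m counts
tfidf = TfidfVectorizer().fit_transform(texts).toarray()  # Eq. (4): D^tfidf, n x m weights

# Eqs. (5)-(6): SVD of the VSM matrix; only U is kept, with no dimensionality reduction.
U_dtm, sigma, Vt = np.linalg.svd(dtm)  # full U is n x n, as in the paper
```

With the default `full_matrices=True`, `np.linalg.svd` returns the full n x n orthogonal basis U that Section 4.2 consumes.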
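The feature-learning step of Section 4.3 (equation (13)) measures three kinds of variation against each of the two class vectors, giving six features per text. A sketch using the scipy distance functions the paper mentions; the function name `feature_learn` mirrors equation (13), while the column ordering and toy inputs are our own assumption:

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine, euclidean

def feature_learn(X, c_male, c_female):
    """Eq. (13): build the n x 6 feature matrix.

    For each text vector, compute correlation distance, cosine distance
    and Euclidean distance to the male and female class vectors.
    """
    measures = (correlation, cosine, euclidean)
    F = [[m(row, c) for c in (c_male, c_female) for m in measures]
         for row in np.asarray(X, dtype=float)]
    return np.array(F)
```

A text vector identical to a class vector yields zero correlation, cosine and Euclidean distance to that class, so the features directly encode which class a text is closest to.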
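The final classification and the Train CV score of equations (17)-(18) correspond to a linear-kernel SVM with 10-fold cross-validation in scikit-learn, which the paper names as its classifier. The random stand-in feature matrix and labels below are ours, for illustration only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F_train = rng.normal(size=(100, 6))       # stand-in for the n x 6 feature matrix
y_train = rng.integers(0, 2, size=100)    # stand-in gender labels

clf = SVC(kernel='linear')

# Eqs. (17)-(18): Train CV is the mean accuracy over 10 folds.
train_cv = cross_val_score(clf, F_train, y_train, cv=10, scoring='accuracy').mean()

clf.fit(F_train, y_train)                 # fit on the full training features
predictions = clf.predict(F_train)        # at test time, predict on F_test instead
```

`cross_val_score` with `cv=10` performs the 10-fold split and returns the per-fold accuracies, whose mean matches equation (18).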