=Paper=
{{Paper
|id=Vol-2036/T1-3
|storemode=property
|title=Gender Identification in Russian Texts
|pdfUrl=https://ceur-ws.org/Vol-2036/T1-3.pdf
|volume=Vol-2036
|authors=Rupal Bhargava,Gunjan Goel,Anjali Shah,Yashvardhan Sharma
|dblpUrl=https://dblp.org/rec/conf/fire/BhargavaGSS17
}}
==Gender Identification in Russian Texts==
Rupal Bhargava, Gunjan Goel, Anjali Shah, Yashvardhan Sharma

WiSoc Lab, Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, India 333031

Email: {rupal.bhargava, h2016068, h2016066, yash}@pilani.bits-pilani.ac.in

Abstract: Gender identification is the task of identifying the gender of the author of a written text. A hybrid approach has been designed for Russian texts by combining a deep neural network with a rule-based classifier. LSTM and Bi-LSTM have been used as the neural network because of their capability to learn long-term dependencies.

Index Terms: Author Profiling, Deep Learning, NLP, Gender Identification, Rule-based Classification

===I. Introduction===
The last few years have seen a great deal of research on automatically retrieving information from text, mainly information about its author (author profiling) such as gender and age. The automatic extraction of gender-related information from text is essential to forensics, security, and marketing. For example, companies may want to learn the gender of the people who like or dislike their products, which can then be analyzed to find out which section of the market dislikes those products. This helps a company improve its sales.

Most studies on classifying the gender of authors have been conducted on English texts such as blogs and Twitter posts, and very few have dealt with other languages, especially Slavic languages like Russian. The RusProfiling shared task [1], organized by PAN at FIRE 2017, addresses cross-genre gender identification in Russian texts. The objective of this paper is to explore the possibility of automatically classifying Russian written texts according to their authors' gender using parameters that are rather context-independent.

Let us briefly describe the text corpora provided by PAN for evaluation. The corpus was divided into five subgroups. The first group contains offline texts (picture descriptions, letters to a friend, etc.) from the RusPersonality corpus. The second contains posts from Facebook, and the third contains tweets from various users. The fourth group contains online reviews of products and services. The last group is a gender imitation corpus in which women imitate men and the other way round. The test corpus is spread across these different types of data so that the classifier has to be context-independent.

===II. Related Work===
The gender of the author is one of the characteristics that may affect the style of a written text. There are many papers on the automatic detection of personality traits from texts. Research on identifying an author's gender started as an extension of this work on the categorization and classification of text [7]. In the shared task on Author Profiling at PAN [2, 3], most participants used combinations of style-based features, such as the frequency of punctuation marks, capital letters and quotations, together with POS tags and content-based features such as Latent Semantic Analysis, bag-of-words, TF-IDF, dictionary-based words and topic-based words. Using these methods and features, researchers have automated the prediction of an author's gender with accuracies ranging from 80% to 90%. For instance, the winners of PAN 2015 obtained models that classify texts according to the gender of their authors with accuracy as high as 0.97 for Danish and Spanish and 0.86 for English [4].

Most studies of this problem have been done on English, and very little has been studied for Slavic languages. The authors of [5] discuss a property of the Russian language: the gender of the speaker can be known in Russian texts if a verb in a sentence is in the past form and the subject is the singular first-person pronoun "I". This property can be used to identify gender, and it has been used for gender classification in this paper. Many authors have considered deep neural networks for sentiment analysis, but very few [6] have addressed the gender classification problem using neural networks. This paper explores the potential of deep learning networks for the PAN task of gender identification in Russian texts.

===III. Data Analysis===
The training data provided [1] consisted of 600 XML files containing Russian text. Each file contains tweets in Russian, with the author's id as the file name. A separate file containing the label for each sample was provided. As the problem is a binary classification, the data was divided into two labels, "male" and "female". Class type "male" is assigned to tweets written by a male author and class type "female" to tweets written by a female author. The dataset contains 300 samples for each class.

Further analysis of the data showed that the Russian tweets in the training dataset also contain many English words. Using only Russian embeddings for the neural network would therefore not be sufficient, so English GloVe embeddings were also added to enhance our model. Later, the data was translated into English, and after analysis it was concluded that in Russian the gender of the speaker can be distinguished from the verb if the statement is in the past form. This analysis improved our results greatly.

===IV. Proposed Technique===
A hybrid approach has been proposed to identify the gender of the author from Russian texts. The proposed approach involves the following steps, which are combined to produce the final classifier:

• Preprocessing
• Rule-Based Classifier
• Neural Network
• Classification

====A. Preprocessing====
The given training data was preprocessed so that the classifiers could be applied efficiently and classify precisely. The data was also preprocessed to separate the class labels. The preprocessing of the dataset includes:

1) Retrieving the text from the XML file: Each XML file contains XML tags. These tags are not required for training, so the text was extracted from the XML tags.
2) Removal of emoticons, numbers and punctuation marks: After extracting the tweet text, emoticons and numbers were filtered out using the TweetTokenizer API of NLTK. This step also filters out unnecessary punctuation marks.
3) Filtering stopwords: Some common terms were dropped before creating the word embedding in order to train the model precisely. We took an available list of Russian stop words and customized it to our needs. For example, for the rule-based classifier, verbs should not be treated as stop words.
4) Case conversion: All data entries are converted to lowercase, that is, upper-case letters are replaced by their lowercase counterparts.

====B. Rule-Based Classifier====
The author's gender is known to be explicitly expressed in Russian texts if a verb in a sentence is in the past form and the subject is the singular first-person pronoun "я". Compare: "Прошлой зимой я ездила в Альпы" (Last winter I went to the Alps - a female speaker); "Прошлой зимой я ездил в Альпы" (Last winter I travelled to the Alps - a male speaker). If the subject is not the pronoun "я" or if the verb is not in the past form, the gender of the speaker is not explicit. Compare: "Я поеду в Альпы" (I'll go to the Alps); the gender of the speaker is not explicit.

It is worth emphasizing that the existence of grammatical forms which reflect the speaker's gender does not automatically make gender identification in Russian texts a trivial task. A narrative that is not in the first person does not indicate the gender of its author. Besides, it is easy for an author to imitate the speech of an individual of the other gender using the above forms.

The NLTK toolkit was used for part-of-speech (POS) tagging. If a term is tagged "VBD" or "VBN", it is treated as a past-tense verb, and the classifier then checks whether the verb ends with "ла" (feminine, as in "ездила") or "л" (masculine, as in "ездил"). If either suffix is found, the corresponding result is saved; otherwise the classifier moves on to check the remaining terms.

This rule was applied to the available training dataset and the texts were classified accordingly. If a text contains "Я" and a verb in the past tense, it is classified as male or female. The results are stored in a file, where 1 represents female and 0 represents male. If the rule is not satisfied by a text, the text is passed to our pre-trained neural network for classification.

====C. Neural Network====
The main advantage of using deep learning for classification is that we do not need to identify the features ourselves; the neural network trains the model on the attributes provided. LSTM (Long Short-Term Memory) is a special kind of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem that exists in plain RNNs. They can remember information for long periods of time, which makes them well suited to text. We have utilized this behaviour of LSTMs in our model so that it can learn more effectively.

The entire training data is split into training and validation data with a VALIDATION_SPLIT of 0.2. The input to the neural network is the text provided in the training dataset, and each text acts as one input. We padded the text sequences to a maximum length, which gives the final matrix a fixed dimension. An embedding is created for the words present in the text using the available pre-trained word embeddings for Russian and English.

This embedding matrix is fed to the first layer of the network, the Embedding layer. Its output is then fed to the LSTM layer with a filter width of 5, a value decided after experimentation. Then a Dense layer is used with sigmoid as the activation function. The optimizer used to train the neural network is RMSProp with a learning rate of 0.1. The final predictions are compared with the ground truth, the loss is calculated using binary cross-entropy on the validation data, and the model is trained accordingly. The final model is converted to JSON and stored, and the corresponding weights are stored in an h5 file. Table I lists the hyperparameters of the neural network. These parameters were decided after rigorous training and experimenting with various values.

{| class="wikitable"
|+ Table I: Hyperparameters for the neural network
! Hyperparameter !! Value
|-
| Learning rate || 0.1
|-
| Batch size || 60
|-
| Epochs || 15
|-
| Filter width || 5
|}

====D. Classification====
The final step of any classification problem is finding the class of the input data. Here, in author profiling, we need to find the gender of the author, which is a binary classification problem. The value associated with female is 1 and with male is 0. A hybrid approach has been used in which the rule-based classifier and the neural network are combined to solve the problem.

The input test data is preprocessed using the technique described in Section IV-A. After preprocessing, the text is given to the rule-based classifier, which checks whether the text can be classified by the rule. If it can, the result is stored in a file. If not, the data is fed to the pre-trained neural network and the prediction is stored in a separate file. Finally, we merge the two files to obtain the results for the entire test data.

Figure 1. Classifier for gender identification

The final output file contains the file name of each test sample and the corresponding prediction, given as 1 or 0, where 1 represents female and 0 represents male.

===V. Experiments===
The FIRE task involves classifying the author as male or female. The training dataset contained 600 tweet samples in Russian. Of those, 480 samples were used for training the model and the remaining 120 samples were used for testing. The test dataset included tweets, Facebook posts, online product reviews, texts describing images or letters to a friend, and a gender imitation corpus. The main objective of the task was to identify the gender of the author of each text. Five teams participated in total, and each team used three different approaches for classification and generated results as shown in Figure 2. Our proposed hybrid approach, which combines a rule-based classifier with a deep neural network, achieved an accuracy of 87.28%. We submitted a total of 5 runs based on 5 different classifiers (the rule-based classifier, a neural network using LSTM, a neural network using Bi-LSTM, and each of the two networks combined with the rule-based classifier). The hybrid approach performed well on all the test datasets. The results are discussed further below.

Figure 2. Comparison of highest accuracy for participating teams

The task was evaluated on the basis of accuracy, and our team BITS_PILANI achieved an accuracy of 87.28% on Facebook posts. Combining the rule-based classifier with the neural network resulted in the highest accuracy among all the test runs that were submitted as a part of the task. It is because of the neural network that the model also performs well on the "Gender Imitation" dataset. Figure 2 shows the accuracy of all the participating teams on the various test datasets. The test datasets are as follows:

1) Offline texts (picture descriptions, letters to a friend) from the RusPersonality corpus
2) Facebook posts
3) Twitter tweets
4) Product and service online reviews
5) Gender imitation corpus

The proposed approach achieves good accuracy because, in Russian, the gender of the author can be distinguished from verb forms.

===VI. Conclusion===
In this paper, a hybrid approach combining a rule-based classifier and a neural network has been proposed for the PAN task at FIRE 2017. The presented approach uses LSTM and Bi-LSTM as the neural network, trained on the given dataset. The trained network could perform better if trained on a larger dataset, since neural networks perform better when trained on large amounts of data. The results could be improved further if morphological, stylistic and content-based features were added to the rule-based classifier. It would be interesting to find language-based features and neural network hyperparameters that may further improve the accuracy.

===References===
[1] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, Olga Litvinova. Overview of the RUSProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian. In: Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings. CEUR-WS.org

[2] Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G. "Overview of the Author Profiling Task at PAN 2013", Notebook for PAN at CLEF 2013, Forner et al.

[3] Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W. "Overview of the 2nd Author Profiling Task at PAN 2014", CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org

[4] Rangel, F., Fabio, C., Rosso, P., Potthast, M., Stein, B., Daelemans, W. "Overview of the 3rd Author Profiling Task at PAN 2015", CEUR Workshop Proceedings. Toulouse, France

[5] T. Litvinova, P. Seredin, O. Litvinova, O. Zagorovskaya, A. Sboev, D. Gudovskih, I. Moloshnikov, R. Rybka. "Predicting the Gender of an Author of a Russian Text Using Regression and Classification Techniques", in Proc. of CDUD 2016

[6] Sboev, A., Litvinova, T., Voronina, I., Gudovskikh, D., Rybka, R. "Deep Learning Network Models to Categorize Texts According to Author's Gender and to Identify Text Sentiment", Proceedings of 2016 International Conference on Computational Science and Computational Intelligence

[7] Koppel, M., Argamon, S., Shimoni, A.R. "Automatically categorizing written texts by author gender", Literary Linguist. Comput.
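As an illustration of the rule described under "B. Rule-Based Classifier" and the fallback described under "D. Classification", the following is a minimal Python sketch, not the authors' implementation. All function names are hypothetical. The paper's POS-tagging step is replaced here by a crude suffix check (`naive_is_past_verb`), which is an assumption and will misfire on nouns ending in "ла" or "л"; the neural network is represented only by a caller-supplied `fallback` function. The sketch also checks only that "я" occurs somewhere in the text, not that it is the subject of the verb.

```python
import re
from typing import Callable, List, Optional

FEMALE, MALE = 1, 0  # label encoding stated in the paper: 1 = female, 0 = male


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only runs of Latin or Cyrillic letters."""
    return re.findall(r"[a-zа-яё]+", text.lower())


def naive_is_past_verb(token: str) -> bool:
    """ASSUMPTION: crude stand-in for the POS-tagging step in the paper.

    It looks only at the suffix, so nouns such as "школа" are wrongly accepted.
    """
    return len(token) > 2 and token.endswith(("л", "ла"))


def rule_based_gender(
    text: str,
    is_past_verb: Callable[[str], bool] = naive_is_past_verb,
) -> Optional[int]:
    """Return FEMALE or MALE if the rule fires, otherwise None."""
    tokens = tokenize(text)
    if "я" not in tokens:
        return None  # the rule needs the first-person singular pronoun
    for token in tokens:
        if not is_past_verb(token):
            continue
        if token.endswith("ла"):  # feminine past form, e.g. "ездила"
            return FEMALE
        if token.endswith("л"):   # masculine past form, e.g. "ездил"
            return MALE
    return None


def hybrid_predict(
    text: str,
    fallback: Callable[[str], int],
    is_past_verb: Callable[[str], bool] = naive_is_past_verb,
) -> int:
    """Use the rule when it applies; otherwise defer to `fallback`.

    `fallback` stands in for the pre-trained neural network (hypothetical).
    """
    label = rule_based_gender(text, is_past_verb)
    return fallback(text) if label is None else label
```

With the example sentences from the paper, `rule_based_gender` labels "Прошлой зимой я ездила в Альпы" as female and "Прошлой зимой я ездил в Альпы" as male, while "Я поеду в Альпы" yields no decision and would be passed to the fallback classifier. Passing a real POS-based predicate as `is_past_verb` would bring the sketch closer to the procedure the paper describes.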