=Paper= {{Paper |id=Vol-2036/T1-3 |storemode=property |title=Gender Identification in Russian Texts |pdfUrl=https://ceur-ws.org/Vol-2036/T1-3.pdf |volume=Vol-2036 |authors=Rupal Bhargava,Gunjan Goel,Anjali Shah,Yashvardhan Sharma |dblpUrl=https://dblp.org/rec/conf/fire/BhargavaGSS17 }} ==Gender Identification in Russian Texts== https://ceur-ws.org/Vol-2036/T1-3.pdf
                 Gender Identification in Russian Texts
                           Rupal Bhargava, Gunjan Goel, Anjali Shah, Yashvardhan Sharma
                          WiSoc Lab, Department of Computer Science and Information Systems
                               Birla Institute of Technology and Science, Pilani, India 333031
                         Email: {rupal.bhargava, h2016068, h2016066, yash}@pilani.bits-pilani.ac.in


   Abstract: Gender identification is the task of identifying the gender of the author of a written text. A hybrid approach has been designed for Russian texts by combining a deep neural network with a rule-based classifier. LSTM and Bi-LSTM have been used as the neural network because of their capability to learn long-term dependencies.

   Index Terms: Author Profiling, Deep Learning, NLP, Gender Identification, Rule-based Classification

                      I. INTRODUCTION

   The last few years have seen a large amount of research on automatically retrieving information from text, mainly information about its author (author profiling) such as gender and age. The automatic extraction of gender-related information from text is essential to forensics, security, and marketing. For example, a company may want to learn the gender of the people who like or dislike its products, so that it can determine which section of the market is dissatisfied and use that knowledge to improve sales.

   Most studies on classifying the gender of authors have been conducted on English texts such as blogs and Twitter posts, but very few deal with other languages, especially Slavic languages such as Russian. The subtask [1] of the PAN shared task at FIRE-2017 addresses cross-genre gender identification in Russian text (the RusProfiling shared task). The objective of this paper is to explore the possibility of automatically classifying Russian written texts according to their authors' gender using parameters that are largely context-independent.

   We give a brief description of the text corpora provided by PAN for evaluation. The corpus was divided into five subgroups. The first group contains offline texts (picture descriptions, letters to a friend, etc.) from the RusPersonality corpus. The second has posts from Facebook, and the third contains tweets from various users. The fourth group has online reviews of products and services. The last group is a gender imitation corpus in which women imitate men and the other way round. The test corpus spans these different types of data so that the classifier has to be context-independent.

                     II. RELATED WORK

   The gender of an author is one of the characteristics that may affect the style of a written text. There are many papers on automatic detection of personality traits using texts. Research on identifying an author's gender started with extensions of this work on categorization and classification of text [7]. In the shared tasks on Author Profiling at PAN [2, 3], most participants used combinations of style-based features such as the frequency of punctuation marks, capital letters, and quotations, together with POS tags and content-based features such as Latent Semantic Analysis, bag-of-words, TF-IDF, dictionary-based words, and topic-based words. Using these methods and features, researchers have automated the prediction of an author's gender with accuracies ranging from 80% to 90%. For instance, the winners of PAN 2015 obtained models that classify texts according to the gender of their authors with accuracy as high as 0.97 for Dutch and Spanish and 0.86 for English [4].

   Most studies of this problem have been done for the English language, and very few have examined Slavic languages. The authors of [5] discuss a property of the Russian language: the gender of the speaker can be known in Russian texts if a verb in a sentence is in the past form and the subject is the singular first-person pronoun "I". This property can be used to identify gender, and it has been used for classifying gender in this paper. Many authors have considered deep neural networks for sentiment analysis, but very few [6] have addressed the gender classification problem using neural networks. This paper explores the potential of deep learning networks for the PAN task of gender identification in Russian texts.

                     III. DATA ANALYSIS

   The training data provided [1] consisted of 600 XML files containing Russian text. Each file contains tweets in the Russian language, with the author's id as the file name. A separate file containing the label for each sample was provided. As the problem is a binary classification, the data was divided into two labels, "male" and "female". Class type "male" is assigned to tweets written by a male author and class type "female" to tweets written by a female author. The dataset contains 300 samples for each class.

   Further analysis of the data showed that the Russian tweets in the training dataset also contain many English words. Using only Russian embeddings for the neural network would therefore not be sufficient, so English GloVe embeddings were also added to enhance our model. Later, the data was translated into English, and after analysis it was concluded that the Russian language can distinguish the gender of the
speaker using the verb if the given statement is in the past form. This analysis improved our results greatly.

                 IV. PROPOSED TECHNIQUE

   A hybrid approach has been proposed to identify gender from Russian texts. The proposed methodology involves the following techniques, which are combined to generate the final classifier:
   • Preprocessing
   • Rule-Based Classifier
   • Neural Network
   • Classification

A. Preprocessing

   The given training data was preprocessed so that the classifiers could be applied efficiently and classify precisely. The data was also preprocessed to separate the class labels. The preprocessing of the dataset includes:
   1) Retrieving the text from the XML file: Each XML file contains XML tags. These tags are not required for training, and hence the text was extracted from the XML tags.
   2) Removal of emoticons, numbers, and punctuation marks: After extracting the tweet text, emoticons and numbers were filtered using the TweetTokenizer API of NLTK. This module also includes the filtering of unnecessary punctuation marks.
   3) Filtering stopwords: Some common terms were dropped before creating the word embedding in order to train the model precisely. An available list of Russian stop words was used and customized according to our needs. For example, for applying the rule-based classifier, the verb should not be considered a stop word.
   4) Case conversion: All data entries are converted into lowercase, that is, upper-case letters are replaced by their lowercase counterparts.

B. Rule-Based Classifier

   The author's gender is known to be explicitly expressed in Russian texts if a verb in a sentence is in the past form and the subject is the singular first-person pronoun "я". Compare: "Прошлой зимой я ездила в Альпы" (Last winter I travelled to the Alps, a female speaker); "Прошлой зимой я ездил в Альпы" (Last winter I travelled to the Alps, a male speaker). If the subject is not the pronoun "я" or if the verb is not in the past form, the gender of the speaker is not explicit. Compare: "Я поеду в Альпы" (I'll go to the Alps); the gender of the speaker is not explicit.

   It is worth emphasizing that the existence of grammatical forms which reflect the speaker's gender does not automatically make gender identification in Russian texts a trivial task. A non-first-person narrative does not indicate the gender of its author. Besides, it is easy for an author to imitate the speech of an individual of the other gender using the above forms. The NLTK toolkit has been used to generate part-of-speech (POS) tags. If a term has the tag "VBD" or "VBN", it is treated as a past-tense verb, and the classifier then checks whether the verb ends with "ла" or "л". If either ending is found, the corresponding result is saved; otherwise the classifier moves ahead to check the other terms.

   The above rule was applied to the available training dataset, and the texts were classified on that basis. If a text contains "Я" and the verb is in the past tense, it is classified as male or female. The results are stored in a file, where 1 represents female and 0 represents male. If the rule is not satisfied by a text, the text is passed to our pre-trained neural network for classification.

C. Neural Network

   The main advantage of using deep learning for classification is that we do not need to identify the features on our own: the neural network trains the model based on the attributes provided. The LSTM (Long Short-Term Memory) network is a special kind of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem that exists in plain RNNs. They can remember information for long periods of time, which makes them more useful for texts. We have utilized this behaviour of LSTMs in our model so that it can learn more effectively.

   The entire training data is split into training and validation data with a VALIDATION_SPLIT of 0.2. The input to the neural network is the text provided in the training dataset, and each text acts as one input. We padded the text sequences to a maximum length, which gives a fixed dimension in the final matrix. An embedding is created for the words present in the text using the available pre-trained word embeddings for Russian and English.

   This embedding matrix is fed to the first layer of the network, the Embedding layer. The output is then fed to the LSTM layer with a filter width of 5; this value was decided after experimentation. Then a Dense layer is used with sigmoid as the activation function. The optimizer used to train the neural network is RMSProp with a learning rate of 0.1. The final predictions are compared with the ground truth, the loss is calculated using binary cross-entropy on the validation data, and the model is trained accordingly. The final model is converted to JSON and stored, and the corresponding weights are stored in an h5 file. Table I contains all the required hyperparameters for the neural network. These parameters were decided after rigorous training and experimenting with various values.

                     Table I
       HYPERPARAMETERS FOR NEURAL NETWORK

       HyperParameter     Value
       Learning Rate      0.1
       Batch Size         60
       Epochs             15
       Filter Width       5

D. Classification

   The final step of any classification problem is finding the class of the input data. Here, in author profiling, we need to find the gender of the author, which is a binary classification problem. The value associated with female is 1 and with male is 0. A hybrid approach has been used in which the rule-based classifier and the neural network are combined to solve the problem.

   The input test data is preprocessed using the technique described in Section IV-A. After preprocessing, the text is given to the rule-based classifier, which checks whether the text can be classified by the rule. If so, the result is stored in a file. If not, the data is fed to the pre-trained neural network and the prediction is stored in a separate file. Finally, we merge the two files to obtain the results for the entire test data.
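The rule-then-fallback flow described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `rule_based_gender` and `hybrid_predict` are hypothetical, the past-tense check is approximated by looking at the ending of the word that directly follows the pronoun "я" (the paper instead relies on NLTK POS tags), and `fallback` stands in for the trained LSTM model.

```python
import re
from typing import Callable, Optional, Tuple

FEMALE, MALE = 1, 0  # label convention stated in the paper

# Lowercase Cyrillic words only; the input is lowercased first, as in preprocessing.
TOKEN_RE = re.compile(r"[а-яё]+")


def rule_based_gender(text: str) -> Optional[int]:
    """Return FEMALE/MALE if the rule fires, otherwise None.

    Assumption: the word right after the pronoun "я" is taken as the verb,
    and a past-tense form is approximated by the endings "ла" / "л".
    The paper uses POS tags for this step; this heuristic is a simplification.
    """
    tokens = TOKEN_RE.findall(text.lower())
    for i, token in enumerate(tokens[:-1]):
        if token != "я":
            continue
        candidate = tokens[i + 1]
        if candidate.endswith("ла"):  # check the longer ending first
            return FEMALE
        if candidate.endswith("л"):
            return MALE
    return None


def hybrid_predict(text: str, fallback: Callable[[str], int]) -> Tuple[int, str]:
    """Apply the rule first; hand the text to the fallback model if it does not fire."""
    label = rule_based_gender(text)
    if label is not None:
        return label, "rule"
    return fallback(text), "model"


if __name__ == "__main__":
    # Hypothetical stand-in for the pre-trained neural network.
    dummy_model = lambda text: MALE
    for sample in ("Прошлой зимой я ездила в Альпы",
                   "Прошлой зимой я ездил в Альпы",
                   "Я поеду в Альпы"):
        print(sample, hybrid_predict(sample, dummy_model))
```

Returning the source of each decision ("rule" or "model") alongside the label makes it easy to write the two result files separately and merge them afterwards, as the text describes.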




             Figure 1. Classifier for gender identification

   The final output file contains the file name of each test sample and the corresponding prediction, given as 1 or 0, where 1 represents female and 0 represents male.

                        V. EXPERIMENTS

   The FIRE task involves classifying the author as male or female. The training dataset contained 600 tweet samples in the Russian language. Among those, 480 samples were used for training the model and the remaining 120 samples were used for testing. The test dataset included tweets, Facebook posts, online product reviews, texts describing images or letters to a friend, and a gender imitation corpus. The main objective of the task was to identify the gender of the author of each text. In total, five teams participated, and each team used three different approaches for classification and generated results as shown in Figure 2. Our proposed hybrid approach, which combines a rule-based classifier with a deep neural network, achieved an accuracy of 87.28%. We submitted a total of 5 runs based on 5 different classifiers (the rule-based classifier, neural networks using LSTM and Bi-LSTM, and the combination of each with the rule-based classifier). The hybrid approach performed well on all the test datasets. The results are discussed below.

             Figure 2. Comparison of highest accuracy for participating teams

   The task was evaluated on the basis of accuracy, and our team BITS_PILANI achieved an accuracy of 87.28% for Facebook posts. Combining the rule-based classifier with the neural network resulted in the highest accuracy among all the test runs that were submitted as part of the task. It is because of the neural network that the model also performs well on the "Gender Imitation" dataset. Figure 2 shows the accuracy of all the participating teams for the various test datasets. The test datasets are as follows:
   1) Offline texts (picture descriptions, letter to a friend) from the RusProfiling corpus
   2) Facebook posts
   3) Twitter tweets
   4) Product and service online reviews
   5) Gender imitation corpus

   The proposed approach provides good accuracy owing to the fact that in Russian the gender of the author can be distinguished using verbs.

                        VI. CONCLUSION

   In this paper, a hybrid approach combining a rule-based classifier and a neural network has been proposed for the PAN task at FIRE 2017. The presented approach uses LSTM and Bi-LSTM as the neural network, trained with the given dataset. The trained network could perform better if trained on a larger dataset, as neural networks generally benefit from more training data. The results could be further improved if morphological, stylistic, and content-based features were added to the rule-based classifier. It would be interesting to find language-based features and neural network hyperparameters that further improve the accuracy.

                          REFERENCES
[1] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, Olga
    Litvinova. Overview of the RUSProfiling PAN at FIRE Track on Cross-
    genre Gender Identification in Russian. In: Notebook Papers of FIRE
    2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop
    Proceedings. CEUR-WS.org
[2] Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G. “Overview
    of the Author Profiling Task at PAN 2013”, Notebook for PAN at CLEF
    2013, Forner et al.
[3] Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein,
    B., Verhoeven, B., Daelemans, W. “Overview of the 2nd Author Profiling
    Task at PAN 2014”, CLEF 2014 Labs and Workshops, Notebook Papers.
    CEUR-WS.org
[4] Rangel, F., Fabio, C., Rosso, P., Potthast, M., Stein, B., Daelemans, W.
    “Overview of the 3rd Author Profiling Task at PAN 2015”, CEUR
    Workshop Proceedings. Toulouse, France
[5] Litvinova, T., Seredin, P., Litvinova, O., Zagorovskaya, O., Sboev, A.,
    Gudovskih, D., Moloshnikov, I., Rybka, R. “Predicting the Gender of an
    Author of a Russian Text Using Regression and Classification
    Techniques”, in Proc. of CDUD 2016
[6] Sboev, A., Litvinova, T., Voronina, I., Gudovskikh, D., Rybka, R. “Deep
    Learning Network Models to Categorize Texts According to Author’s
    Gender and to Identify Text Sentiment”, Proceedings of 2016 Interna-
    tional Conference on Computational Science and Computational Intelli-
    gence
[7] Koppel, M., Argamon, S., Shimoni, A.R. “Automatically categorizing
    written texts by author gender”, Literary Linguist. Comput.