<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Author Profiling Using Support Vector Machines Notebook for PAN at CLEF 2016</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rodwan</forename><forename type="middle">Bakkar</forename><surname>Deyab</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Departamento de Informática</orgName>
								<orgName type="department" key="dep2">Escola de Ciências e Tecnologia</orgName>
								<orgName type="institution">Universidade de Évora</orgName>
								<address>
									<addrLine>Rua Romão Ramalho, 59</addrLine>
									<postCode>7000-671</postCode>
									<settlement>Évora</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">José</forename><surname>Duarte</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Departamento de Informática</orgName>
								<orgName type="department" key="dep2">Escola de Ciências e Tecnologia</orgName>
								<orgName type="institution">Universidade de Évora</orgName>
								<address>
									<addrLine>Rua Romão Ramalho, 59</addrLine>
									<postCode>7000-671</postCode>
									<settlement>Évora</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Teresa</forename><surname>Gonçalves</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Departamento de Informática</orgName>
								<orgName type="department" key="dep2">Escola de Ciências e Tecnologia</orgName>
								<orgName type="institution">Universidade de Évora</orgName>
								<address>
									<addrLine>Rua Romão Ramalho, 59</addrLine>
									<postCode>7000-671</postCode>
									<settlement>Évora</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Author Profiling Using Support Vector Machines Notebook for PAN at CLEF 2016</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">FDAE52D9B0FCCFCAE2E36C246A2A0486</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:17+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>PAN</term>
					<term>CLEF</term>
					<term>Author Profiling</term>
					<term>Machine Learning</term>
					<term>Twitter</term>
					<term>Support Vector Machines</term>
					<term>Bag-of-Words</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The objective of this work is to identify the gender and age of the author of a set of tweets using Support Vector Machines. This work was done as a task for PAN 2016, which is part of the CLEF conference. Techniques like tagging, stopword removal, stemming and the Bag-of-Words representation were used to create a 10-class model. The model was tuned by grid search using k-fold cross-validation. The model was tested for precision and recall on the PAN 2015 and PAN 2016 corpora and the results are presented. We observed the Peaking Phenomenon as the number of features increased. In the future we plan to try term frequency-inverse document frequency to improve our results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The author profiling problem consists of detecting characteristics (e.g. age and gender) of the author of a piece of text based on the features (e.g. lexical, syntactical) of that text. Men and women, and people of different ages, write in different ways. Given a dataset written by authors with known characteristics, we can train a machine so that it can predict those characteristics for an unseen piece of text. The PAN 16 <ref type="foot" target="#foot_0">1</ref> author profiling task provides a dataset of tweets for developing an author profiling system; the task is to predict the age and the gender of the author. Machine learning techniques are well suited to this goal. Support Vector Machines (SVMs) <ref type="bibr" target="#b2">[3]</ref> can be used as a multi-class classifier, trained on the provided dataset to produce a model which can then be consulted on an unseen set of tweets to predict their author's age and gender. Bag-of-Words (BOW) <ref type="bibr" target="#b13">[14]</ref> is a simplified representation of a text corpus which contains all the words used in it together with their frequencies. The BOW representation is used in many areas, such as Natural Language Processing <ref type="bibr" target="#b12">[13]</ref>, Information Retrieval <ref type="bibr" target="#b4">[5]</ref> and Document Classification, among others <ref type="bibr" target="#b13">[14]</ref>. In our work we use SVMs and the BOW representation, relying on the Python machine learning library scikit-learn <ref type="bibr" target="#b6">[7]</ref>. After producing the best possible model trained on the PAN 16 author profiling dataset, we ran tests over the test sets provided by Tira <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b8">9]</ref>. 
The work presented in this paper was reviewed and is part of the PAN 2016 overview <ref type="bibr" target="#b10">[11]</ref>.</p><p>This paper is organized as follows: in section 2, the implementation is described; in section 3, we present the results with the selected features and evaluation criteria; in section 4, a retrospective analysis of the work is performed and future work is suggested.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Implementation</head><p>In this section we describe all the steps of creating the model. We first analyse the dataset, then present the architecture of the system, and finally explain its implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">The dataset</head><p>We used the dataset<ref type="foot" target="#foot_1">2</ref> provided by PAN 2016 in our study. The corpus contains 436 files, each containing a set of tweets, and these files were written by different authors. The mapping between each file and its author's characteristics is indexed by a file called the truth file. Its line structure is shown in <ref type="bibr" target="#b0">(1)</ref> and explained in Table <ref type="table">1</ref>.</p><p>AID ::: G ::: AR   Table <ref type="table" target="#tab_3">2</ref> shows the distribution of the data after analysing it. For example, the corpus contains 14 files written by female authors aged between 18 and 24.</p></div>
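A minimal sketch of how the truth file could be parsed, assuming the `AID ::: G ::: AR` line structure shown in (1); the function name and sample values are illustrative, not taken from the corpus:

```python
# Parse a PAN-style truth file: each line maps an author file to its
# labels, separated by ":::" as in the structure (1) above.

def parse_truth_file(text):
    """Return {author_id: (gender, age_range)} from truth-file text."""
    index = {}
    for line in text.strip().splitlines():
        author_id, gender, age_range = [f.strip() for f in line.split(":::")]
        index[author_id] = (gender, age_range)
    return index

sample = """\
a1b2c3:::FEMALE:::18-24
d4e5f6:::MALE:::35-49
"""
truth = parse_truth_file(sample)
```

Keying by author id makes it cheap to look up the target labels while iterating over the 436 corpus files.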
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">System Architecture</head><p>Our system has three modules: preprocessing, training and testing. In figure <ref type="figure" target="#fig_0">1</ref> we show the architecture of the system in the training phase. In figure <ref type="figure">2</ref> we present the architecture of the system in the testing phase. Both of them use the preprocessing module. Social media like Twitter are very noisy environments where informal texts thrive. As the space is noisy and does not comply with the syntactic rules of the natural language, NLP (Natural Language Processing) <ref type="bibr" target="#b12">[13]</ref> cannot be exploited to the best extent.</p><p>In our study, we use the BOW <ref type="bibr" target="#b13">[14]</ref> representation of the corpus as the set of features. Before the BOW generation the data is transformed. The objective is to optimize the BOW representation by reducing the word set of the corpus without losing information. This preprocessing is done in three steps.</p><p>The data in the corpus comes from Twitter and is naturally noisy, containing many abbreviations and special expressions. These special expressions can hold important clues that differentiate the characteristics of the authors. A regular expression parser was created to replace all of these special expressions with predefined tags. This first step groups expressions and reduces the word set without losing information. The list of tags, with a few examples of tokens replaced by them, is shown in Table <ref type="table" target="#tab_4">3</ref>.</p><p>The second step consists of removing the stopwords from the corpus. Stopwords are words like prepositions ("in", "on", "to") and conjunctions ("and", "or"); usually they carry no information despite being used very frequently. 
The Natural Language Toolkit (NLTK) <ref type="bibr" target="#b0">[1]</ref> provides a list of English stopwords, and so does scikit-learn <ref type="bibr" target="#b6">[7]</ref>; in the work presented, the two lists were merged and used to filter the corpus.</p><p>The third step is stemming. Stemming <ref type="bibr" target="#b5">[6]</ref> is the process of finding the root (the lemma) of a given word. Stemming is used in Information Retrieval <ref type="bibr" target="#b4">[5]</ref> so that, for example, words like "connect, connected, connecting, connection and connections" are all reduced to one search word, their stem "connect". It is useful for the BOW representation because it reduces the number of tokens, collapsing many words into their common root as if they were one word. NLTK provides many algorithms for stemming; we used the SnowballStemmer <ref type="bibr" target="#b7">[8]</ref> algorithm in our work. The result of the preprocessing module is the BOW model as a list of lists, where each inner list represents a file of the dataset. Each list's length is equal to the number of features chosen, and the numbers in it represent the frequency of each Bag-of-Words feature word in that file, in descending order.</p></div>
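The preprocessing steps above can be sketched, dependency-free, as follows. The tag patterns and stopword list are tiny illustrative subsets (the paper uses the full Table 3 parser and the merged NLTK/scikit-learn stopword lists), and the stemming step (NLTK's SnowballStemmer in the paper) is only indicated, not implemented:

```python
import re
from collections import Counter

# Illustrative subset of the tag substitutions from Table 3; the full
# parser in the paper covers many more expression types.
TAG_PATTERNS = [
    (re.compile(r"https?://\S+"), "_LINK_TAG"),
    (re.compile(r"@\w+"), "_MENTION_TAG"),
    (re.compile(r"#\w+"), "_HASHTAG_TAG"),
    (re.compile(r"\b(?:a*ha(?:ha)+|lol)\b", re.IGNORECASE), "_LAUGH_TAG"),
    (re.compile(r"[!?]{2,}"), "_PUNCTUATION_ABUSE_TAG"),
]

# Tiny stand-in for the merged NLTK + scikit-learn English stopword lists.
STOPWORDS = {"in", "on", "to", "and", "or", "the", "a", "is"}

def preprocess(text):
    """Tag special expressions, tokenize, lowercase and drop stopwords."""
    for pattern, tag in TAG_PATTERNS:
        text = pattern.sub(" " + tag + " ", text)
    # The SnowballStemmer would be applied to the surviving word tokens
    # here; omitted to keep this sketch dependency-free.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bow_vector(tokens, features):
    """Frequency of each chosen feature word in one file's token list."""
    counts = Counter(tokens)
    return [counts[f] for f in features]

tokens = preprocess("Check http://t.co/x @user haha!! on the beach")
```

Running each file through `preprocess` and then `bow_vector` with a fixed feature list yields the list-of-lists BOW model described above.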
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Training Module</head><p>Our training module is the core of the work done. It uses the preprocessing module to convert the training dataset to the BOW representation as explained before. Each word in the BOW is considered a feature. We do not use the whole BOW as features but limit their number; this is discussed in the results section. After getting the BOW representation of the dataset we divide it into two parts: one for training (two thirds of the whole dataset, which we call the development set) and one for testing (one third, which we call the evaluation set). We divide the dataset using the scikit-learn function train_test_split.</p><p>Then, this module seeks the best parameters with which to train an SVMs classifier on the development set. The parameters we tune for our SVMs classifier are the kernel, gamma and C. To achieve that we do hyperparameter tuning through a grid search provided by the scikit-learn library using the GridSearchCV function. We define the set of parameter values to be used by the grid search function as shown in Table <ref type="table" target="#tab_5">4</ref>.</p><p>Grid search runs stratified cross validation once for each pair of the parameters provided, keeping track of the results it gets. We used k-fold cross validation with k = 3. It is more usual to use this technique with k = 10, but due to the small number of files in some classes this was not possible: as can be seen in Table <ref type="table" target="#tab_3">2</ref>, some age ranges have only 3 elements (files). In other words, with classes containing so few files, a stratified cross validation with k = 10 could not be applied correctly.</p><p>We used the "rbf" (radial basis function) kernel in our work. We explain how grid search works with pseudo-code (Code 1). 
</p><p>After the model is trained on the development set with the best parameters, we test it on the evaluation set. From the result of this test we produce a classification report showing the results in terms of precision, recall, f1-score and support; this is discussed in the results section. We then used the best parameters obtained from the grid search to train a classifier on the whole PAN 16 dataset, producing the model we used for the Tira tests.</p></div>
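The loop of Code 1 can be made concrete in plain Python. Here `toy_evaluate` is a hypothetical stand-in for training an SVM with a (C, gamma) pair on the training folds and scoring the held-out fold; in the paper this whole loop is performed by scikit-learn's GridSearchCV:

```python
from statistics import mean

def k_folds(items, k=3):
    """Split items into k contiguous folds (last fold takes the remainder)."""
    size = len(items) // k
    folds = [items[i * size:(i + 1) * size] for i in range(k - 1)]
    folds.append(items[(k - 1) * size:])
    return folds

def grid_search(c_list, gamma_list, data, evaluate, k=3):
    """Code 1: for each (C, gamma) pair, run k-fold CV and keep the best mean."""
    folds = k_folds(data, k)
    best, best_score = None, float("-inf")
    for c in c_list:
        for gamma in gamma_list:
            scores = []
            for i in range(k):
                # Train on every fold except fold i, score on fold i.
                train = [x for j, fold in enumerate(folds) if j != i for x in fold]
                scores.append(evaluate(c, gamma, train, folds[i]))
            if mean(scores) > best_score:
                best, best_score = (c, gamma), mean(scores)
    return best, best_score

# Hypothetical scorer that peaks at the parameters reported in (2);
# a real evaluate() would train an SVM on `train` and score `held_out`.
def toy_evaluate(c, gamma, train, held_out):
    return -abs(c - 100) - abs(gamma - 0.0001)

best_params, best_cv_score = grid_search([0.01, 1, 100], [0.0001, 0.01],
                                         list(range(9)), toy_evaluate)
```

The mean CV score drives the selection here; GridSearchCV additionally reports the per-fold standard deviation used in Table 4's discussion.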
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Testing Module</head><p>This module again takes advantage of the preprocessing module to get the dataset into a suitable format (the BOW representation) to consult the model produced by the training module. It consults that model to predict the age and gender of the author of each file of the test dataset and produces an XML file for each of them. The XML file format is shown in Description 1. The set of XML files is the input of the Tira evaluation, where accuracy is calculated as a performance measure.</p><p>We note here that our system was developed for the English language only.</p><p>We did many tests over many datasets (their evaluation sets) using different sets of features. Our features, as mentioned before, are the words of the BOW, each word being considered a feature (taking its frequency in each document).</p><p>Our results are produced using classification_report, provided by scikit-learn, over the testing results on the evaluation sets. After we obtain the model using grid search over the development set, we use it to predict over the evaluation set and run the classification report over the prediction results. The classification report takes the real and predicted targets and calculates the precision, recall, f1-score and support for each predicted class, together with the averages of these metrics.</p><p>First we present some results of the tests on the PAN 16 dataset, which has ten classes.</p><p>In Table 5, we show the results using a number of features equal to 10000. 
In Table <ref type="table" target="#tab_7">6</ref>, we show the results using a number of features equal to 100.</p><p>We note here that class 10 does not appear in the classification report: the PAN 16 dataset, which contains 436 files, has only 3 files of this class, and the way we divided the dataset into a development set and an evaluation set left no file of class 10 in the evaluation set. Now we show some results of tests on the PAN 15 dataset, which has only eight classes. Using a number of features equal to 10000, we present the results in Table <ref type="table" target="#tab_8">7</ref>.</p><p>In Table <ref type="table" target="#tab_9">8</ref> we present the results using a number of features equal to 100.</p><p>We further discuss the results in the conclusion. </p></div>
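A minimal sketch of writing one per-author result element in the format of Description 1, using the standard library; the function name and sample values are illustrative:

```python
import xml.etree.ElementTree as ET

def author_xml(author_id, age_group, gender, lang="en"):
    """Build one per-author result element as specified in Description 1."""
    elem = ET.Element("author", {
        "id": author_id,
        "type": "not relevant",
        "lang": lang,              # en | es | nl
        "age_group": age_group,    # 18-24 | 25-34 | 35-49 | 50-64 | 65-xx
        "gender": gender,          # male | female
    })
    return ET.tostring(elem, encoding="unicode")

xml_out = author_xml("a1b2c3", "25-34", "female")
```

One such string would be written to its own file per test author before handing the directory to the Tira evaluation.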
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and future work</head><p>We decided to use the BOW representation as features for our classifier after observing the nature of texts in social media like Twitter. The process of building a parser to replace the special pieces of text which may be important in this kind of writing, and building the BOW (after stemming and stopword removal) of the resulting tagged text, may suit this task well. But selecting the right features for SVMs is not an easy task; there are many issues to take into consideration, and the scale range of each feature can be a problem <ref type="bibr" target="#b3">[4]</ref>. We notice that the results were better for PAN 15 than for PAN 16. That could be because of the tagging process: when we tag the dataset to match special mentions like links and smileys, these special mentions may be found more often in the PAN 15 dataset than in the PAN 16 dataset. In other words, the tagger's benefit is not guaranteed; it depends on the nature of the dataset.</p><p>We also notice, from these tests on the PAN 16 and PAN 15 datasets, that increasing the number of features does not necessarily mean better results. For example, when we used a number of features equal to 100 in the test done on the PAN 16 dataset, we got a precision equal to 0.3, the same precision we got with 10000 features on the same test. This is known as the Peaking Phenomenon <ref type="bibr" target="#b11">[12]</ref> (PP) and it can occur when using a high number of features: the performance of a model is not proportional to the number of features used; there is a point where the performance deteriorates as more features are added to the model. 
Procedures already presented in Section 2, like preprocessing the text with tagging, stopword removal and stemming before creating the BOW representation, can help minimize this problem.</p><p>There are many things that could be done or improved to continue this study. A true random search could be implemented to improve feature selection and parameter tuning. The system could also be improved by adding features extracted with respect to the natural language (syntactic and semantic features, for example); Natural Language Processing <ref type="bibr" target="#b12">[13]</ref> can be exploited to achieve that, though, as mentioned before, it may not be exploitable to the best extent given the noisy nature of this environment.</p><p>The use of the term frequency-inverse document frequency (tf-idf) technique <ref type="bibr" target="#b9">[10]</ref> and tuning of the maximum size of the BOW could help too. In fact, scikit-learn provides the necessary functions to use the tf-idf technique and it would be a good experiment for future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The architecture of the system: training phase</figDesc><graphic coords="3,180.12,294.15,255.11,240.94" type="bitmap" /></figure>
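As a dependency-free sketch of the tf-idf weighting suggested here (in practice scikit-learn's TfidfVectorizer would replace the raw BOW counts, with its own smoothing variant), using the textbook formula tf × log(N/df):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return, per tokenized document, a {term: tf-idf} mapping.

    tf is the raw count in the document; idf is log(N / df), where df is
    the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["happy", "tweet", "tweet"], ["sad", "tweet"], ["happy", "day"]]
weights = tf_idf(docs)
```

Terms that occur in every document get weight zero, while terms concentrated in few documents are boosted, which is exactly the discrimination the raw frequency features lack.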
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Description 1 :</head><label>1</label><figDesc>XML file format description: &lt;author id="author-id" type="not relevant" lang="en|es|nl" age_group="18-24|25-34|35-49|50-64|65-xx" gender="male|female" /&gt;</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The architecture of the system: testing phase</figDesc><graphic coords="4,180.12,115.84,255.13,240.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table</head><label></label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 :</head><label>2</label><figDesc>Distribution of the data in the corpus</figDesc><table><row><cell cols="4">Gender Age Range Number of files Total</cell></row><row><cell></cell><cell>18-24</cell><cell>14 (3%)</cell><cell></cell></row><row><cell>Females</cell><cell>25-34 35-49</cell><cell>70 (16%) 91 (20%)</cell><cell>218</cell></row><row><cell></cell><cell>50-64</cell><cell>40 (9%)</cell><cell></cell></row><row><cell></cell><cell>65-xx</cell><cell>3 (0.6%)</cell><cell></cell></row><row><cell></cell><cell>18-24</cell><cell>14 (3%)</cell><cell></cell></row><row><cell>Males</cell><cell>25-34 35-49</cell><cell>70 (16%) 91 (20%)</cell><cell>218</cell></row><row><cell></cell><cell>50-64</cell><cell>40 (9%)</cell><cell></cell></row><row><cell></cell><cell>65-xx</cell><cell>3 (0.6%)</cell><cell></cell></row><row><cell></cell><cell>Total</cell><cell></cell><cell>436</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 :</head><label>3</label><figDesc>Special tags used to preprocess the data corpus</figDesc><table><row><cell>Tag</cell><cell>Examples</cell></row><row><cell>_LINK_TAG</cell><cell>http://t.co/jtQvfIJIyg</cell></row><row><cell>_NOSY_EMOJI_TAG</cell><cell>:-) :-D :-(</cell></row><row><cell>_SIMPLE_EMOJI_TAG</cell><cell>:) :D :(</cell></row><row><cell>_FIGURE_EMOJI_TAG</cell><cell>(K) &lt;3</cell></row><row><cell cols="2">_FUNNY_EYES_EMOJI_TAG =) =D =(</cell></row><row><cell>_HORIZ_EMOJI_TAG</cell><cell>*.* o.O ^._</cell></row><row><cell>RUDE_TALK_TAG</cell><cell>F*** stupid</cell></row><row><cell>_LAUGH_TAG</cell><cell>haha Lol eheheeh</cell></row><row><cell cols="2">_PUNCTUATION_ABUSE_TAG !! ????</cell></row><row><cell>_EXPRESSIONS_TAG</cell><cell>ops whoa whow</cell></row><row><cell>_SHARE_PIC_TAG</cell><cell>[pic]</cell></row><row><cell>_MENTION_TAG</cell><cell>@username</cell></row><row><cell>_HASHTAG_TAG</cell><cell>#Paris</cell></row><row><cell>_NEW_LINE_TAG</cell><cell>a new line in the tweet</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 :</head><label>4</label><figDesc>Grid Search values for Gamma and C. Results include the cross-validated mean score and the standard deviation. The best parameters are those which produce the highest mean and the lowest standard deviation. For example, (2) is the result which refers to the best parameters after doing the grid search over the PAN 16 dataset.</figDesc><table><row><cell></cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>6</cell><cell>7</cell><cell>8</cell><cell>9</cell></row><row><cell>Gamma</cell><cell>0.0001</cell><cell>0.001</cell><cell>0.01</cell><cell>0.1</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>10000</cell></row><row><cell>C</cell><cell>0.0001</cell><cell>0.001</cell><cell>0.01</cell><cell>0.1</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>10000</cell></row></table><note>Code 1: Grid Search pseudo-code — for each_c in c_list: for each_gamma in gamma_list: results[i] = 3-fold_cross_validation(each_c, each_gamma). (2) kernel : rbf, gamma : 0.0001, C : 100</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5 :</head><label>5</label><figDesc>Results for PAN 16 corpus with 10000 features</figDesc><table><row><cell></cell><cell cols="5">precision recall f1-score support kernel gamma c</cell></row><row><cell>class1</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>2</cell><cell></cell><cell></cell></row><row><cell>class2</cell><cell>0.22</cell><cell>0.15 0.18</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class3</cell><cell>0.31</cell><cell>0.56 0.39</cell><cell>27</cell><cell></cell><cell></cell></row><row><cell>class4</cell><cell>0.25</cell><cell>0.08 0.12</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>class5 class6</cell><cell>0.00 0.00</cell><cell>0.00 0.00 0.00 0.00</cell><cell>1 2</cell><cell>rbf</cell><cell>0.0001 100</cell></row><row><cell>class7</cell><cell>0.43</cell><cell>0.46 0.44</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class8</cell><cell>0.29</cell><cell>0.29 0.29</cell><cell>34</cell><cell></cell><cell></cell></row><row><cell>class9</cell><cell>0.33</cell><cell>0.21 0.26</cell><cell>14</cell><cell></cell><cell></cell></row><row><cell cols="2">avg / total 0.30</cell><cell>0.31 0.29</cell><cell>144</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6 :</head><label>6</label><figDesc>Results for PAN 16 corpus with 100 features</figDesc><table><row><cell></cell><cell cols="5">precision recall f1-score support kernel gamma c</cell></row><row><cell>class1</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>2</cell><cell></cell><cell></cell></row><row><cell>class2</cell><cell>0.32</cell><cell>0.23 0.27</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class3</cell><cell>0.34</cell><cell>0.67 0.45</cell><cell>27</cell><cell></cell><cell></cell></row><row><cell>class4</cell><cell>0.22</cell><cell>0.17 0.19</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>class5 class6</cell><cell>0.00 0.00</cell><cell>0.00 0.00 0.00 0.00</cell><cell>1 2</cell><cell>rbf</cell><cell>0.01 10</cell></row><row><cell>class7</cell><cell>0.43</cell><cell>0.35 0.38</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class8</cell><cell>0.29</cell><cell>0.29 0.29</cell><cell>34</cell><cell></cell><cell></cell></row><row><cell>class9</cell><cell>0.14</cell><cell>0.07 0.10</cell><cell>14</cell><cell></cell><cell></cell></row><row><cell cols="2">avg / total 0.30</cell><cell>0.32 0.30</cell><cell>144</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 7 :</head><label>7</label><figDesc>Results for PAN 15 corpus with 10000 features</figDesc><table><row><cell></cell><cell cols="5">precision recall f1-score support kernel gamma c</cell></row><row><cell>class1</cell><cell>0.71</cell><cell>0.45 0.56</cell><cell>11</cell><cell></cell><cell></cell></row><row><cell>class2</cell><cell>0.88</cell><cell>0.58 0.70</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>class3</cell><cell>1.00</cell><cell>0.29 0.44</cell><cell>7</cell><cell></cell><cell></cell></row><row><cell>class4</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>3</cell><cell></cell><cell></cell></row><row><cell>class5 class6</cell><cell>0.55 0.14</cell><cell>0.75 0.63 1.00 0.25</cell><cell>8 3</cell><cell>rbf</cell><cell>0.0001 100</cell></row><row><cell>class7</cell><cell>1.00</cell><cell>0.67 0.80</cell><cell>3</cell><cell></cell><cell></cell></row><row><cell>class8</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>4</cell><cell></cell><cell></cell></row><row><cell cols="2">avg / total 0.65</cell><cell>0.49 0.51</cell><cell>51</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 8 :</head><label>8</label><figDesc>Results for PAN 15 corpus with 100 features</figDesc><table><row><cell></cell><cell cols="6">Precision Recall F1-score Support Kernel Gamma C</cell></row><row><cell>Class 1</cell><cell>0.71</cell><cell>0.45 0.56</cell><cell>11</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 2</cell><cell>0.65</cell><cell>0.92 0.76</cell><cell>12</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 3</cell><cell>0.67</cell><cell>0.29 0.40</cell><cell>7</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 4</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>3</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 5 Class 6</cell><cell>0.60 0.18</cell><cell>0.75 0.67 0.67 0.29</cell><cell>8 3</cell><cell>rbf</cell><cell>0.01</cell><cell>10</cell></row><row><cell>Class 7</cell><cell>1.00</cell><cell>0.33 0.50</cell><cell>3</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 8</cell><cell>1.00</cell><cell>0.25 0.40</cell><cell>4</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">Avg / Total 0.64</cell><cell>0.55 0.54</cell><cell>51</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://pan.webis.de/clef16/pan16-web/author-profiling.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Corpus available in http://pan.webis.de/clef16/pan16-web/author-profiling.html</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>We wish to thank the Departamento de Informática of the Escola de Ciências e Tecnologia of the Universidade de Évora for all the support given to our work.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Nltk: the natural language toolkit</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bird</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the COLING/ACL on Interactive presentation sessions</title>
				<meeting>the COLING/ACL on Interactive presentation sessions</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="69" to="72" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Burrows</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hoppe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Tjoa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Liddle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">D</forename><surname>Schewe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</editor>
		<meeting><address><addrLine>Los Alamitos, California</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2012-09">Sep 2012</date>
			<biblScope unit="page" from="151" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Support vector machines</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Osman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Platt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scholkopf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems and their Applications</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="18" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A practical guide to support vector classification</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">W</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Naive (Bayes) at forty: The independence assumption in information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine learning: ECML-98</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="4" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Development of a stemming algorithm</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Lovins</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1968">1968</date>
		</imprint>
		<respStmt>
			<orgName>MIT Information Processing Group, Electronic Systems Laboratory, Cambridge</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Porter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Boulton</surname></persName>
		</author>
		<title level="m">Snowball. Online; visited 25/02/2016</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Using tf-idf to determine word relevance in document queries</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ramos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the first instructional conference on machine learning</title>
				<meeting>the first instructional conference on machine learning</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhoeven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2016 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS</title>
		<imprint>
			<date type="published" when="2016-09">Sep 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The peaking phenomenon in the presence of feature-selection</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">R</forename><surname>Dougherty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1667" to="1674" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Cheap and fast-but is it good?: evaluating non-expert annotations for natural language tasks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Snow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the conference on empirical methods in natural language processing</title>
				<meeting>the conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="254" to="263" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Understanding bag-of-words model: a statistical framework</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">H</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Machine Learning and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1-4</biblScope>
			<biblScope unit="page" from="43" to="52" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
