<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Author Profiling Using Support Vector Machines Notebook for PAN at CLEF 2016</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rodwan</forename><forename type="middle">Bakkar</forename><surname>Deyab</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Departamento de Informática</orgName>
								<orgName type="department" key="dep2">Escola de Ciências e Tecnologia</orgName>
								<orgName type="institution">Universidade de Évora</orgName>
								<address>
									<addrLine>Rua Romão Ramalho, 59</addrLine>
									<postCode>7000-671</postCode>
									<settlement>Évora</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">José</forename><surname>Duarte</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Departamento de Informática</orgName>
								<orgName type="department" key="dep2">Escola de Ciências e Tecnologia</orgName>
								<orgName type="institution">Universidade de Évora</orgName>
								<address>
									<addrLine>Rua Romão Ramalho, 59</addrLine>
									<postCode>7000-671</postCode>
									<settlement>Évora</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Teresa</forename><surname>Gonçalves</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Departamento de Informática</orgName>
								<orgName type="department" key="dep2">Escola de Ciências e Tecnologia</orgName>
								<orgName type="institution">Universidade de Évora</orgName>
								<address>
									<addrLine>Rua Romão Ramalho, 59</addrLine>
									<postCode>7000-671</postCode>
									<settlement>Évora</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Author Profiling Using Support Vector Machines Notebook for PAN at CLEF 2016</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">FDAE52D9B0FCCFCAE2E36C246A2A0486</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:17+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>PAN</term>
					<term>CLEF</term>
					<term>Author Profiling</term>
					<term>Machine Learning</term>
					<term>Twitter</term>
					<term>Support Vector Machines</term>
					<term>Bag-of-Words</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The objective of this work is to identify the gender and age of the author of a set of tweets using Support Vector Machines. This work was done as a task for PAN 2016, which is part of the CLEF conference. Techniques like tagging, stopword removal, stemming and the Bag-of-Words representation were used to create a 10-class model. The model was tuned by grid search using k-fold cross-validation. The model was tested for precision and recall on the PAN 2015 and PAN 2016 corpora and the results are presented. We observed the Peaking Phenomenon as the number of features increased. In the future we plan to try term frequency-inverse document frequency to improve our results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The author profiling problem consists of detecting characteristics (e.g. age and gender) of the author of a piece of text based on the features (e.g. lexical, syntactical) of that text. Men and women, and people of different ages, write in different ways. Given a dataset written by authors with known characteristics, we can train a machine so that it can predict those characteristics for an unseen piece of text. The PAN 16 <ref type="foot" target="#foot_0">1</ref> author profiling task provides a dataset of tweets for developing an author profiling system; the task is to predict the age and the gender of the author. Machine learning techniques are well suited to this goal. Support Vector Machines (SVMs) <ref type="bibr" target="#b2">[3]</ref> can be used as a multi-class classifier, trained on the provided dataset to produce a model which can then be consulted on an unseen set of tweets to predict their author's age and gender. Bag-of-Words (BOW) <ref type="bibr" target="#b13">[14]</ref> is a simplified representation of a text corpus which contains all the words used in it together with their frequencies. The BOW representation is used in many areas, such as Natural Language Processing <ref type="bibr" target="#b12">[13]</ref>, Information Retrieval <ref type="bibr" target="#b4">[5]</ref> and Document Classification, among others <ref type="bibr" target="#b13">[14]</ref>. In our work we use SVMs and the BOW representation, relying on the Python machine learning library scikit-learn <ref type="bibr" target="#b6">[7]</ref>. After producing the best possible model trained on the PAN 16 author profiling dataset, we ran tests over the test sets provided by Tira <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b8">9]</ref>. 
The work presented in this paper was reviewed and is part of the PAN 2016 overview <ref type="bibr" target="#b10">[11]</ref>.</p><p>This paper is organized as follows: in section 2, the implementation is described; in section 3, we present the results with the selected features and evaluation criteria; in section 4, a retrospective analysis of the work is performed and future work is suggested.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Implementation</head><p>In this section we describe all the steps of creating the model. We first analyse the dataset, then present the architecture of the system, and finally explain its implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">The dataset</head><p>We used the dataset<ref type="foot" target="#foot_1">2</ref> provided by PAN 2016 in our study. The corpus contains 436 files, each containing a set of tweets, and these files were written by different authors. The mapping between each file and its author's characteristics is indexed by a file called the truth file. Its line structure is shown in <ref type="bibr" target="#b0">(1)</ref> and explained in Table <ref type="table">1</ref>.</p><p>AID ::: G ::: AR   Table <ref type="table" target="#tab_3">2</ref> shows the distribution of the data after analysing it. For example, the corpus contains 14 files written by female authors aged between 18 and 24.</p></div>
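A minimal sketch of how the truth file could be parsed, assuming the `AID ::: G ::: AR` line structure shown in (1); the function name and sample values are illustrative, not taken from the corpus:

```python
# Parse a PAN-style truth file: each line maps an author file to its
# labels, separated by ":::" as in the structure (1) above.

def parse_truth_file(text):
    """Return {author_id: (gender, age_range)} from truth-file text."""
    index = {}
    for line in text.strip().splitlines():
        author_id, gender, age_range = [f.strip() for f in line.split(":::")]
        index[author_id] = (gender, age_range)
    return index

sample = """\
a1b2c3:::FEMALE:::18-24
d4e5f6:::MALE:::35-49
"""
truth = parse_truth_file(sample)
```

Keying by author id makes it cheap to look up the target labels while iterating over the 436 corpus files.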
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">System Architecture</head><p>Our system has three modules: preprocessing, training and testing. In figure <ref type="figure" target="#fig_0">1</ref> we show the architecture of the system in the training phase. In figure <ref type="figure">2</ref> we present the architecture of the system in the testing phase. Both of them use the preprocessing module. Social media like Twitter are very noisy environments where informal texts thrive. As the space is noisy and does not comply with the syntactic rules of the natural language, NLP (Natural Language Processing) <ref type="bibr" target="#b12">[13]</ref> cannot be exploited to the best extent.</p><p>In our study, we use the BOW <ref type="bibr" target="#b13">[14]</ref> representation of the corpus as the set of features. Before the BOW generation the data is transformed. The objective is to optimize the BOW representation by reducing the word set of the corpus without losing information. This preprocessing is done in three steps.</p><p>The data in the corpus comes from Twitter and is naturally noisy, containing many abbreviations and special expressions. These special expressions can hold important clues that differentiate the characteristics of the authors. A regular expression parser was created to replace all of these special expressions with predefined tags. This first step groups expressions and reduces the word set without losing information. The list of tags, with a few examples of tokens replaced by them, is shown in Table <ref type="table" target="#tab_4">3</ref>.</p><p>The second step consists of removing the stopwords from the corpus. Stopwords are words like prepositions ("in", "on", "to") and conjunctions ("and", "or"); usually they carry no information despite being used very frequently. 
The Natural Language Toolkit (NLTK) <ref type="bibr" target="#b0">[1]</ref> provides a list of English stopwords, and so does scikit-learn <ref type="bibr" target="#b6">[7]</ref>; in the work presented, the two lists were merged and used to filter the corpus.</p><p>The third step is stemming. Stemming <ref type="bibr" target="#b5">[6]</ref> is the process of finding the root (the lemma) of a given word. Stemming is used in Information Retrieval <ref type="bibr" target="#b4">[5]</ref> so that, for example, words like "connect, connected, connecting, connection and connections" are all reduced to one search word, their stem "connect". It is useful for the BOW representation because it reduces the number of tokens, collapsing many words into their common root as if they were one word. NLTK provides many algorithms for stemming; we used the SnowballStemmer <ref type="bibr" target="#b7">[8]</ref> algorithm in our work. The result of the preprocessing module is the BOW model as a list of lists, where each inner list represents a file of the dataset. Each list's length is equal to the number of features chosen, and the numbers in it represent the frequency of each Bag-of-Words feature word in that file, in descending order.</p></div>
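The preprocessing steps above can be sketched, dependency-free, as follows. The tag patterns and stopword list are tiny illustrative subsets (the paper uses the full Table 3 parser and the merged NLTK/scikit-learn stopword lists), and the stemming step (NLTK's SnowballStemmer in the paper) is only indicated, not implemented:

```python
import re
from collections import Counter

# Illustrative subset of the tag substitutions from Table 3; the full
# parser in the paper covers many more expression types.
TAG_PATTERNS = [
    (re.compile(r"https?://\S+"), "_LINK_TAG"),
    (re.compile(r"@\w+"), "_MENTION_TAG"),
    (re.compile(r"#\w+"), "_HASHTAG_TAG"),
    (re.compile(r"\b(?:a*ha(?:ha)+|lol)\b", re.IGNORECASE), "_LAUGH_TAG"),
    (re.compile(r"[!?]{2,}"), "_PUNCTUATION_ABUSE_TAG"),
]

# Tiny stand-in for the merged NLTK + scikit-learn English stopword lists.
STOPWORDS = {"in", "on", "to", "and", "or", "the", "a", "is"}

def preprocess(text):
    """Tag special expressions, tokenize, lowercase and drop stopwords."""
    for pattern, tag in TAG_PATTERNS:
        text = pattern.sub(" " + tag + " ", text)
    # The SnowballStemmer would be applied to the surviving word tokens
    # here; omitted to keep this sketch dependency-free.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bow_vector(tokens, features):
    """Frequency of each chosen feature word in one file's token list."""
    counts = Counter(tokens)
    return [counts[f] for f in features]

tokens = preprocess("Check http://t.co/x @user haha!! on the beach")
```

Running each file through `preprocess` and then `bow_vector` with a fixed feature list yields the list-of-lists BOW model described above.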
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Training Module</head><p>Our training module is the core of the work done. It uses the preprocessing module to convert the training dataset to the BOW representation as explained before. Each word in the BOW is considered a feature. We do not use the whole BOW as features but limit their number; this is discussed in the results section. After getting the BOW representation of the dataset we divide it into two parts: one for training (two thirds of the whole dataset, which we call the development set) and one for testing (one third, which we call the evaluation set). We divide the dataset using the scikit-learn function train_test_split.</p><p>Then, this module seeks the best parameters with which to train an SVMs classifier on the development set. The parameters we tune for our SVMs classifier are the kernel, gamma and C. To achieve that we do hyperparameter tuning through a grid search provided by the scikit-learn library using the GridSearchCV function. We define the set of parameter values to be used by the grid search function as shown in Table <ref type="table" target="#tab_5">4</ref>.</p><p>Grid search runs stratified cross validation once for each pair of the parameters provided, keeping track of the results it gets. We used k-fold cross validation with k = 3. It is more usual to use this technique with k = 10, but due to the small number of files in some classes this was not possible: as can be seen in Table <ref type="table" target="#tab_3">2</ref>, some age ranges have only 3 elements (files). In other words, with classes containing so few files, a stratified cross validation with k = 10 could not be applied correctly.</p><p>We used the "rbf" (radial basis function) kernel in our work. We explain how grid search works with pseudo-code (Code 1). 
</p><p>After the model is trained on the development set with the best parameters, we test it on the evaluation set. From the result of this test we produce a classification report showing the results in terms of precision, recall, f1-score and support; this is discussed in the results section. We then used the best parameters obtained from the grid search to train a classifier on the whole PAN 16 dataset, producing the model we used for the Tira tests.</p></div>
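The loop of Code 1 can be made concrete in plain Python. Here `toy_evaluate` is a hypothetical stand-in for training an SVM with a (C, gamma) pair on the training folds and scoring the held-out fold; in the paper this whole loop is performed by scikit-learn's GridSearchCV:

```python
from statistics import mean

def k_folds(items, k=3):
    """Split items into k contiguous folds (last fold takes the remainder)."""
    size = len(items) // k
    folds = [items[i * size:(i + 1) * size] for i in range(k - 1)]
    folds.append(items[(k - 1) * size:])
    return folds

def grid_search(c_list, gamma_list, data, evaluate, k=3):
    """Code 1: for each (C, gamma) pair, run k-fold CV and keep the best mean."""
    folds = k_folds(data, k)
    best, best_score = None, float("-inf")
    for c in c_list:
        for gamma in gamma_list:
            scores = []
            for i in range(k):
                # Train on every fold except fold i, score on fold i.
                train = [x for j, fold in enumerate(folds) if j != i for x in fold]
                scores.append(evaluate(c, gamma, train, folds[i]))
            if mean(scores) > best_score:
                best, best_score = (c, gamma), mean(scores)
    return best, best_score

# Hypothetical scorer that peaks at the parameters reported in (2);
# a real evaluate() would train an SVM on `train` and score `held_out`.
def toy_evaluate(c, gamma, train, held_out):
    return -abs(c - 100) - abs(gamma - 0.0001)

best_params, best_cv_score = grid_search([0.01, 1, 100], [0.0001, 0.01],
                                         list(range(9)), toy_evaluate)
```

The mean CV score drives the selection here; GridSearchCV additionally reports the per-fold standard deviation used in Table 4's discussion.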
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Testing Module</head><p>This module again takes advantage of the preprocessing module to get the dataset into a suitable format (the BOW representation) to consult the model produced by the training module. It consults that model to predict the age and gender of the author of each file of the test dataset and produces an XML file for each of them. The XML file format is shown in Description 1. The set of XML files is the input of the Tira evaluation, where accuracy is calculated as a performance measure.</p><p>We note here that our system was developed for the English language only.</p><p>We did many tests over many datasets (their evaluation sets) using different sets of features. Our features, as mentioned before, are the words of the BOW, each word being considered a feature (taking its frequency in each document).</p><p>Our results are produced using classification_report, provided by scikit-learn, over the testing results on the evaluation sets. After we obtain the model using grid search over the development set, we use it to predict over the evaluation set and run the classification report over the prediction results. The classification report takes the real and predicted targets and calculates the precision, recall, f1-score and support for each predicted class, together with the averages of these metrics.</p><p>First we present some results of the tests on the PAN 16 dataset, which has ten classes.</p><p>In Table 5, we show the results using a number of features equal to 10000. 
In Table <ref type="table" target="#tab_7">6</ref>, we show the results using a number of features equal to 100.</p><p>We note here that class 10 does not appear in the classification report: the PAN 16 dataset, which contains 436 files, has only 3 files of this class, and the way we divided the dataset into a development set and an evaluation set left no file of class 10 in the evaluation set. Now we show some results of tests on the PAN 15 dataset, which has only eight classes. Using a number of features equal to 10000, we present the results in Table <ref type="table" target="#tab_8">7</ref>.</p><p>In Table <ref type="table" target="#tab_9">8</ref> we present the results using a number of features equal to 100.</p><p>We further discuss the results in the conclusion. </p></div>
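A minimal sketch of writing one per-author result element in the format of Description 1, using the standard library; the function name and sample values are illustrative:

```python
import xml.etree.ElementTree as ET

def author_xml(author_id, age_group, gender, lang="en"):
    """Build one per-author result element as specified in Description 1."""
    elem = ET.Element("author", {
        "id": author_id,
        "type": "not relevant",
        "lang": lang,              # en | es | nl
        "age_group": age_group,    # 18-24 | 25-34 | 35-49 | 50-64 | 65-xx
        "gender": gender,          # male | female
    })
    return ET.tostring(elem, encoding="unicode")

xml_out = author_xml("a1b2c3", "25-34", "female")
```

One such string would be written to its own file per test author before handing the directory to the Tira evaluation.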
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and future work</head><p>We decided to use the BOW representation as features for our classifier after observing the nature of texts in social media like Twitter. The process of building a parser to replace the special pieces of text which may be important in this kind of writing, and building the BOW (after stemming and stopword removal) of the resulting tagged text, may suit this task well. But selecting the right features for SVMs is not an easy task; there are many issues to take into consideration, and the scale range of each feature can be a problem <ref type="bibr" target="#b3">[4]</ref>. We notice that the results were better for PAN 15 than for PAN 16. That could be because of the tagging process: when we tag the dataset to match special mentions like links and smileys, these special mentions may be found more often in the PAN 15 dataset than in the PAN 16 dataset. In other words, the tagger's benefit is not guaranteed; it depends on the nature of the dataset.</p><p>We also notice, from these tests on the PAN 16 and PAN 15 datasets, that increasing the number of features does not necessarily mean better results. For example, when we used a number of features equal to 100 in the test done on the PAN 16 dataset, we got a precision equal to 0.3, the same precision we got with 10000 features on the same test. This is known as the Peaking Phenomenon <ref type="bibr" target="#b11">[12]</ref> (PP) and it can occur when using a high number of features: the performance of a model is not proportional to the number of features used; there is a point where the performance deteriorates as more features are added to the model. 
Procedures already presented in Section 2, like preprocessing the text with tagging, stopword removal and stemming before creating the BOW representation, can help minimize this problem.</p><p>There are many things that could be done or improved to continue this study. A true random search could be implemented to improve feature selection and parameter tuning. The system could also be improved by adding features extracted with respect to the natural language (syntactic and semantic features, for example); Natural Language Processing <ref type="bibr" target="#b12">[13]</ref> can be exploited to achieve that, though, as mentioned before, it may not be exploitable to the best extent given the noisy nature of this environment.</p><p>The use of the term frequency-inverse document frequency (tf-idf) technique <ref type="bibr" target="#b9">[10]</ref> and tuning of the maximum size of the BOW could help too. In fact, scikit-learn provides the necessary functions to use the tf-idf technique and it would be a good experiment for future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The architecture of the system: training phase</figDesc><graphic coords="3,180.12,294.15,255.11,240.94" type="bitmap" /></figure>
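As a dependency-free sketch of the tf-idf weighting suggested here (in practice scikit-learn's TfidfVectorizer would replace the raw BOW counts, with its own smoothing variant), using the textbook formula tf × log(N/df):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return, per tokenized document, a {term: tf-idf} mapping.

    tf is the raw count in the document; idf is log(N / df), where df is
    the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["happy", "tweet", "tweet"], ["sad", "tweet"], ["happy", "day"]]
weights = tf_idf(docs)
```

Terms that occur in every document get weight zero, while terms concentrated in few documents are boosted, which is exactly the discrimination the raw frequency features lack.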
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Description 1 :</head><label>1</label><figDesc>XML file format description: &lt;author id="author-id" type="not relevant" lang="en|es|nl" age_group="18-24|25-34|35-49|50-64|65-xx" gender="male|female" /&gt;</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The architecture of the system: testing phase</figDesc><graphic coords="4,180.12,115.84,255.13,240.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table</head><label></label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 :</head><label>2</label><figDesc>Distribution of the data in the corpus</figDesc><table><row><cell cols="4">Gender Age Range Number of files Total</cell></row><row><cell></cell><cell>18-24</cell><cell>14 (3%)</cell><cell></cell></row><row><cell>Females</cell><cell>25-34 35-49</cell><cell>70 (16%) 91 (20%)</cell><cell>218</cell></row><row><cell></cell><cell>50-64</cell><cell>40 (9%)</cell><cell></cell></row><row><cell></cell><cell>65-xx</cell><cell>3 (0.6%)</cell><cell></cell></row><row><cell></cell><cell>18-24</cell><cell>14 (3%)</cell><cell></cell></row><row><cell>Males</cell><cell>25-34 35-49</cell><cell>70 (16%) 91 (20%)</cell><cell>218</cell></row><row><cell></cell><cell>50-64</cell><cell>40 (9%)</cell><cell></cell></row><row><cell></cell><cell>65-xx</cell><cell>3 (0.6%)</cell><cell></cell></row><row><cell></cell><cell>Total</cell><cell></cell><cell>436</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 :</head><label>3</label><figDesc>Special tags used to preprocess the data corpus</figDesc><table><row><cell>Tag</cell><cell>Examples</cell></row><row><cell>_LINK_TAG</cell><cell>http://t.co/jtQvfIJIyg</cell></row><row><cell>_NOSY_EMOJI_TAG</cell><cell>:-) :-D :-(</cell></row><row><cell>_SIMPLE_EMOJI_TAG</cell><cell>:) :D :(</cell></row><row><cell>_FIGURE_EMOJI_TAG</cell><cell>(K) &lt;3</cell></row><row><cell cols="2">_FUNNY_EYES_EMOJI_TAG =) =D =(</cell></row><row><cell>_HORIZ_EMOJI_TAG</cell><cell>*.* o.O ^._</cell></row><row><cell>RUDE_TALK_TAG</cell><cell>F*** stupid</cell></row><row><cell>_LAUGH_TAG</cell><cell>haha Lol eheheeh</cell></row><row><cell cols="2">_PUNCTUATION_ABUSE_TAG !! ????</cell></row><row><cell>_EXPRESSIONS_TAG</cell><cell>ops whoa whow</cell></row><row><cell>_SHARE_PIC_TAG</cell><cell>[pic]</cell></row><row><cell>_MENTION_TAG</cell><cell>@username</cell></row><row><cell>_HASHTAG_TAG</cell><cell>#Paris</cell></row><row><cell>_NEW_LINE_TAG</cell><cell>a new line in the tweet</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 :</head><label>4</label><figDesc>Grid Search values for Gamma and C. Results include the cross-validated mean score and the standard deviation. The best parameters are those which produce the highest mean and the lowest standard deviation. For example, (2) is the result which refers to the best parameters after doing the grid search over the PAN 16 dataset.</figDesc><table><row><cell></cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>6</cell><cell>7</cell><cell>8</cell><cell>9</cell></row><row><cell>Gamma</cell><cell>0.0001</cell><cell>0.001</cell><cell>0.01</cell><cell>0.1</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>10000</cell></row><row><cell>C</cell><cell>0.0001</cell><cell>0.001</cell><cell>0.01</cell><cell>0.1</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>10000</cell></row></table><note>Code 1: Grid Search pseudo-code — for each_c in c_list: for each_gamma in gamma_list: results[i] = 3-fold_cross_validation(each_c, each_gamma). (2) kernel : rbf, gamma : 0.0001, C : 100</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5 :</head><label>5</label><figDesc>Results for PAN 16 corpus with 10000 features</figDesc><table><row><cell></cell><cell cols="5">precision recall f1-score support kernel gamma c</cell></row><row><cell>class1</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>2</cell><cell></cell><cell></cell></row><row><cell>class2</cell><cell>0.22</cell><cell>0.15 0.18</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class3</cell><cell>0.31</cell><cell>0.56 0.39</cell><cell>27</cell><cell></cell><cell></cell></row><row><cell>class4</cell><cell>0.25</cell><cell>0.08 0.12</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>class5 class6</cell><cell>0.00 0.00</cell><cell>0.00 0.00 0.00 0.00</cell><cell>1 2</cell><cell>rbf</cell><cell>0.0001 100</cell></row><row><cell>class7</cell><cell>0.43</cell><cell>0.46 0.44</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class8</cell><cell>0.29</cell><cell>0.29 0.29</cell><cell>34</cell><cell></cell><cell></cell></row><row><cell>class9</cell><cell>0.33</cell><cell>0.21 0.26</cell><cell>14</cell><cell></cell><cell></cell></row><row><cell cols="2">avg / total 0.30</cell><cell>0.31 0.29</cell><cell>144</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6 :</head><label>6</label><figDesc>Results for PAN 16 corpus with 100 features</figDesc><table><row><cell></cell><cell cols="5">precision recall f1-score support kernel gamma c</cell></row><row><cell>class1</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>2</cell><cell></cell><cell></cell></row><row><cell>class2</cell><cell>0.32</cell><cell>0.23 0.27</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class3</cell><cell>0.34</cell><cell>0.67 0.45</cell><cell>27</cell><cell></cell><cell></cell></row><row><cell>class4</cell><cell>0.22</cell><cell>0.17 0.19</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>class5 class6</cell><cell>0.00 0.00</cell><cell>0.00 0.00 0.00 0.00</cell><cell>1 2</cell><cell>rbf</cell><cell>0.01 10</cell></row><row><cell>class7</cell><cell>0.43</cell><cell>0.35 0.38</cell><cell>26</cell><cell></cell><cell></cell></row><row><cell>class8</cell><cell>0.29</cell><cell>0.29 0.29</cell><cell>34</cell><cell></cell><cell></cell></row><row><cell>class9</cell><cell>0.14</cell><cell>0.07 0.10</cell><cell>14</cell><cell></cell><cell></cell></row><row><cell cols="2">avg / total 0.30</cell><cell>0.32 0.30</cell><cell>144</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 7 :</head><label>7</label><figDesc>Results for PAN 15 corpus with 10000 features</figDesc><table><row><cell></cell><cell cols="5">precision recall f1-score support kernel gamma c</cell></row><row><cell>class1</cell><cell>0.71</cell><cell>0.45 0.56</cell><cell>11</cell><cell></cell><cell></cell></row><row><cell>class2</cell><cell>0.88</cell><cell>0.58 0.70</cell><cell>12</cell><cell></cell><cell></cell></row><row><cell>class3</cell><cell>1.00</cell><cell>0.29 0.44</cell><cell>7</cell><cell></cell><cell></cell></row><row><cell>class4</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>3</cell><cell></cell><cell></cell></row><row><cell>class5 class6</cell><cell>0.55 0.14</cell><cell>0.75 0.63 1.00 0.25</cell><cell>8 3</cell><cell>rbf</cell><cell>0.0001 100</cell></row><row><cell>class7</cell><cell>1.00</cell><cell>0.67 0.80</cell><cell>3</cell><cell></cell><cell></cell></row><row><cell>class8</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>4</cell><cell></cell><cell></cell></row><row><cell cols="2">avg / total 0.65</cell><cell>0.49 0.51</cell><cell>51</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 8 :</head><label>8</label><figDesc>Results for PAN 15 corpus with 100 features</figDesc><table><row><cell></cell><cell cols="6">Precision Recall F1-score Support Kernel Gamma C</cell></row><row><cell>Class 1</cell><cell>0.71</cell><cell>0.45 0.56</cell><cell>11</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 2</cell><cell>0.65</cell><cell>0.92 0.76</cell><cell>12</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 3</cell><cell>0.67</cell><cell>0.29 0.40</cell><cell>7</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 4</cell><cell>0.00</cell><cell>0.00 0.00</cell><cell>3</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 5 Class 6</cell><cell>0.60 0.18</cell><cell>0.75 0.67 0.67 0.29</cell><cell>8 3</cell><cell>rbf</cell><cell>0.01</cell><cell>10</cell></row><row><cell>Class 7</cell><cell>1.00</cell><cell>0.33 0.50</cell><cell>3</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class 8</cell><cell>1.00</cell><cell>0.25 0.40</cell><cell>4</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">Avg / Total 0.64</cell><cell>0.55 0.54</cell><cell>51</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://pan.webis.de/clef16/pan16-web/author-profiling.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Corpus available in http://pan.webis.de/clef16/pan16-web/author-profiling.html</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>We wish to thank the Departamento de Informática of the Escola de Ciências e Tecnologia of the Universidade de Évora for all the support given to our work.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Nltk: the natural language toolkit</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bird</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the COLING/ACL on Interactive presentation sessions</title>
				<meeting>the COLING/ACL on Interactive presentation sessions</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="69" to="72" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Burrows</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hoppe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Tjoa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Liddle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">D</forename><surname>Schewe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</editor>
		<meeting><address><addrLine>Los Alamitos, California</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2012-09">Sep 2012</date>
			<biblScope unit="page" from="151" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Support vector machines</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Osman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Platt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scholkopf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems and their Applications</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="18" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A practical guide to support vector classification</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">W</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Naive (Bayes) at forty: The independence assumption in information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine learning: ECML-98</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="4" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Development of a stemming algorithm</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Lovins</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1968">1968</date>
		</imprint>
		<respStmt>
			<orgName>MIT Information Processing Group, Electronic Systems Laboratory, Cambridge</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Porter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Boulton</surname></persName>
		</author>
		<title level="m">Snowball. Online; visited 25/02/2016</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Using tf-idf to determine word relevance in document queries</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ramos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the first instructional conference on machine learning</title>
				<meeting>the first instructional conference on machine learning</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhoeven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2016 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS</title>
		<imprint>
			<date type="published" when="2016-09">Sep 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The peaking phenomenon in the presence of feature-selection</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">R</forename><surname>Dougherty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1667" to="1674" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Cheap and fast-but is it good?: evaluating non-expert annotations for natural language tasks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Snow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the conference on empirical methods in natural language processing</title>
				<meeting>the conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="254" to="263" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Understanding bag-of-words model: a statistical framework</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">H</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Machine Learning and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1-4</biblScope>
			<biblScope unit="page" from="43" to="52" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
