<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Gender and language-variety identification with MicroTC: Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Eric</forename><forename type="middle">S</forename><surname>Tellez</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">CONACyT-INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación</orgName>
								<address>
									<settlement>México</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sabino</forename><surname>Miranda-Jiménez</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">CONACyT-INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación</orgName>
								<address>
									<settlement>México</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mario</forename><surname>Graff</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">CONACyT-INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación</orgName>
								<address>
									<settlement>México</settlement>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Daniela</forename><surname>Moctezuma</surname></persName>
							<email>dmoctezuma@centrogeo.edu.mx</email>
							<affiliation key="aff1">
								<orgName type="laboratory">CONACyT-CentroGEO Centro de Investigación en</orgName>
								<orgName type="institution">Geografía y Geomática &quot;Ing</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jorge</forename><forename type="middle">L</forename><surname>Tamayo</surname></persName>
						</author>
						<title level="a" type="main">Gender and language-variety identification with MicroTC: Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5A15CA6BBBC314714E11AF393362179B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this notebook, we describe our approach to the Author Profiling task at PAN17, which consists of both gender and language variety identification for Twitter users. We used our MicroTC (µTC) framework as the primary tool to create our classifiers. µTC follows a simple approach to text classification: it converts the problem of text classification into a model selection problem using several simple text transformations, a combination of tokenizers, and a term-weighting scheme, and finally classifies using a Support Vector Machine. Our approach reaches accuracies of 0.7838, 0.8054, 0.7957, and 0.8538 for gender identification, and 0.8275, 0.9004, 0.9554, and 0.9850 for language variety, for Arabic, English, Spanish, and Portuguese, respectively.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recently, forensic text analysis concerning originality, authorship, and reliability has attracted considerable attention from researchers and practitioners because of its practical applications in security and marketing <ref type="bibr" target="#b18">[19]</ref>. In this context, author profiling is an important task of the PAN@CLEF forum; it focuses on inferring characteristics of an author (profiling aspects) from the author's written text, such as gender, age, political preferences, personality, and language variety, among others <ref type="bibr" target="#b13">[14]</ref>.</p><p>Generally speaking, the author profiling task is tackled mainly with machine-learning approaches; i.e., models for predicting profiling aspects are built on a set of general features that represent different categories of authors, e.g., gender, age range, and language variety, among others <ref type="bibr" target="#b15">[16]</ref>.</p><p>The PAN 2017 forum<ref type="foot" target="#foot_0">3</ref> provides a dataset of tweets for training and testing the performance of each participating system. In this edition, the profiling aspects to be analyzed are the gender and language variety of Twitter users. The corpus is annotated with the authors' gender and the particular variety of their mother tongue; the languages covered are Arabic, English, Spanish, and Portuguese.</p><p>Our approach is language independent; that is, we deliberately avoid linguistic procedures such as part-of-speech tagging, lemmatization, or stemming. In the same way, linguistic resources such as lexicons and WordNet-based resources are disallowed. Instead, we take advantage of multiple tokenizers, an entropy-based term-weighting scheme, and an SVM classifier; see Section 3 for details.</p><p>The rest of the paper is organized as follows. Section 2 presents some related work on gender, age, language, and region identification, and Section 3 describes our system and the general approach to model the problem. Section 4 details the experimental methodology and the results achieved. Finally, conclusions and future work are given in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>Author profiling has been a recurring and important task in the PAN contest since 2013 <ref type="bibr" target="#b15">[16]</ref>. Before the 2017 edition, only age and gender classification tasks were considered <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. This year, PAN considers the region aspect and removes the age identification subtask from the competition <ref type="bibr" target="#b13">[14]</ref>.</p><p>Several works have been proposed to solve the age and gender identification subtasks. Agrawal &amp; Gonçalves <ref type="bibr" target="#b0">[1]</ref> use a combination of classifiers along with a model based on users' activities to predict the profile of unknown users. The TFIDF representation was employed, and dimensionality reduction was performed on the resulting matrix. The authors use Naive Bayes and linear SVM as classifiers.</p><p>To find the differences between the writing styles of males and females in different age groups, several stylometric features are considered in <ref type="bibr" target="#b3">[4]</ref>. Another stylometric approach was presented in <ref type="bibr" target="#b4">[5]</ref>, where two groups of features were considered: trigrams and complementary-weighted Second Order Attributes. An SVM classifier is used in the classification step. A combination of features based on word n-grams, sentences starting with capital letters, sentences ending with a period, emoticons, word length, and sentence length, along with grammatical aspects, is explored in <ref type="bibr" target="#b22">[23]</ref>.</p><p>Lopez-Monroy et al. <ref type="bibr" target="#b11">[12]</ref> propose a document representation that captures discriminative and subprofile-specific information of terms. Under the proposed representation, terms are represented in a vector space that captures discriminative information. On the other hand, more traditional representations, like TFIDF, are broadly employed in the author profiling literature, as in <ref type="bibr" target="#b7">[8]</ref>, <ref type="bibr" target="#b21">[22]</ref>, and <ref type="bibr" target="#b12">[13]</ref>. Classification ensembles are also frequently used; for instance, <ref type="bibr" target="#b23">[24]</ref> generate several classifiers using sets of features such as word n-gram, character n-gram, and part-of-speech n-gram features.</p><p>Language variety identification is a new subtask introduced in PAN17 that consists in determining the specific variety of the native language of an author's text <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref>. An approach to region classification is presented in <ref type="bibr" target="#b6">[7]</ref>, where Twitter geolocation and regional classification were conducted through sparse coding and dictionary learning. Another region prediction approach, based on Modified Adsorption, removing "celebrity" nodes, and analyzing propagation over a graph model, is proposed in <ref type="bibr" target="#b14">[15]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">System description</head><p>MicroTC (µTC) is a generic framework for text classification tasks; i.e., it works regardless of both domain and language particularities. µTC is an extension of our previous work on sentiment analysis; see <ref type="bibr" target="#b19">[20]</ref>. A full description of µTC can be found in <ref type="bibr" target="#b20">[21]</ref>. The core idea behind µTC is to tackle a text classification task by selecting an appropriate configuration from a set of text transformation techniques, tokenizers, and weighting schemes, using a Support Vector Machine (SVM) with a linear kernel as the classifier. In this sense, the text classification problem is transformed into a hyper-parameter optimization problem, also known as model selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">About µTC</head><p>Briefly, µTC contains the following parts: i) a list of functions that normalize and transform the input text into the input of the tokenizers (preprocessing); ii) a set of tokenizer functions that transform the filtered text into a multiset of tokens; iii) a function that generates weighted vectors from the multiset of tokens; and finally, iv) a classifier that assigns a label to a given vector.</p><p>i. Preprocessing functions. We use trivalent and binary parameters. Trivalent parameters can be set to one of {remove, group, none}, which means that a term matching the parameter is removed, grouped into a set of predefined classes, or left untouched. For this kind of parameter, µTC contains handlers for hashtags, numbers, URLs, users, and emoticons. Binary parameters are boolean and indicate whether the corresponding transformation is activated. In this parameter set, we support diacritic removal, character duplication removal, punctuation removal, and case normalization.</p><p>ii. Tokenizers. After all text normalization and transformation, a list of tokens is extracted. We allow n-grams of words (n = 1, 2, 3), q-grams of characters (q = 1, 3, 5, 7, 9), and skip-grams. For skip-grams, we allow a few tokenizers, such as two words with a gap of one, (2, 1), as well as (2, 2) and (3, 1). Instead of selecting a single tokenizer scheme, we allow any combination of the available tokenizers and take the union of the resulting multisets of tokens.</p><p>iii. Weighting schemes. After obtaining a multiset (bag of tokens) from the tokenizers, we must create a vector space. MicroTC allows the raw frequency and the TFIDF scheme to weight the coordinates of the vector. It also contains a number of frequency filters that were deactivated for this contribution; see <ref type="bibr" target="#b20">[21]</ref> for more details.</p><p>iv. Classifier. We decided to use a singleton set populated with an SVM with a linear kernel. It is well known that SVMs perform very well on inputs of very high dimension (which is our case), and the linear kernel also performs well under these conditions. We do not optimize the parameters of the classifier since we are mainly interested in the rest of the process. We use the SVM classifier from liblinear, Fan et al. <ref type="bibr" target="#b8">[9]</ref>.</p></div>
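<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the tokenizers listed above, the following minimal Python sketch builds word n-grams, character q-grams, and skip-grams and merges them into a single bag of tokens. It is not the µTC implementation: the function names, the "~" separator, and the "q:" and "s:" prefixes that keep token types apart are assumptions made for this example, and the input is assumed to be already preprocessed.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter


def word_ngrams(words, n):
    """Contiguous word n-grams; '~' as separator is an arbitrary choice."""
    return ["~".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    """Character q-grams taken over the whole text, spaces included."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def skip_grams(words, size, gap):
    """Skip-grams of `size` words with `gap` words skipped between picks."""
    step = gap + 1
    span = (size - 1) * step + 1
    return ["~".join(words[i:i + span:step]) for i in range(len(words) - span + 1)]


def tokenize(text, word_n=(1,), char_q=(), skips=()):
    """Merge the bags of every selected tokenizer, adding up the counts."""
    words = text.split()
    bag = Counter()
    for n in word_n:
        bag.update(word_ngrams(words, n))
    for q in char_q:
        # Hypothetical prefix so a 3-gram "the" differs from the word "the".
        bag.update("q:" + t for t in char_qgrams(text, q))
    for size, gap in skips:
        bag.update("s:" + t for t in skip_grams(words, size, gap))
    return bag


# Words, word bigrams, character 3-grams, and the (2, 1) skip-gram together.
bag = tokenize("the cat sat down", word_n=(1, 2), char_q=(3,), skips=((2, 1),))
```
]]></code></p>
<p>With the (2, 1) tokenizer, the sentence "the cat sat down" yields the skip-grams "the~sat" and "cat~down", that is, pairs of words separated by one skipped word.</p></div>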
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Modeling users</head><p>We chose to model a user using all of their tweets; that is, a user u is a collection of small texts u = {t_1, ..., t_n}. For each text, we apply the preprocessing step and the tokenizers, and then we create a multiset from the union of all multisets in u. After this, a vector u is created using a term-weighting scheme. Thus, we model each user as a high-dimensional sparse vector. Since we do not remove any kind of term, and in fact we promote the use of combinations of tokenizers, the users' vectors can contain millions of coordinates and thousands of non-zero entries.</p><p>The weighting schemes used for this modeling are described in the following paragraphs. We also introduce entropy+b, a new weighting scheme designed for classification tasks.</p><p>The simplest scheme is freq, defined as the frequency of each term per user; we name it freq_usr to avoid confusion with other functions. TFIDF is the product of TF and IDF, where TF is the normalized frequency of a user's term and IDF is the inverse document frequency, defined as the logarithm of the inverse of the probability that a term occurs in the whole collection of users. More precisely,</p><formula>\mathrm{TF}(w, usr) = \frac{\mathrm{freq}_{usr}(w)}{\max_{w \in usr} \{\mathrm{freq}_{usr}(w)\}},</formula><p>and</p><formula xml:id="formula_0">\mathrm{IDF}(w) = \log \frac{N}{|\{usr \mid \mathrm{freq}_{usr}(w) &gt; 0\}|},</formula><p>where N is the size of the training collection, i.e., the number of users. It is common to add 1 to the denominator to avoid numerical problems.</p><p>In this notebook, we introduce the entropy+b term weighting, which considers each term to be represented by a distribution over the available classes. Instead of using the raw probabilities per class, we weight each term with the entropy+b function, defined as follows:</p><formula xml:id="formula_1">\mathrm{entropy}_b(w) = \log |C| - \sum_{c \in C} p_c(w, b) \log \frac{1}{p_c(w, b)},</formula><p>where C is the set of classes and p_c(w, b) is the probability of term w in class c, parametrized with b. In more detail,</p><formula>p_c(w, b) = \frac{\mathrm{freq}_c(w)}{b \cdot |C| + \sum_{c \in C} \mathrm{freq}_c(w)}.</formula><p>Here, freq_c denotes the frequency of the given term in class c. The idea behind entropy_b(w) is to weight each term using the entropy of the underlying distribution, in such a way that terms with large entropy (terms uniformly distributed across all classes) receive a low weight, while terms skewed toward some class receive a weight close to log |C|. The parameter b is introduced to absorb the noise that may occur in sparsely populated terms.</p></div>
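<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighting schemes of this section can be sketched in a few lines of Python. The code below is only an illustration written from the formulas as they are printed in the text, not the µTC code; the function names and the dictionary-based representation of a user's bag are assumptions, as is the convention that a class with zero probability contributes nothing to the entropy sum.</p>
<p><code lang="python"><![CDATA[
```python
import math
from collections import Counter


def tf(user_bag):
    """TF(w, usr): frequency normalized by the user's most frequent term."""
    top = max(user_bag.values())
    return {w: f / top for w, f in user_bag.items()}


def idf(user_bags):
    """IDF(w) = log(N / number of users whose bag contains w)."""
    n_users = len(user_bags)
    df = Counter()
    for bag in user_bags:
        df.update(bag.keys())  # each user counts once per term
    return {w: math.log(n_users / d) for w, d in df.items()}


def entropy_b(class_freqs, b):
    """entropy_b(w) for one term, given its frequency in every class.

    Assumes the term occurs at least once or b > 0, so the denominator
    is never zero.
    """
    n_classes = len(class_freqs)
    denom = b * n_classes + sum(class_freqs)
    total = 0.0
    for f in class_freqs:
        p = f / denom
        if p > 0:  # assumed convention: 0 * log(1 / 0) counts as 0
            total += p * math.log(1.0 / p)
    return math.log(n_classes) - total


uniform = entropy_b([5, 5], b=0)   # term spread evenly over two classes
skewed = entropy_b([10, 0], b=0)   # term seen in a single class
```
]]></code></p>
<p>With two classes and b = 0, the evenly spread term receives weight 0, while the term that occurs in a single class receives log 2, the largest possible weight for |C| = 2.</p></div>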
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">About the model selection</head><p>The model selection is led by a performance function, score, that is maximized by a meta-heuristic. The only assumption is that score varies slowly across similar configurations, so that we can assume some degree of local concavity, in the sense that a local maximum can be reached using greedy decisions from a given point. Clearly, this is not true in general, and the solver must be robust enough to obtain a good approximation even when the assumption holds only to some degree. From a practical point of view, two configurations are similar if they differ structurally in a single parameter. We call the set of all configurations similar to a configuration m its neighborhood. Therefore, the core idea is to start from a set of random configurations, evaluate their neighborhoods, and greedily move to the most promising set of configurations. The procedure is repeated until some stopping condition is met, such as the impossibility of improving the score or reaching a maximum number of iterations. There are several meta-heuristics for solving combinatorial optimization problems, and a proper survey of the area is beyond the scope of this notebook; however, the interested reader is referred to <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b1">2]</ref>.</p><p>In particular, µTC uses two types of meta-heuristics: Random Search <ref type="bibr" target="#b2">[3]</ref> and Hill Climbing <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b1">2]</ref>. The former consists in randomly sampling the space of configurations C and selecting the best configuration in that sample. Given a pivot configuration, the main idea behind Hill Climbing is to explore the configuration's neighborhood and greedily move to the best neighbor. The process is repeated until no improvement is possible. We improve the whole optimization process by applying Hill Climbing to the best configuration found by Random Search. We also add memory to avoid evaluating a configuration twice <ref type="foot" target="#foot_1">4</ref> .</p></div>
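<div xmlns="http://www.tei-c.org/ns/1.0"><p>The combination of Random Search, Hill Climbing, and memory described above can be sketched as follows. This is a simplified illustration under our own assumptions rather than the µTC solver: the configuration space is a plain dictionary of parameter values, the names select_model, neighbors, and toy_score are hypothetical, and the toy score stands in for the cross-validated performance that a real run would compute.</p>
<p><code lang="python"><![CDATA[
```python
import random


def neighbors(config, space):
    """Configurations that differ from `config` in exactly one parameter."""
    for name, values in space.items():
        for value in values:
            if value != config[name]:
                yield {**config, name: value}


def select_model(space, score, n_random=32, seed=0):
    """Random search followed by hill climbing, caching every evaluation."""
    rng = random.Random(seed)
    memory = {}

    def evaluate(config):
        key = tuple(sorted(config.items()))
        if key not in memory:  # never score the same configuration twice
            memory[key] = score(config)
        return memory[key]

    # Random search: keep the best of n_random sampled configurations.
    sample = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_random)]
    best = max(sample, key=evaluate)

    # Hill climbing: move to the best neighbor while it improves the score.
    while True:
        candidate = max(neighbors(best, space), key=evaluate, default=best)
        if evaluate(candidate) <= evaluate(best):
            return best, evaluate(best)
        best = candidate


# Hypothetical search space and score, for illustration only.
space = {"lowercase": [True, False], "word_n": [1, 2, 3], "char_q": [1, 3, 5, 7, 9]}


def toy_score(cfg):
    bonus = 0.5 if cfg["lowercase"] else 0.0
    return bonus - abs(cfg["word_n"] - 2) - abs(cfg["char_q"] - 5)


best, value = select_model(space, toy_score, n_random=4)
```
]]></code></p>
<p>Because the toy score can always be improved by changing a single parameter until every parameter is at its best value, the climb ends at lowercase = True, word_n = 2, and char_q = 5, whatever the random starting sample is. Real score functions offer no such guarantee, which is why the procedure only finds a local maximum in general.</p></div>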
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments and results</head><p>The experiments with the training set were run on an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz with 32 threads and 192 GiB of RAM running CentOS 7.1 Linux. The gold standard was evaluated on the TIRA platform using a virtual machine with 4 GiB of RAM and one core. We implemented µTC<ref type="foot" target="#foot_2">5</ref> in Python.</p><p>We partition the full training dataset into two smaller sets: a new training set containing 30% of the users and a validation set with the remaining 70%. The partition was selected to ensure the generalization of our scheme. On the new training set, from now on just the training set, we run µTC using random search and hill climbing to perform the hyper-parameter optimization. Random search was allowed to select 32 random configurations. Hill climbing then starts from the best configuration found by random search and is left to run until its optimization process finishes. We use 3-fold cross-validation for the model selection procedure. Once the model selection finishes, we use the configuration found to train a µTC model on the whole (small) training set and measure the performance of that classifier on the validation set.</p><p>Table <ref type="table">1</ref>. Performance of our approaches for gender using a 30-70% partition for training and test datasets.</p><p>Table <ref type="table">1</ref> shows the performance of µTC for gender identification. In particular, we show macro-recall, macro-F1, and accuracy scores. We show three different term-weighting schemes, detailed in §3. We use the FREQ scheme as the baseline against which the improvement of each scheme is measured. The FREQ and TFIDF schemes are implemented in µTC; for entropy+b, we show the performance for five different values of b. Table <ref type="table">1</ref> indicates that TFIDF performs poorly compared with FREQ. Entropy+b shows a dependency on b, with better performance for small values of b, except b = 0, which performs poorly for gender identification. The table shows that b = 3 and b = 10 perform much better than the rest of the classifiers. Between entropy+3 and entropy+10, the former performs better; however, entropy+3 was evaluated after the deadline of the second run. Therefore, entropy+10 was used to classify the gold standard; see Table <ref type="table" target="#tab_1">3</ref>.</p><p>Table <ref type="table" target="#tab_0">2</ref> shows the performance of our systems in the language variety task. As before, we use FREQ as the baseline method. In this task, FREQ also performs better than TFIDF, except for English; both approaches are part of the µTC tool. The entropy+b scheme is much better for almost all of the presented values of b, even for b = 0. As in the gender identification task, smaller values of b perform better than larger values, achieving the best performance when b = 3. Nonetheless, we used entropy+10 to classify the gold standard because of the submission deadline.</p><p>The official performances on the PAN17 gold standard are shown in Table <ref type="table" target="#tab_1">3</ref>. We submitted our baseline based on the FREQ weighting scheme and the profiler based on entropy+10. The table reports the accuracy for the gender and variety tasks, as well as the joint accuracy (the same example was correctly predicted in both tasks). As anticipated by Tables <ref type="table" target="#tab_0">1 and 2</ref>, entropy+10 performs better than FREQ, in some languages by a large margin, e.g., close to five percentage points for Arabic and six percentage points for Portuguese.</p></div>
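<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to recompute the reported measures, the sketch below shows how accuracy, joint accuracy, and an improvement percentage could be obtained from gold and predicted labels. The labels are made up for the example, the function names are hypothetical, and reading the improvement column as the relative change in accuracy with respect to FREQ is our assumption; it is not stated in the text, although it agrees with entries such as the English TFIDF row of Table 2.</p>
<p><code lang="python"><![CDATA[
```python
def accuracy(gold, predicted):
    """Fraction of users whose label is predicted correctly."""
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)


def joint_accuracy(gold_gender, pred_gender, gold_variety, pred_variety):
    """Fraction of users with both gender and variety predicted correctly."""
    hits = sum(
        gg == pg and gv == pv
        for gg, pg, gv, pv in zip(gold_gender, pred_gender, gold_variety, pred_variety)
    )
    return hits / len(gold_gender)


def relative_improvement(baseline, value):
    """Change with respect to the baseline, in percent (assumed definition)."""
    return 100.0 * (value - baseline) / baseline


# Made-up labels for four users; they are not taken from the PAN17 corpus.
gold_gender = ["female", "male", "female", "male"]
pred_gender = ["female", "male", "male", "male"]
gold_variety = ["mexico", "spain", "mexico", "argentina"]
pred_variety = ["mexico", "spain", "mexico", "chile"]

gender_acc = accuracy(gold_gender, pred_gender)
variety_acc = accuracy(gold_variety, pred_variety)
joint_acc = joint_accuracy(gold_gender, pred_gender, gold_variety, pred_variety)

# English rows of Table 2: accuracy 0.7833 for FREQ and 0.7960 for TFIDF.
english_tfidf = relative_improvement(0.7833, 0.7960)
```
]]></code></p>
<p>In this toy example each task is solved for three of the four users, but only two users are right in both tasks, so the joint accuracy is lower than either individual accuracy.</p></div>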
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>In this notebook, we describe the INGEOTEC system used to solve the Author Profiling task at PAN17. We used our MicroTC (µTC) framework <ref type="bibr" target="#b20">[21]</ref> as the primary tool to create our classifiers. µTC follows a simple approach to text classification: it converts the problem of text classification into a model selection problem using several simple text transformations, a combination of tokenizers, a term-weighting scheme, and an SVM classifier. It is designed to tackle text classification problems in an agnostic way, being both domain and language independent.</p><p>To effectively tackle the task, we introduce a new term-weighting scheme based on the distributional representation of each term and the entropy of that distribution. We call it entropy+b. More work is needed to characterize the new weighting scheme, yet it proved superior to raw term frequency and TFIDF, at least for the Author Profiling task and our µTC framework.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>\sum_{c \in C} p_c(w, b) \log \frac{1}{p_c(w, b)}, where C is the set of classes, and p_c(w, b) is the probability of term w in class c, parametrized with b. In more detail, p_c(w, b) = \frac{\mathrm{freq}_c(w)}{b \cdot |C| + \sum_{c \in C} \mathrm{freq}_c(w)}.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2.</head><label>2</label><figDesc>Performance of our approaches for language variety using a 30-70% partition for training and test datasets.</figDesc><table><row><cell>name</cell><cell>macro-recall</cell><cell>macro-f1</cell><cell>accuracy</cell><cell>improvement</cell></row>
<row><cell></cell><cell></cell><cell>Arabic</cell><cell></cell><cell></cell></row>
<row><cell>µTC-FREQ</cell><cell>0.7577</cell><cell>0.7594</cell><cell>0.7581</cell><cell>-</cell></row>
<row><cell>µTC-TFIDF</cell><cell>0.7488</cell><cell>0.7488</cell><cell>0.7488</cell><cell>↓1.24%</cell></row>
<row><cell>µTC-entropy+0</cell><cell>0.8088</cell><cell>0.8098</cell><cell>0.8094</cell><cell>↑6.76%</cell></row>
<row><cell>µTC-entropy+3</cell><cell>0.8111</cell><cell>0.8118</cell><cell>0.8113</cell><cell>↑7.01%</cell></row>
<row><cell>µTC-entropy+10</cell><cell>0.8039</cell><cell>0.8047</cell><cell>0.8044</cell><cell>↑6.10%</cell></row>
<row><cell>µTC-entropy+30</cell><cell>0.8164</cell><cell>0.8169</cell><cell>0.8169</cell><cell>↑7.75%</cell></row>
<row><cell>µTC-entropy+100</cell><cell>0.8070</cell><cell>0.8073</cell><cell>0.8075</cell><cell>↑6.51%</cell></row>
<row><cell></cell><cell></cell><cell>English</cell><cell></cell><cell></cell></row>
<row><cell>µTC-FREQ</cell><cell>0.7834</cell><cell>0.7839</cell><cell>0.7833</cell><cell>-</cell></row>
<row><cell>µTC-TFIDF</cell><cell>0.7960</cell><cell>0.7957</cell><cell>0.7960</cell><cell>↑1.62%</cell></row>
<row><cell>µTC-entropy+0</cell><cell>0.8901</cell><cell>0.8902</cell><cell>0.8900</cell><cell>↑13.62%</cell></row>
<row><cell>µTC-entropy+3</cell><cell>0.8918</cell><cell>0.8921</cell><cell>0.8917</cell><cell>↑13.83%</cell></row>
<row><cell>µTC-entropy+10</cell><cell>0.8784</cell><cell>0.8787</cell><cell>0.8783</cell><cell>↑12.13%</cell></row>
<row><cell>µTC-entropy+30</cell><cell>0.8683</cell><cell>0.8687</cell><cell>0.8683</cell><cell>↑10.85%</cell></row>
<row><cell>µTC-entropy+100</cell><cell>0.8645</cell><cell>0.8649</cell><cell>0.8646</cell><cell>↑10.37%</cell></row>
<row><cell></cell><cell></cell><cell>Spanish</cell><cell></cell><cell></cell></row>
<row><cell>µTC-FREQ</cell><cell>0.9020</cell><cell>0.9022</cell><cell>0.9018</cell><cell>-</cell></row>
<row><cell>µTC-TFIDF</cell><cell>0.8948</cell><cell>0.8947</cell><cell>0.8954</cell><cell>↓0.71%</cell></row>
<row><cell>µTC-entropy+0</cell><cell>0.9573</cell><cell>0.9573</cell><cell>0.9571</cell><cell>↑6.14%</cell></row>
<row><cell>µTC-entropy+3</cell><cell>0.9537</cell><cell>0.9537</cell><cell>0.9536</cell><cell>↑5.74%</cell></row>
<row><cell>µTC-entropy+10</cell><cell>0.9437</cell><cell>0.9437</cell><cell>0.9436</cell><cell>↑4.63%</cell></row>
<row><cell>µTC-entropy+30</cell><cell>0.9272</cell><cell>0.9269</cell><cell>0.9268</cell><cell>↑2.77%</cell></row>
<row><cell>µTC-entropy+100</cell><cell>0.9109</cell><cell>0.9109</cell><cell>0.9107</cell><cell>↑0.99%</cell></row>
<row><cell></cell><cell></cell><cell>Portuguese</cell><cell></cell><cell></cell></row>
<row><cell>µTC-FREQ</cell><cell>0.9815</cell><cell>0.9812</cell><cell>0.9813</cell><cell>-</cell></row>
<row><cell>µTC-TFIDF</cell><cell>0.9737</cell><cell>0.9737</cell><cell>0.9738</cell><cell>↓0.76%</cell></row>
<row><cell>µTC-entropy+0</cell><cell>0.9852</cell><cell>0.9850</cell><cell>0.9850</cell><cell>↑0.38%</cell></row>
<row><cell>µTC-entropy+3</cell><cell>0.9901</cell><cell>0.9900</cell><cell>0.9900</cell><cell>↑0.89%</cell></row>
<row><cell>µTC-entropy+10</cell><cell>0.9852</cell><cell>0.9850</cell><cell>0.9850</cell><cell>↑0.38%</cell></row>
<row><cell>µTC-entropy+30</cell><cell>0.9876</cell><cell>0.9875</cell><cell>0.9875</cell><cell>↑0.64%</cell></row>
<row><cell>µTC-entropy+100</cell><cell>0.9852</cell><cell>0.9850</cell><cell>0.9850</cell><cell>↑0.38%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3.</head><label>3</label><figDesc>Performance of our approaches for language variety on the official PAN17 gold standard using µTC with two different term-weighting schemes.</figDesc><table><row><cell>name</cell><cell>language</cell><cell>gender accuracy</cell><cell>variety accuracy</cell><cell>joint accuracy</cell></row>
<row><cell>µTC-FREQ</cell><cell>ar</cell><cell>0.7569</cell><cell>0.7925</cell><cell>0.6125</cell></row>
<row><cell>µTC-entropy+10</cell><cell>ar</cell><cell>0.7838</cell><cell>0.8275</cell><cell>0.6713</cell></row>
<row><cell>µTC-FREQ</cell><cell>en</cell><cell>0.7938</cell><cell>0.8388</cell><cell>0.6704</cell></row>
<row><cell>µTC-entropy+10</cell><cell>en</cell><cell>0.8054</cell><cell>0.9004</cell><cell>0.7267</cell></row>
<row><cell>µTC-FREQ</cell><cell>es</cell><cell>0.7975</cell><cell>0.9364</cell><cell>0.7518</cell></row>
<row><cell>µTC-entropy+10</cell><cell>es</cell><cell>0.7957</cell><cell>0.9554</cell><cell>0.7621</cell></row>
<row><cell>µTC-FREQ</cell><cell>pt</cell><cell>0.8038</cell><cell>0.9750</cell><cell>0.7850</cell></row>
<row><cell>µTC-entropy+10</cell><cell>pt</cell><cell>0.8538</cell><cell>0.9850</cell><cell>0.8425</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://pan.webis.de/clef17/pan17-web/author-profiling.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">In principle, this is similar to Tabu search; however, our implementation is simpler than a typical implementation of Tabu search.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">Available under Apache 2 license at https://github.com/INGEOTEC/microTC</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>We would like to thank the PAN organizers, in particular Francisco Rangel and Martin Potthast, for their kind and quick responses to our questions and requests.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Age and gender identification using stacking for classification. Notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">M</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gonçalves</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page">2016</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Battiti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brunato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Mascia</surname></persName>
		</author>
		<title level="m">Reactive search and intelligent optimization</title>
				<imprint>
			<publisher>Springer Science &amp; Business Media</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">45</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Random search for hyper-parameter optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bergstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="281" to="305" />
			<date type="published" when="2012-02">Feb. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Caps: A cross-genre author profiling system</title>
		<author>
			<persName><forename type="first">I</forename><surname>Bilan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhekova</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Author profiling using complementary second order attributes and stylometric features</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bougiatiotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krithara</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Search methodologies</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">K</forename><surname>Burke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kendall</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Twitter geolocation and regional classification via sparse coding</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gwon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICWSM</title>
		<imprint>
			<biblScope unit="page" from="582" to="585" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Using machine learning algorithms for author profiling in social media</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dichiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Rancea</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">LIBLINEAR: A library for large linear classification</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Hsieh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">R</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1871" to="1874" />
			<date type="published" when="2008-08">Aug. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A low dimensionality representation for language variety identification</title>
		<author>
			<persName><forename type="first">Francisco</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marc</forename><surname>Franco-Salvador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016</title>
				<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Language variety identification using distributed representations of words and documents</title>
		<author>
			<persName><forename type="first">M</forename><surname>Franco-Salvador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Taulé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Martí</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="28" to="40" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Discriminative subprofile-specific representations for author profiling in social media</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>López-Monroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Escalante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Villaseñor-Pineda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0950705115002427" />
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">89</biblScope>
			<biblScope unit="page" from="134" to="147" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Adapting cross-genre author profiling to language and corpus</title>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gelbukh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the CLEF</title>
				<meeting>the CLEF</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="947" to="955" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;17: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Berlin Heidelberg New York</pubPlace>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Twitter user geolocation using a unified text and network prediction model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rahimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1506.08259</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2017 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS.org</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Overview of the 3rd Author Profiling Task at PAN 2015</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>CLEF</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhoeven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;16-New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 7th International Conference of the CLEF Initiative (CLEF 16)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Fuhr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Quaresma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Larsen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Gonçalves</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Berlin Heidelberg New York</pubPlace>
			<date type="published" when="2016-09">Sep 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A simple approach to multilingual polarity classification in Twitter</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Tellez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Miranda-Jiménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Graff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moctezuma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">S</forename><surname>Siordia</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0167865517301721" />
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">An automated text categorization framework based on hyperparameter optimization</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Tellez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moctezuma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Miranda-Jiménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Graff</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1704.01975</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J G</forename><surname>Ucelay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Villegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Funez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Cagnina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Errecalde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ramírez-de-la-Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Villatoro-Tello</surname></persName>
		</author>
		<title level="m">Profile-based approach for age and gender identification</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Busger op Vollenbroek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Carlotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kreutz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Medvedeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pool</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bjerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Haagsma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<title level="m">GronUP: Groningen user profiling</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Zahid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sampath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Farnadi</surname></persName>
		</author>
		<title level="m">Cross-genre age and gender identification in social media</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
