<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UniNE at CLEF 2017: Author Profiling Reasoning Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mirco</forename><surname>Kocher</surname></persName>
							<email>mirco.kocher@unine.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Dept</orgName>
								<orgName type="institution">University of Neuchâtel</orgName>
								<address>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jacques</forename><surname>Savoy</surname></persName>
							<email>jacques.savoy@unine.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Dept</orgName>
								<orgName type="institution">University of Neuchâtel</orgName>
								<address>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UniNE at CLEF 2017: Author Profiling Reasoning Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1425388A03A181A24FED973F65FE349E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes and evaluates a supervised author profiling model. The suggested strategy can be adapted without any problem to various languages (such as Arabic, English, Spanish, and Portuguese). As features, we suggest using the m most frequent terms of the query text (isolated words and punctuation symbols, with m at most 200). Applying a simple distance measure and looking at the nearest text profiles, we can determine the gender (with the nominal values "male" or "female") and the language variety (e.g., in Spanish the nominal values "Argentina", "Chile", "Colombia", "Mexico", "Peru", "Spain", or "Venezuela"). The training and test data consist of Twitter tweets (PAN AUTHOR PROFILING task at CLEF 2017). An analysis of the top-ranked terms from a feature selection method allows a better understanding of the proposed assignments and presents typical writing styles for each category.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Social network applications produce a large amount of information (e.g., texts, pictures, videos, and links) at an unprecedented scale. Texts shared on sites such as Facebook and Twitter have their own characteristics, vastly different from essays, literary texts, or newspaper articles, because anybody can publish unrevised content and interaction is expected to be fast. We can observe a large variability in spelling and grammar. Moreover, new terms tend to appear, and emoji are used frequently to denote the author's emotions or state of mind.</p><p>The central question is whether we can detect the author's gender from such writings, and what the significant differences between men and women in their writing styles are. Similarly, can we detect the features that best discriminate between different language varieties? The spelling difference between British English and American English is well defined, but can we detect a variation from the US to Canada, or between Ireland and Great Britain, and can we discriminate between New Zealand and Australia? Furthermore, since profiling is based on Twitter tweets, the spelling may not always be perfect, and more sociocultural traits could be detected. Other interesting problems emerging from blogs and social networks include detecting plagiarism, recognizing stolen identities, and rectifying wrong information about the writer. Therefore, proposing an effective algorithm for the profiling problem presents an indisputable interest.</p><p>These author profiling questions can be transformed into authorship attribution questions with a closed set of possible answers. Determining the gender of an author can be seen as attributing the text in question to either the male or the female authors. 
Similarly, language variety detection attributes an unknown Spanish text to one of seven possible groups.</p><p>This paper is organized as follows. The next section presents the test collections and the evaluation methodology used in the experiments. The third section explains our proposed algorithm. Then, we evaluate the proposed scheme and compare it to the best performing schemes using four different test collections. In the last section, we explain the decisions taken and extract typical writing styles for each category. The conclusion summarizes the main findings of this study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Test Collections and Evaluation Methodology</head><p>The experiments supporting previous studies were usually limited to custom corpora. To evaluate the effectiveness of different profiling algorithms, the number of tests must be large and run on a common test set. To create such benchmarks, and to promote studies in this domain, the PAN CLEF evaluation campaign was launched <ref type="bibr" target="#b5">[6]</ref>. Multiple research groups with different backgrounds from around the world have participated in the PAN CLEF 2017 campaign. Each team has proposed a profiling strategy that has been evaluated using the same methodology. The evaluation was performed using the TIRA platform, an automated tool for the deployment and evaluation of software <ref type="bibr" target="#b1">[2]</ref>. Data access is restricted such that during a software run the system is encapsulated, ensuring that there is no data leakage back to the task participants <ref type="bibr" target="#b4">[5]</ref>. This evaluation procedure also offers a fair evaluation of the time needed to produce an answer.</p><p>During the PAN CLEF 2017 evaluation campaign, three test collections were built. In this context, a problem is simply defined as:</p><p>Predict an author's language variety and gender from tweets. In each collection, all the texts are in the same language. The first benchmark is an Arabic collection with the goal of predicting four language varieties. The second is an English corpus containing six varieties, the third is written in Spanish and covers seven different varieties, while the last collection is in Portuguese, based on two language varieties. In all corpora, the additional task is to determine the author's gender. The training data was collected from Twitter. This year, everyone had access to the test data twice. 
This means we can train and test a basic approach, improve it, and test it again for the second and final run.</p><p>An overview of these collections is depicted in Table <ref type="table" target="#tab_0">1</ref>. The number of samples from the training set is given under the label "Samples" (each sample is a set of tweets), and the mean number of tokens (isolated words and punctuation symbols) per sample is indicated under the label "Terms". A similar test set is then used so that our results can be compared with those of the PAN CLEF 2017 campaign. Those datasets remained mostly undisclosed within the TIRA system, so we have no information about the average number of words per sample, but we expect a similar distribution.</p><p>Considering the four benchmarks together, we have 11,400 profiles in total to train our system. When inspecting the distribution of the answers, we find the same number (5,700 in training) of female and male profiles. Each of the individual test collections also contains a balanced number of female and male profiles. The same is the case for the language varieties, where each group has 600 samples. During the PAN CLEF 2017 campaign, a system must provide the answer for each problem in an XML structure. The response for the gender is a fixed binary choice, and for the language variety one of the fixed entries is expected. The final performance measure is the joint accuracy of the gender and variety: the number of problems for which both the gender and the language variety are correctly predicted, divided by the number of problems in the corpus.</p></div>
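The joint accuracy measure can be sketched in a few lines of Python (a hypothetical illustration with made-up labels, not the evaluation code used by TIRA):

```python
# Sketch of the joint accuracy measure: a problem counts as correct only if
# BOTH the gender and the language variety are predicted correctly.
def joint_accuracy(predictions, truth):
    """predictions, truth: lists of (gender, variety) pairs, one per problem."""
    correct = sum(1 for p, t in zip(predictions, truth) if p == t)
    return correct / len(truth)

# Toy example: the second problem gets the gender right but the variety
# wrong, so it does not count as a joint success.
pred = [("male", "Spain"), ("female", "Mexico"), ("female", "Peru")]
gold = [("male", "Spain"), ("female", "Chile"), ("female", "Peru")]
print(round(joint_accuracy(pred, gold), 4))  # 0.6667
```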
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Profiling Algorithm</head><p>To solve the profiling problem, we suggest a supervised approach based on feature extraction and a distance measure. The selected stylistic features correspond to the top m terms (isolated words without stemming, but with the punctuation symbols) ranked by the gain ratio formula shown in Equation <ref type="formula" target="#formula_0">1</ref>.</p><formula xml:id="formula_0">GainRatio(a, b, c, d) = (a/n) · log₂( a·n / ((a+b)·(a+c)) ) + (c/n) · log₂( c·n / ((a+c)·(c+d)) )<label>(1)</label></formula><p>where a, b, c, d, and n are used as indicated in Table <ref type="table" target="#tab_1">2</ref>. For instance, a represents the frequency of a given term ω (e.g., "the" or "people") in a class Γ (e.g., "female" or "Mexico"), while d is the total frequency of all other terms in all other classes. For determining the number of useful features, denoted m, previous studies have shown that a value between 200 and 300 tends to provide the best performance <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b6">7]</ref>. The Twitter tweets contained many different hashtags (keywords preceded by a number sign) and numerous unique hyperlinks. To minimize the number of terms with a single occurrence, we conflated all hashtags into a single feature and combined the morphological variants of Twitter links into another feature. The effective number of terms m was set to the 100 highest-ranked terms for each gender and the 70 highest-ranked terms for each language variety. In the first run we also included the 10 lowest-ranked terms as a counter-indication for a given category; this was omitted in the second run. Since there is some overlap when combining the highest-ranked terms of one class with another, the length of the generated feature list was below 400, even for the Spanish collection containing seven different language classes. 
With this reduced number, the justification of the decision will be simpler to understand, because it will be based on words instead of letters, bigrams of letters, or combinations of several representation schemes or distance measures.</p><p>In the current study, a profiling problem is defined as a query text, denoted Q, containing a set of Twitter tweets. We then have multiple authors A with a known profile. To measure the distance between Q and A, in the first run we used a variant of the L 1 norm called the Canberra distance, shown in Equation <ref type="formula">2</ref>, while in the second run we used a variant of the L 2 norm called the Clark distance, shown in Equation <ref type="formula" target="#formula_2">3</ref>:</p><formula xml:id="formula_1">Δ_Canberra(Q, A) = Σᵢ₌₁..ₘ |P_Q[fᵢ] − P_A[fᵢ]| / (P_Q[fᵢ] + P_A[fᵢ])</formula><p>(2)</p><formula xml:id="formula_2">Δ_Clark(Q, A) = √( Σᵢ₌₁..ₘ ( |P_Q[fᵢ] − P_A[fᵢ]| / (P_Q[fᵢ] + P_A[fᵢ]) )² )<label>(3)</label></formula><p>where m indicates the number of terms (words or punctuation symbols), and P_Q[fᵢ] and P_A[fᵢ] represent the estimated occurrence probability of the term fᵢ in the query text Q or in the author profile A, respectively. To estimate these probabilities, we divide the term occurrence frequency (denoted tfᵢ) by the length in tokens of the corresponding text (n): P[fᵢ] = tfᵢ / n. Because both equations rely on simple differences, we do not apply any smoothing procedure to our probability estimates.</p><p>To determine the gender and variety of Q, we take the k nearest neighbors in the m-dimensional vector space and use majority voting. In case of a tie between multiple language varieties, we select the nearest group among them. In the first run, the parameter k was set to k=9. In the second run, we increased k to k=15 for the two smaller collections (Arabic and Portuguese) and set k=25 for the two bigger corpora (English and Spanish). 
This decision was taken because of the relatively large amount of available data, and to obtain a more stable system less affected by outliers or the imperfections of Twitter tweets. A summary of all parameters for the two runs is presented in Table <ref type="table" target="#tab_2">3</ref>. </p></div>
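Putting Equations 1–3 and the k-nearest-neighbor vote together, the core of the method can be sketched as follows. This is a simplified illustration with our own function names and data layout, not the submitted system:

```python
import math
from collections import Counter

def gain_ratio(a, b, c, d):
    """Equation 1, from the contingency counts of Table 2:
    a = occurrences of the term in the class, b = in the other classes,
    c = other terms in the class, d = other terms in the other classes."""
    n = a + b + c + d
    score = 0.0
    if a > 0:
        score += (a / n) * math.log2(a * n / ((a + b) * (a + c)))
    if c > 0:
        score += (c / n) * math.log2(c * n / ((a + c) * (c + d)))
    return score

def canberra(pq, pa, features):
    """Equation 2: L1-style Canberra distance over estimated probabilities.
    Terms absent from both texts are skipped to avoid division by zero."""
    return sum(abs(pq[f] - pa[f]) / (pq[f] + pa[f])
               for f in features if pq[f] + pa[f] > 0)

def clark(pq, pa, features):
    """Equation 3: L2-style Clark distance."""
    return math.sqrt(sum((abs(pq[f] - pa[f]) / (pq[f] + pa[f])) ** 2
                         for f in features if pq[f] + pa[f] > 0))

def probabilities(tokens, features):
    """P[f] = tf / n, with no smoothing, as described above."""
    tf = Counter(tokens)
    n = len(tokens)
    return {f: tf[f] / n for f in features}

def knn_vote(query_tokens, profiles, features, k, dist=canberra):
    """profiles: list of (label, tokens) pairs with known classes.
    Returns the majority label among the k nearest profiles."""
    pq = probabilities(query_tokens, features)
    ranked = sorted(profiles,
                    key=lambda p: dist(pq, probabilities(p[1], features), features))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

A tie-break by nearest group, the hashtag/hyperlink conflation, and the per-class merging of top-ranked terms would sit on top of these helpers.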
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>Our system is based on a supervised approach, so we could evaluate it using a modified leave-one-out approach on the training set. Instead of retrieving the k nearest neighbors, we returned k+1 candidates but ignored the closest profile. The nearest sample was in fact the query text itself, with a distance of zero, and thus also served as a check of correctness. In Table <ref type="table" target="#tab_3">4a</ref> and Table <ref type="table" target="#tab_4">4b</ref>, we report the same performance measure applied during the PAN 2017 campaign, namely the joint accuracy of the gender and language variety. The algorithm clearly returns the best results for the Portuguese collection, as a result of both the high gender detection accuracy and the high language variety prediction accuracy. Given the leave-one-out approach and the large size of all collections, we expect the results to be robust and a good predictor of performance on the test dataset. The test set is then used to rank the performance of all 22 participants in the competition. Based on the same evaluation methodology, we achieve the results depicted in Table <ref type="table" target="#tab_5">5a</ref> and Table <ref type="table" target="#tab_6">5b</ref>, corresponding to our two runs for all problems present in the four test collections. As we can see, the joint scores on the test corpus are very similar to the training results. For the Arabic and English corpora, we can see a close resemblance to the corresponding results on the training collections. In the Spanish collection, the test performance is marginally higher (+3.5% change, +8.4% difference), while for the Portuguese dataset, the results are slightly lower (-2.8% change, -3.5% difference). Overall, the system's performance seems stable, independent of the underlying text collection. This year, there were 22 participants, and the task organizers provided three additional baselines 1 . 
To put the performance values from Table <ref type="table" target="#tab_6">5b</ref> in perspective, Table <ref type="table" target="#tab_7">6</ref> compares our results with those of the best participant, the three baselines, and the mean performance over all participants. The columns with the average gender score, the average language variety score, and the average joint score are each the mean over all four languages. The final overall value for the ranking is the mean of those three average values. Overall, we are at rank 16 2 , which is above the average PAN scores and two of the provided baselines. </p></div>
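The modified leave-one-out procedure described in this section can be sketched as follows (a hypothetical helper, assuming precomputed distances from one training sample to all training samples):

```python
from collections import Counter

def loo_predict(distances, labels, k):
    """Predict a training sample's class from its distances to ALL training
    samples (itself included): retrieve k+1 neighbors and drop the nearest,
    which must be the sample itself at distance zero."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    # Sanity check: the closest profile is the query text itself,
    # which doubles as a correctness check on the distance computation.
    assert distances[order[0]] == 0.0
    votes = Counter(labels[i] for i in order[1:k + 1])
    return votes.most_common(1)[0][0]

# Toy example: sample 0 is the query; its two nearest other samples vote "b".
print(loo_predict([0.0, 0.2, 0.1, 0.5], ["a", "b", "b", "a"], k=2))  # b
```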
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Decision Explanation</head><p>When analyzing the top-ranked terms from the feature selection method between the two genders or among the language variety groups, we can obtain a better understanding of the proposed assignments. The gain ratio selects both features that are over-represented in a category and features whose rarity is a counter-indication of a given category. Thus, the selected features are usually the same for both gender classes. To present typical features for each category individually, we use the mutual information for the terms in Table <ref type="table" target="#tab_8">7</ref>. This feature selection method assigns a high value only to the overused terms, which gives us a clearer differentiation <ref type="foot" target="#foot_0">3</ref> . In many cases, the different usage of geographical and topical terms can explain the classification decision. Some location-related terms are, for instance, in Arabic ‫الكويت(‬ = Kuwait, ‫االردن‬ = Jordan, ‫طرابلس‬ = Tripoli, ‫الجزائر‬ = Algeria, ‫تونس‬ = Tunis, ‫التونسي‬ = Tunisia), in English (Canberra, Sydney, aust, Adelaide, jp, aus, Vancouver, Toronto, Edinburgh, Glasgow, Bristol, Dublin, Ireland, Belfast, Wellington, Auckland, nz, Zealand, Dunedin, DC), in Spanish (chilenos, Bogotá, Cali, Medellín, mx, Monterrey, Lima, peruano(s), Perú, Peru, Alcalá, Cataluña, Zulia, Caracas, venezolanos), and in Portuguese (Brasil, Portugal).</p><p>For topical words, we have different examples in Arabic ‫مدرب(‬ = coach; ‫الدوري‬ = league; ‫#صالة‬ = #Prayer), in English (NHL; makeup; Microsoft), in Spanish (lagos = lakes, forestales = forests, incendios = fires, viña = vineyard, medicinas = medicines), and in Portuguese (campeonato = championship, jogador = player, ranking).</p><p>Additionally, names of famous people in politics, music, and sports appear frequently, such as in Arabic ‫عايزة(‬ = Aiza), in Spanish (Zidane, Macri, Piñera, Duarte, Goya, Rajoy), 
in English (Turnbull, Abbott, Malcom, Reuters, Jedward, Byrne, Conor, Ethan), and in Portuguese (Eduardo).</p><p>Very frequent terms such as pronouns and determiners also appear among the top 10 highest-ranked terms. There are examples in Arabic ‫إنى(‬ = I am; ‫انتى‬ = you; ‫ده‬ = this), in Spanish (nosotras = we; vos, os, vosotros = you), and in Portuguese (vc, você = you; tô = I am).</p><p>Furthermore, the frequent appearance of various heart-shaped emoji in the female categories of Table <ref type="table" target="#tab_8">7</ref> in all four languages confirms previous findings that women tend to use more expressions related to social and emotional words than men <ref type="bibr" target="#b3">[4]</ref>. </p></div>
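The mutual information used for Table 7 can be computed from the same contingency counts as Equation 1. The sketch below is our own formulation of pointwise mutual information under that assumption, not code from the submitted system:

```python
import math

def mutual_information(a, b, c, d):
    """Pointwise mutual information of a term with a class, from the
    contingency counts of Table 2: log2( P(term, class) / (P(term) * P(class)) ).
    Unlike the gain ratio, it is high only when the term is OVER-represented
    in the class, which yields the cleaner per-category lists of Table 7."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never occurs in the class
    return math.log2(a * n / ((a + b) * (a + c)))

# An over-represented term scores positive, an under-represented one negative.
print(mutual_information(80, 20, 20, 80) > 0)   # True
print(mutual_information(20, 80, 80, 20) < 0)   # True
```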
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This paper proposes a supervised technique to solve the author profiling problem. Assuming that a person's writing style may reveal his or her demographics, we propose characterizing the style by considering terms (isolated words and punctuation symbols) selected using the gain ratio method. To make the profiling decision, we propose using the k nearest neighbors according to a distance measure based on the L 1 or L 2 norm. The proposed approach tends to perform very well on Portuguese Twitter tweets for both gender and language variety prediction. The performance of the gender detection in Arabic, English, and Spanish was acceptable, while the language variety classification was good considering the large number of categories. The final results on the test collections were as expected from the training corpora, indicating that no over-fitting occurred. Such a classifier strategy can be described as having a high bias but a low variance <ref type="bibr" target="#b2">[3]</ref>. Even if the proposed system cannot capture all possible stylistic features (bias), changing the available data does not significantly modify the overall performance (variance).</p><p>Moreover, the proposed profiling can be clearly explained because, on the one hand, it is based on a reduced set of features and, on the other, those features are words or punctuation symbols. Thus, the interpretation for the final user is clearer than when working with a huge number of features, when dealing with n-grams of letters, or when combining several similarity measures. The decision can be explained by large differences in the relative frequencies (or probabilities) of frequent words (usually corresponding to functional terms), topical words, or geographical terms. 
We were able to show that there exists a difference in writing style between the genders and between the tested language variety groups.</p><p>To improve the current classifier, we could investigate the effect of other feature selection strategies. In this case, we want to maintain a reduced number of terms while taking more account of the underlying text genre; for example, the emoji frequently used in tweets carry implicit expressions and meanings. Furthermore, we could use external resources to harvest geographical names related to the different countries and regions to facilitate the language variety prediction. As another possible improvement, we could ignore terms appearing only infrequently in a class. One might also try to exploit PAN-specific properties, such as the requirement for equally distributed male/female problems and language variety groups.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>PAN CLEF 2017 corpora statistics.</figDesc><table><row><cell>Corpus</cell><cell>Language Varieties</cell><cell>Training Samples</cell><cell>Training Terms</cell><cell>Testing Samples</cell></row><row><cell>Arabic</cell><cell>Egypt, Gulf, Levantine, Maghrebi</cell><cell>2,400</cell><cell>1,241.8</cell><cell>1,600</cell></row><row><cell>English</cell><cell>Australia, Canada, Great Britain, Ireland, New Zealand, United States</cell><cell>3,600</cell><cell>1,628.5</cell><cell>2,400</cell></row><row><cell>Spanish</cell><cell>Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela</cell><cell>4,200</cell><cell>1,472.3</cell><cell>2,800</cell></row><row><cell>Portuguese</cell><cell>Brazil, Portugal</cell><cell>1,200</cell><cell>1,202.3</cell><cell>800</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Contingency table for a term ω and in a class Γ.</figDesc><table><row><cell></cell><cell>Γ</cell><cell>¬Γ</cell><cell></cell></row><row><cell>ω</cell><cell>a</cell><cell>b</cell><cell>a+b</cell></row><row><cell>¬ω</cell><cell>c</cell><cell>d</cell><cell>c+d</cell></row><row><cell></cell><cell cols="2">a+c b+d</cell><cell>n</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Parameter summarization.</figDesc><table><row><cell>Parameter</cell><cell></cell><cell>First Run</cell><cell>Second Run</cell></row><row><cell>Distance</cell><cell></cell><cell>Canberra</cell><cell>Clark</cell></row><row><cell cols="2">Feature selection method</cell><cell>Gain Ratio</cell><cell>Gain Ratio</cell></row><row><cell>m features</cell><cell>each gender each variety</cell><cell>100 highest 10 lowest 70 highest 10 lowest</cell><cell>100 highest 0 lowest 70 highest 0 lowest</cell></row><row><cell>k neighbors</cell><cell></cell><cell>9 in AR &amp; PT 9 in EN &amp; SP</cell><cell>15 in AR &amp; PT 25 in EN &amp; SP</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4a .</head><label>4a</label><figDesc>Evaluation for the four training collections with the first run.</figDesc><table><row><cell>Language</cell><cell>Joint</cell><cell>Gender</cell><cell>Variety</cell></row><row><cell>Arabic</cell><cell>0.5021</cell><cell>0.6854</cell><cell>0.7175</cell></row><row><cell>English</cell><cell>0.3772</cell><cell>0.6928</cell><cell>0.5411</cell></row><row><cell>Spanish</cell><cell>0.4117</cell><cell>0.6445</cell><cell>0.6419</cell></row><row><cell>Portuguese</cell><cell>0.7600</cell><cell>0.7742</cell><cell>0.9808</cell></row><row><cell>Overall</cell><cell>0.5128</cell><cell>0.6992</cell><cell>0.7203</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4b .</head><label>4b</label><figDesc>Evaluation for the four training collections with the second run.</figDesc><table><row><cell>Language</cell><cell>Joint</cell><cell>Gender</cell><cell>Variety</cell></row><row><cell>Arabic</cell><cell>0.5292</cell><cell>0.6954</cell><cell>0.7375</cell></row><row><cell>English</cell><cell>0.4581</cell><cell>0.7192</cell><cell>0.6392</cell></row><row><cell>Spanish</cell><cell>0.4762</cell><cell>0.6745</cell><cell>0.7169</cell></row><row><cell>Portuguese</cell><cell>0.7850</cell><cell>0.7967</cell><cell>0.9842</cell></row><row><cell>Overall</cell><cell>0.5621</cell><cell>0.7215</cell><cell>0.7695</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5a .</head><label>5a</label><figDesc>Evaluation for the four testing collections with the first run.</figDesc><table><row><cell>Language</cell><cell>Joint</cell><cell>Gender</cell><cell>Variety</cell></row><row><cell>Arabic</cell><cell>0.5119</cell><cell>0.6781</cell><cell>0.7106</cell></row><row><cell>English</cell><cell>0.3879</cell><cell>0.6996</cell><cell>0.5596</cell></row><row><cell>Spanish</cell><cell>0.4464</cell><cell>0.6711</cell><cell>0.6611</cell></row><row><cell>Portuguese</cell><cell>0.7400</cell><cell>0.7625</cell><cell>0.9713</cell></row><row><cell>Overall</cell><cell>0.5216</cell><cell>0.7028</cell><cell>0.7257</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5b .</head><label>5b</label><figDesc>Evaluation for the four testing collection with the second run.</figDesc><table><row><cell>Language</cell><cell>Joint</cell><cell>Gender</cell><cell>Variety</cell></row><row><cell>Arabic</cell><cell>0.5206</cell><cell>0.6913</cell><cell>0.7188</cell></row><row><cell>English</cell><cell>0.4650</cell><cell>0.7163</cell><cell>0.6521</cell></row><row><cell>Spanish</cell><cell>0.4971</cell><cell>0.6846</cell><cell>0.7211</cell></row><row><cell>Portuguese</cell><cell>0.7575</cell><cell>0.7788</cell><cell>0.9725</cell></row><row><cell>Overall</cell><cell>0.5601</cell><cell>0.7178</cell><cell>0.7661</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6 .</head><label>6</label><figDesc>Evaluation over all four test collections.</figDesc><table><row><cell>Approach</cell><cell>Average Gender</cell><cell>Average Variety</cell><cell>Average Joint</cell><cell>Overall</cell></row><row><cell>Basile et al.</cell><cell>0.8253</cell><cell>0.9184</cell><cell>0.8361</cell><cell>0.8599</cell></row><row><cell>LDR baseline</cell><cell>0.7325</cell><cell>0.9187</cell><cell>0.7750</cell><cell>0.8087</cell></row><row><cell cols="2">Kocher &amp; Savoy 0.7178</cell><cell>0.7661</cell><cell>0.6813</cell><cell>0.7217</cell></row><row><cell>PAN average</cell><cell>0.6561</cell><cell>0.7099</cell><cell>0.6333</cell><cell>0.6664</cell></row><row><cell>BOW baseline</cell><cell>0.6763</cell><cell>0.6907</cell><cell>0.6195</cell><cell>0.6622</cell></row><row><cell>STAT baseline</cell><cell>0.5000</cell><cell>0.2649</cell><cell>0.2991</cell><cell>0.3547</cell></row><row><cell cols="4">1 http://pan.webis.de/clef17/pan17-web/author-profiling.html</cell><cell></cell></row><row><cell cols="2">2 http://www.tira.io/task/author-profiling/</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 7 .</head><label>7</label><figDesc>Top 10 terms selected using mutual information</figDesc><table><row><cell cols="2">Category</cell><cell cols="3">Top terms (space separated)</cell></row><row><cell></cell><cell>Female</cell><cell>‫إنى‬</cell><cell>‫عارفه‬</cell><cell>‫حبيبتي‬ ‫ماما‬ ‫عايزه‬ ‫سلمى‬ ‫عايزة‬ ‫عارفة‬</cell></row><row><cell>Arabic</cell><cell>Male Egypt Gulf</cell><cell cols="3">liked ‫حان‬ video ‫شرح‬ ٰ ‫مدريد‬ ‫تغريدة‬ ‫حازم‬ ‫الدوري‬ ‫مدرب‬ ‫ده‬ ‫بقت‬ ‫دلوقتي‬ ‫النهاردة‬ ‫انتى‬ ‫يعنى‬ ‫كام‬ ‫تانى‬ ‫اللى‬ ‫دى‬ ‫كفو‬ ‫بو‬ ٰ ‫مافيه‬ ‫فيني‬ ‫دايم‬ ‫الكويت‬ ‫محد‬ ‫شلون‬ ‫الحين‬</cell></row><row><cell></cell><cell>Levantine</cell><cell cols="3">‫منيح‬ ‫بده‬ ‫هأل‬ ‫حدا‬ ‫هيك‬ ‫هاي‬ ‫ردن‬</cell><cell>‫األ‬ ‫بدك‬ ‫االردن‬ ‫اشي‬</cell></row><row><cell></cell><cell>Maghrebi</cell><cell>‫تاع‬</cell><cell>‫معجزة‬</cell><cell>‫التونسي‬ ‫#صالة‬ ‫تونس‬ ‫هدا‬ ‫الجزائر‬ ‫طرابلس‬ ‫ليبيا‬</cell></row><row><cell></cell><cell>Female</cell><cell cols="3">leo taurus virgo xxx</cell><cell>makeup xx bingo</cell></row><row><cell></cell><cell>Male</cell><cell cols="3">)' badge arsenal earned league microsoft wire players developer rangers</cell></row><row><cell></cell><cell>Australia</cell><cell cols="3">canberra turnbull sydney aust abbott malcolm jp adelaide scarlet aus</cell></row><row><cell>English</cell><cell>Canada GB Ireland</cell><cell cols="3">vancouver toronto canadians canadian edinburgh filthy glasgow factual unlimited reuters mural bristol 220 nhl txt canvas rsvp drafted gems dublin ireland commented irish scorpio jedward byrne conor capricorn belfast</cell></row><row><cell></cell><cell>NZ</cell><cell cols="3">wellington auckland nz kiwi zealand dunedin earthquake )' roundup</cell></row><row><cell></cell><cell>US</cell><cell cols="3">gorsuch emerald dems ethan scotus dc aca obamacare infamous nsc</cell></row><row><cell></cell><cell>Venezuela</cell><cell cols="3">mud zulia vzla 
caracas chavista an venezolanos medicinas chavismo hampa</cell></row><row><cell>Portuguese</cell><cell>Female Male Brazil Portugal</cell><cell cols="3">sozinha cansada obrigada ranking achavam simpático apaixonada acordada link eduardo | obrigado milhões by • ): jogador campeonato enviadas tô fazendo vc você kkkkk at kkkkkk brasil querendo assistir tou portugal isto cenas crlh gira xd merdas percebo lol</cell></row></table><note>Spanish Female ♡ orgullosa cansada pedidos nosotras angie dormida siiii celosa Male dt jugó rival refuerzos delantero clubes colo cont zidane libertadores Argentina posta hs podes vos orto lpm pelotuda bue pelotudo macri Chile wn piñera colo lagos incendios po metropolitana forestales viña chilenos Colombia bogotá bogota uribe corridas boletas falcao lleras cali plebiscito medellín Mexico neta mx éxico monterrey pinches duarte hidalgo slim pri Peru ppk lima peruanos soles ptm perú peru oe muni peruano Spain psoe os vosotros enhorabuena goya pp rajoy vuestro alcalá cataluña</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">Some terms depend on the context in which they are used and can't be translated accurately.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. The author wants to thank the task coordinators for their valuable effort to promote test collections in author profiling. This research was supported, in part, by the NSF under Grant #200021_149665/1.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Delta: A Measure of Stylistic Difference and a Guide to Likely Authorship</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Burrows</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="267" to="287" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Burrows</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 35th International ACM SIGIR Conference</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Hersh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Maarek</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1125" to="1126" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<title level="m">The Elements of Statistical Learning: Data Mining, Inference, and Prediction</title>
				<meeting><address><addrLine>New York (NY)</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
		<title level="m">The Secret Life of Pronouns. What our Words Say about us</title>
				<meeting><address><addrLine>New York (NY)</addrLine></address></meeting>
		<imprint>
			<publisher>Bloomsbury Press</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">CLEF. Lecture Notes in Computer Science</title>
		<editor>Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., &amp; Toms, E.</editor>
		<imprint>
			<biblScope unit="volume">8685</biblScope>
			<biblScope unit="page" from="268" to="299" />
			<date type="published" when="2014">2014</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2017 Labs and Workshops</title>
		<title level="s">Notebook Papers. CEUR Workshop Proceedings</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Comparative Evaluation of Term Selection Functions for Authorship Attribution</title>
		<author>
			<persName><forename type="first">J</forename><surname>Savoy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="246" to="261" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Machine Learning in Automatic Text Categorization</title>
		<author>
			<persName><forename type="first">F</forename><surname>Sebastiani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="27" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
