<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using TF-IDF n-gram and Word Embedding Cluster Ensembles for Author Profiling Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Adam</forename><surname>Poulston</surname></persName>
							<email>arspoulston1@sheffield.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zeerak</forename><surname>Waseem</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mark</forename><surname>Stevenson</surname></persName>
							<email>mark.stevenson@sheffield.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using TF-IDF n-gram and Word Embedding Cluster Ensembles for Author Profiling Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E67733173AB6084D73E6378A7FCE5E3B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users labeled with their gender and the specific variant of their language used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a Logistic Regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived from an additional, external corpus of tweets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Author profiling is the task of determining the characteristics of the individual who wrote a document. Many different characteristics can be determined (e.g. personal characteristics such as gender, age, personality <ref type="bibr" target="#b18">[19]</ref> and socioeconomic indicators <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>) across a variety of media (e.g. written essays, books, blogs and other social media). Despite the ethical concerns they raise, author profiling techniques can be a valuable component in various applications, such as bias reduction in predictive models <ref type="bibr" target="#b1">[2]</ref> and language-variant adaptation in part-of-speech taggers <ref type="bibr" target="#b0">[1]</ref>.</p><p>In this paper, we present our approach to the 2017 edition of the PAN Author Profiling shared task <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b15">16]</ref>. A dataset was provided consisting of Twitter users across four languages and their variants. Each user was labeled with a binary gender label (male/female) and the particular variant of their language (e.g. Brazilian vs European Portuguese). The dataset was balanced by both gender and language variant. Given an unseen user (and their native language), the task is to determine their gender and the language variant they use.</p><p>To predict gender and language variant, we applied an ensemble of probabilistic machine learning classifiers (described in detail in Section 2). First, an external Twitter corpus was acquired, and tweets geo-located within the countries covered by the task's languages were extracted (except for the Arabic language variants).
This corpus was divided into individual languages (Portuguese, English and Spanish) and used to derive Word2Vec word embeddings <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref> for each language. Then, each set of language-specific word embeddings was clustered using K-Means to derive a set of word-to-cluster mappings, which can be thought of as roughly analogous to topics in a topic model. The normalised frequency of each word cluster across a user's tweets was used to train a Gaussian Process classifier. Second, a Logistic Regression classifier was trained using TF-IDF transformed unigram and bigram frequencies. The two classifiers were combined in an ensemble by averaging their predicted probabilities for each sample to determine the label.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Approach</head><p>Our approach combines two probabilistic classifiers trained on distinct feature sets in an ensemble to predict gender and language variant: a Logistic Regression classifier trained on TF-IDF n-grams (Section 2.1) and a Gaussian Process classifier trained on word cluster frequencies (Section 2.2). For each unseen document, the class probabilities from both classifiers are averaged, and the class with the highest average probability is taken as the prediction. Models were trained using the implementations found in scikit-learn <ref type="bibr" target="#b8">[9]</ref> unless stated otherwise.</p><p>For Arabic data, only the Logistic Regression classifier is applied, as the volume of geo-located Arabic tweets collected was too low to train robust Word2Vec models for use with the Gaussian Process classifier.</p></div>
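<div xmlns="http://www.tei-c.org/ns/1.0"><p>The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the submitted code: the function names and the example probabilities are hypothetical, and the fallback to a single classifier mirrors the Arabic case, where only Logistic Regression was applied.</p><p><code lang="python"><![CDATA[
```python
from typing import Dict, Optional


def average_probabilities(
    lr_probs: Dict[str, float],
    gp_probs: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Average per-class probabilities from the available classifiers.

    If gp_probs is None (as for Arabic, where no Gaussian Process model
    was trained), the Logistic Regression probabilities are returned as-is.
    """
    if gp_probs is None:
        return dict(lr_probs)
    return {label: (lr_probs[label] + gp_probs[label]) / 2.0 for label in lr_probs}


def predict_label(
    lr_probs: Dict[str, float],
    gp_probs: Optional[Dict[str, float]] = None,
) -> str:
    """Return the class with the highest averaged probability."""
    averaged = average_probabilities(lr_probs, gp_probs)
    return max(averaged, key=averaged.get)


# Hypothetical probabilities for one Portuguese-language user.
lr = {"brazil": 0.55, "portugal": 0.45}
gp = {"brazil": 0.30, "portugal": 0.70}
print(predict_label(lr, gp))  # prints "portugal" (averages: 0.425 vs 0.575)
```
]]></code></p><p>In this made-up example the two classifiers disagree, and the more confident Gaussian Process prediction determines the ensemble label.</p></div>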
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Logistic regression classifier with TF-IDF n-grams</head><p>Word unigram and bigram features were extracted for each training document. The text was tokenised using a Twitter-aware tokeniser <ref type="bibr" target="#b3">[4]</ref>; no additional steps were taken to deal with the extra complexities of Arabic text. A list of stop words was not used while deriving n-gram features; instead, tokens that appeared in more than 90% of the documents were removed, which filters out n-grams common across a language's variants as well as stop words.</p><p>TF-IDF weighting was applied to down-weight n-grams common across the documents and assign a higher weight to n-grams that are rare.</p><p>A Logistic Regression classifier was trained for each language using the n-gram features. Logistic Regression was chosen for use with the n-gram features because it has been shown to perform well on similar high-dimensional classification tasks, and produces probabilistic predictions <ref type="bibr" target="#b2">[3]</ref>.</p></div>
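<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal scikit-learn sketch of this feature pipeline is given below. It is an illustration under stated assumptions, not the submitted configuration: whitespace tokenisation stands in for the Twitter-aware tokeniser, the toy documents and labels are invented, and the helper name build_ngram_classifier and the max_iter setting are our own choices. The ngram_range and max_df values follow the description above.</p><p><code lang="python"><![CDATA[
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_ngram_classifier():
    """TF-IDF weighted word unigrams and bigrams fed to Logistic Regression."""
    vectorizer = TfidfVectorizer(
        token_pattern=r"\S+",  # assumption: whitespace tokens, not the Twitter-aware tokeniser
        ngram_range=(1, 2),    # word unigrams and bigrams
        max_df=0.9,            # drop n-grams occurring in more than 90% of documents
        stop_words=None,       # no explicit stop word list
    )
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))


# Invented toy documents; each stands in for one user's concatenated tweets.
docs = [
    "the lads watched footy at the pub",
    "the girls love the new colour",
    "the guys watched football at the bar",
    "the girls love the new color",
]
labels = ["gb", "gb", "us", "us"]

model = build_ngram_classifier()
model.fit(docs, labels)

# One row per document, one column per class, each row summing to 1.
probabilities = model.predict_proba(["the girls watched footy"])
```
]]></code></p><p>Because "the" occurs in every toy document, the max_df threshold removes it from the vocabulary, while bigrams that contain it (such as "the lads") are kept because they are rare.</p></div>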
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Gaussian process classifier with word embedding clusters</head><p>We obtained the data for our word embedding clusters from a Twitter Firehose<ref type="foot" target="#foot_0">1</ref> sample collected throughout 2015. We only used tweets that were geo-located in the specific language regions determined by the shared task (see Table <ref type="table" target="#tab_0">1</ref>). Some language variants were less frequent in the resulting datasets than others; for instance, we collected very few tweets from Ireland compared to the U.S.A. Downsampling was used to avoid over-representation of the more prevalent language variants. Data for the language variant with the largest volume of documents was reduced so that it contained no more than 10 times the number of tweets of the smallest language variant.</p><p>Word embeddings for each language dataset (Fen, Fes, and Fpt) were trained using the Word2Vec <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref> implementation in gensim <ref type="bibr" target="#b17">[18]</ref> with Continuous Bag of Words (CBOW), negative sampling, 200 dimensions, and a window size of 10.</p><p>We applied K-Means clustering <ref type="bibr" target="#b5">[6]</ref> to the word embeddings to derive a set of 100 clusters for each language, in which each word is assigned to the cluster whose centroid is nearest in the embedding space.
We then computed the frequency distribution of the clusters for every training document, and used these frequencies as features to train a Gaussian Process classifier with an RBF kernel <ref type="bibr" target="#b16">[17]</ref>.</p><p>Similar word embedding clusters have been applied with Gaussian Processes to perform other author profiling tasks such as socio-economic status detection <ref type="bibr" target="#b4">[5]</ref>; furthermore, the derived clusters are similar to topics derived in a topic model, in that they identify semantically similar groups of words in documents, which we found to perform well in a similar task <ref type="bibr" target="#b11">[12]</ref>.</p></div>
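<div xmlns="http://www.tei-c.org/ns/1.0"><p>The clustering and feature-extraction steps can be sketched as follows. This is a toy illustration, not the submitted code: hand-written two-dimensional vectors stand in for the 200-dimensional Word2Vec embeddings, two clusters stand in for the 100 used per language, and the function names, example users and class labels are all hypothetical. Word2Vec training itself is omitted.</p><p><code lang="python"><![CDATA[
```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF


def cluster_vocabulary(words, vectors, n_clusters, seed=0):
    """Map each word to the index of its nearest K-Means centroid."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assignments = kmeans.fit_predict(np.asarray(vectors, dtype=float))
    return {word: int(cluster) for word, cluster in zip(words, assignments)}


def cluster_frequencies(tokens, word_to_cluster, n_clusters):
    """Normalised frequency of each cluster over one user's tokens.

    Tokens without an embedding (and hence without a cluster) are ignored.
    """
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    features = np.zeros(n_clusters)
    for cluster, count in counts.items():
        features[cluster] = count
    total = features.sum()
    return features / total if total > 0 else features


def train_gp(features, labels):
    """Fit a Gaussian Process classifier with an RBF kernel."""
    classifier = GaussianProcessClassifier(kernel=RBF(length_scale=1.0), random_state=0)
    return classifier.fit(features, labels)


# Toy stand-ins for word embeddings: two well-separated groups of words.
words = ["goal", "match", "team", "cake", "oven", "recipe"]
vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]]
n_clusters = 2
word_to_cluster = cluster_vocabulary(words, vectors, n_clusters)

# Hypothetical users (token lists) with hypothetical class labels.
users = [
    ["goal", "match", "team", "goal"],
    ["match", "team", "cake"],
    ["cake", "oven", "recipe"],
    ["recipe", "oven", "goal", "cake"],
]
labels = [0, 0, 1, 1]

X = np.vstack([cluster_frequencies(u, word_to_cluster, n_clusters) for u in users])
gp = train_gp(X, labels)
proba = gp.predict_proba(X)  # shape: (number of users, number of classes)
```
]]></code></p><p>Each user is thus represented by a vector with one entry per cluster, which is far lower-dimensional than the n-gram representation of Section 2.1.</p></div>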
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results</head><p>Table <ref type="table" target="#tab_1">2</ref> shows the accuracy scores achieved by a Support Vector Machine (SVM) classifier with a linear kernel, trained on the same TF-IDF n-grams described in Section 2.1. We chose this approach as our baseline because it has been shown to perform well on similar tasks. Table <ref type="table" target="#tab_2">3</ref> shows the results of our final submitted run for the PAN: Author Profiling task 2017. For Spanish, English and Portuguese the results were attained by applying the ensemble of Logistic Regression and Gaussian Process classifiers described in Section 2; for Arabic only the Logistic Regression classifier was applied (Section 2.1). In the rankings for the PAN Author Profiling shared task <ref type="bibr" target="#b15">[16]</ref>, our approach achieved 7th place out of 22 entries for joint prediction and 6th for gender, exceeding reported baselines. We achieved poorer results for language variant prediction at 9th place, and did not exceed the baseline approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Discussion</head><p>In Table <ref type="table" target="#tab_2">3</ref>, we see that our ensemble performs quite well for identifying language variant or gender individually. For joint prediction our ensemble performs less well, likely due to errors in either gender or language variant prediction propagating through to incorrect joint predictions. Of the three languages the ensemble was applied to, the best performance was observed for Portuguese and the worst for English. Broad topics of interest appear to be effective for the gender prediction problem, while individual terms that are unique to specific language variants are more discriminating for language variant prediction.</p><p>Similar to our results in a previous PAN: Author Profiling shared task entry <ref type="bibr" target="#b11">[12]</ref>, in which LDA topic models were able to improve predictive performance over word n-grams, word embedding clusters improved predictive accuracy for gender classification. For the language variant differentiation task, introducing the word embedding clusters in fact reduced accuracy scores over earlier runs.</p><p>Under our current clustering scheme, each term was assumed to be as representative of its cluster as every other term; in practice, though, certain terms were closer to the centroid in embedding space than others. Prior to submission we had begun experimenting with weighting terms based on their proximity to their closest centroid, and our initial findings were promising. In future work we would like to investigate the effect of weighting terms in more detail.</p></div>
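<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of error propagation on joint accuracy can be checked with simple arithmetic on the scores in Table 3. A joint prediction is correct only if both individual predictions are correct, so joint accuracy cannot exceed the lower of the two individual accuracies. If we additionally assume that gender and language variant errors are independent (an assumption made here purely for illustration, which we did not test), joint accuracy would be close to the product of the two. The sketch below computes both quantities; the function name is hypothetical.</p><p><code lang="python"><![CDATA[
```python
# Accuracy scores copied from Table 3.
gender = {"Spanish": 0.7939, "English": 0.7829, "Portuguese": 0.8388, "Arabic": 0.7738}
variant = {"Spanish": 0.9368, "English": 0.8038, "Portuguese": 0.9763, "Arabic": 0.7975}
joint = {"Spanish": 0.7471, "English": 0.6254, "Portuguese": 0.8188, "Arabic": 0.6356}


def independent_joint(language):
    """Joint accuracy expected if gender and variant errors were independent."""
    return gender[language] * variant[language]


for language in gender:
    expected = independent_joint(language)
    upper_bound = min(gender[language], variant[language])
    print(
        f"{language}: product {expected:.4f}, "
        f"reported joint {joint[language]:.4f}, upper bound {upper_bound:.4f}"
    )
```
]]></code></p><p>For Spanish, for example, the product is approximately 0.744 against a reported joint accuracy of 0.7471, and for Portuguese approximately 0.819 against 0.8188, so the reported joint scores are roughly what compounding the two individual error rates would predict.</p></div>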
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In this notebook, we have shown that reasonable results can be achieved by employing an ensemble of classifiers and utilising clusters of word embeddings. We propose that our approach could be improved by weighting each term by its distance to its cluster centroid.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Countries scraped for each language.</figDesc><table><row><cell>English (Fen)</cell><cell>Spanish (Fsp)</cell><cell>Portuguese (Fpt)</cell></row><row><cell>Australia</cell><cell>Argentina</cell><cell>Brazil</cell></row><row><cell>Canada</cell><cell>Chile</cell><cell>Portugal</cell></row><row><cell>Great Britain</cell><cell>Colombia</cell><cell></cell></row><row><cell>Ireland</cell><cell>Mexico</cell><cell></cell></row><row><cell>New Zealand</cell><cell>Peru</cell><cell></cell></row><row><cell>United States</cell><cell>Spain</cell><cell></cell></row><row><cell></cell><cell>Venezuela</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Baseline accuracy scores for gender and language variant prediction for each language, derived from an SVM classifier trained on TF-IDF n-grams.</figDesc><table><row><cell>Target</cell><cell>Spanish</cell><cell>English</cell><cell>Portuguese</cell><cell>Arabic</cell></row><row><cell>Gender</cell><cell>0.7361</cell><cell>0.7896</cell><cell>0.8263</cell><cell>0.7450</cell></row><row><cell>Language variant</cell><cell>0.9532</cell><cell>0.8617</cell><cell>0.9800</cell><cell>0.8150</cell></row><row><cell>Joint</cell><cell>0.7007</cell><cell>0.6838</cell><cell>0.8113</cell><cell>0.6275</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3.</head><label>3</label><figDesc>Accuracy scores for gender and language variant prediction for each language as submitted for the PAN: Author Profiling task 2017.</figDesc><table><row><cell>Target</cell><cell>Spanish</cell><cell>English</cell><cell>Portuguese</cell><cell>Arabic</cell></row><row><cell>Gender</cell><cell>0.7939</cell><cell>0.7829</cell><cell>0.8388</cell><cell>0.7738</cell></row><row><cell>Language variant</cell><cell>0.9368</cell><cell>0.8038</cell><cell>0.9763</cell><cell>0.7975</cell></row><row><cell>Joint</cell><cell>0.7471</cell><cell>0.6254</cell><cell>0.8188</cell><cell>0.6356</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Twitter Firehose has since been discontinued and can no longer be accessed.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Demographic dialectal variation in social media: A case study of African-American English</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Blodgett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Green</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'Connor</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016-11">November 2016</date>
			<biblScope unit="page" from="1119" to="1130" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Reducing sampling bias in social media data for county health inference</title>
		<author>
			<persName><forename type="first">A</forename><surname>Culotta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint Statistical Meetings Proceedings</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Freedman</surname></persName>
		</author>
		<title level="m">Statistical models: theory and practice</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Part-of-speech tagging for Twitter: annotation, features, and experiments</title>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mills</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Eisenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heilman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Flanigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Human Language Technologies</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="42" to="47" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Inferring the socioeconomic status of social media users based on behaviour and language</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lampos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aletras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">K</forename><surname>Geyti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Cox</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Some methods for classification and analysis of multivariate observations</title>
		<author>
			<persName><forename type="first">J</forename><surname>MacQueen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifth Berkeley symposium on mathematical statistics and probability</title>
				<meeting>the fifth Berkeley symposium on mathematical statistics and probability<address><addrLine>Oakland, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1967">1967</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="281" to="297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3111" to="3119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;17: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Topic models and n-gram language models for author profiling - Notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">A</forename><surname>Poulston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stevenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">User profiling with geo-located posts and demographic data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Poulston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stevenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016-11">November 2016</date>
			<biblScope unit="page" from="43" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">An analysis of the user occupational class through Twitter content</title>
		<author>
			<persName><forename type="first">D</forename><surname>Preoţiuc-Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lampos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aletras</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</title>
		<title level="s">Long Papers</title>
		<meeting>the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1754" to="1764" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Studying user income through language, behaviour and affect in social media</title>
		<author>
			<persName><forename type="first">D</forename><surname>Preoţiuc-Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Volkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lampos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bachrach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aletras</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PloS one</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page">e0138717</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2017 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Gaussian processes for machine learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Rasmussen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Williams</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>MIT Press, Cambridge</publisher>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Software Framework for Topic Modelling with Large Corpora</title>
		<author>
			<persName><forename type="first">R</forename><surname>Řehůřek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sojka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</title>
		<meeting>the LREC 2010 Workshop on New Challenges for NLP Frameworks<address><addrLine>Valletta, Malta</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2010-05">May 2010</date>
			<biblScope unit="page" from="45" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Personality, gender, and age in the language of social media: The open-vocabulary approach</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">A</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Eichstaedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Kern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dziurzynski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Ramones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Stillwell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E P</forename><surname>Seligman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">H</forename><surname>Ungar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLoS ONE</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page">e73791</biblScope>
			<date type="published" when="2013-09">Sep 2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
