<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Twitter Author Profiling Using Word Embeddings and Logistic Regression Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Liliya</forename><surname>Akhtyamova</surname></persName>
							<email>liliya.akhtyamova@postgrad.ittdublin.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Technology Tallaght</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">John</forename><surname>Cardiff</surname></persName>
							<email>john.cardiff@it-tallaght.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Technology Tallaght</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrey</forename><surname>Ignatov</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">ETH Zurich</orgName>
								<address>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Twitter Author Profiling Using Word Embeddings and Logistic Regression Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">089B5AFEAF4F2FF3BD625DB0DAE0D5F1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The general goal of the author profiling task is to determine various social and demographic aspects of the author based on his pieces of writing. In this work, we propose an approach that combines word embeddings and classical logistic regression for identifying author gender and language variety based on the corresponding tweets. The model was trained on PAN 2017 Twitter Corpus that contains data for English, Spanish, Portuguese and Arabic languages from more than 11 thousand authors. Due to its simplicity, the proposed solution can be treated as a baseline for both gender and language variety identification subtasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>With the world becoming more digital, the personal data of internet users such as their gender, age, ethnicity or personality type is playing more and more important role in the modern life. This information can be used for providing relevant search results, recommending appropriate connections in social networks, fraud prediction or personalized advertising. While there have been already proposed numerous solutions to tackle this kind of problems, the majority of them were relying on hand-designed features and various heuristics. In this works, we propose a simple fully-automated way of performing author profiling based on text data. The detailed architecture of our system is described in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Models and Methods</head><p>In this section, we give an overview of the proposed method and describe its main components.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Dataset</head><p>In this work, we use PAN 2017 Author Profiling Dataset <ref type="bibr" target="#b1">[2]</ref> published along with the corresponding shared task <ref type="bibr" target="#b0">[1]</ref>. This dataset contains tweets from 11400 users for English, Spanish, Portuguese and Arabic languages. For each tweet, there is an information about his author id, author gender and author language variety. The language variations presented in this dataset are the following:</p><p>• English: Australia, Canada, Great Britain, Ireland, New Zealand, United States • Spanish: Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela</p><formula xml:id="formula_0">• Portuguese: Brazil, Portugal • Arabic: Egypt, Gulf, Levantine, Maghrebi</formula><p>Overall, there are 360K, 420K, 120K and 240K tweets for English, Spanish, Portuguese and Arabic languages, respectively. These tweets are further used as an input data for our algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Input Processing</head><p>In this task, the input to our classification model has the form of the sentence (tweet) T that is treated as an ordered sequence of words: T = {w 1 , w 2 , ..., w N }. First, plain words are mapped to their vector representations using a pre-trained word embedding model, which in our case is word2vec. The resulting representations are summed and averaged to form a single sentence vector M T , which dimensionality d is equal to the dimensionality of word embeddings. This vector is then passed to Logistic Regression classification algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Logistic Regression</head><p>Multinomial Logistic Regression is the linear regression analysis to conduct when the dependent variable is nominal with more than two levels. Thus it is an extension of logistic regression, which analyzes dichotomous (binary) dependents. Like all linear regressions, the multinomial regression is a predictive analysis. Multinomial regression is used to describe data and to explain the relationship between one dependent nominal variable and one or more continuous-level(interval or ratio scale) independent variables.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Alternatives</head><p>In our initial experiments we have additionally tried a number of different methods to solve the considered author profiling problem. Among them were bag-of-words model, that takes into account the multiplicity of the appearing words in each sentence, and CNN-based solution that performs sentence classification based on the sentence matrix <ref type="bibr" target="#b3">[4]</ref>. Besides that, we have also tried various classifiers: Random Forest, Linear Regression, Naive Bayes, SVMs, etc. However, our experiments revealed that these solutions demonstrate the same or worse performance compared to the one proposed in this work, therefore it was chosen for our submission and final evaluation. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments</head><p>To generate word embeddings, we trained a CBOW word2vec model on the initial twitter corpus with a context window of size 5, and a vector dimensionality of 300. A separate word2vec model was trained for each major language (4 models in total), to do this Python gensim [3] library was used. Numpy library was utilized for general array manipulation, and the implementation of Logistic Regression was taken from sklearn machine learning library. The classifier was trained to minimize L 2 loss function, the regularization coefficient was set to C = 0.1.</p><p>The dataset was split into tow subsets: 90% of the data was used for training the model, and the rest 10% -for validation. The final results on the test dataset for each language are presented in the table 1. The proposed model was able to achieve the accuracy of 64% for Arabic language and 68 -74% for the rest languages in the gender identification subtask. The results in the language identification task were highly dependent on the number of predictive classes and the difference between the dialects: 97% for Portuguese (2 classes), 44% for Arabic (4 classes), 58% for English (6 classes) and 80% for Spanish (7 classes). Relatively week results for Arabic language can be explained by the fact that the original hieroglyphs were encoded into unicode, and thus some relevance between similar hieroglyphs was lost.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and Future Work</head><p>In this work, we presented an approach for gender and language variety identification on twitter data, that is based on the combination of word embeddings and logistic regression models. The proposed solution has the benefits of requiring no hand-designed features and being applicable to various nlp domains without a need for modifications in the implementation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>The results of the proposed model on both subtasks for four languages.</figDesc><table><row><cell>Language</cell><cell cols="2">Language variety identification accuracy, % Gender identification accuracy, %</cell></row><row><cell>English</cell><cell>58.13</cell><cell>74.46</cell></row><row><cell>Spanish</cell><cell>80.32</cell><cell>69.46</cell></row><row><cell>Portuguese</cell><cell>97.63</cell><cell>68.50</cell></row><row><cell>Arabic</cell><cell>44.88</cell><cell>64.25</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;17: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<title level="m">Working Notes Papers of the CLEF 2017 Evaluation Labs</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Software Framework for Topic Modelling with Large Corpora</title>
		<author>
			<persName><forename type="first">R</forename><surname>Řehůřek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sojka</surname></persName>
		</author>
		<ptr target="http://is.muni.cz/publication/884893/en" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</title>
				<meeting>the LREC 2010 Workshop on New Challenges for NLP Frameworks<address><addrLine>Valletta, Malta</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2010-05">May 2010</date>
			<biblScope unit="page" from="45" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Twitter sentiment analysis with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Severyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="959" to="962" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
