<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multimodal Author Profiling for Twitter Notebook for PAN at CLEF 2018</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Braja</forename><forename type="middle">Gopal</forename><surname>Patra</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Biostatistics and Data Science</orgName>
								<orgName type="department" key="dep2">School of Public Health</orgName>
								<orgName type="institution">University of Texas Health Science Center (UTHealth)</orgName>
								<address>
									<settlement>Houston</settlement>
									<region>TX</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Kumar</forename><surname>Gourav Das</surname></persName>
							<email>kumargouravdas18@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science &amp; Engineering</orgName>
								<orgName type="institution">Future Institute of Engineering &amp; Management</orgName>
								<address>
									<settlement>Kolkata</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Department of Computer Science &amp; Engineering</orgName>
								<orgName type="institution">Jadavpur University</orgName>
								<address>
									<settlement>Kolkata</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dipankar</forename><surname>Das</surname></persName>
						</author>
						<title level="a" type="main">Multimodal Author Profiling for Twitter Notebook for PAN at CLEF 2018</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6B1F0CE0479A519A68473F08F44CAC6B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Author profiling</term>
					<term>gender detection</term>
					<term>latent semantic analysis</term>
					<term>latent dirichlet allocation</term>
					<term>word embeddings</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Author profiling is attracting growing interest both within and outside academia. Author profiling (or author analysis) deals with identifying information about an author from text, based on stylistic choices. It helps in identifying author-related information such as gender, age, native language, personality, and demographics. Thus, author profiling is both challenging and important. This paper describes the systems submitted to the author profiling task at PAN-2018, which use the multimodal (textual and image) Twitter datasets provided by the organizers; the aim is to identify the author's gender. An image captioning system was used to extract captions from images. The main features, namely latent semantic analysis, word embeddings, and stylistic features, were extracted from tweets as well as captions. The proposed multimodal author profiling systems obtained classification accuracies of 0.7680, 0.7737, and 0.7709 for Arabic, English, and Spanish, respectively, using a support vector machine.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Social media has become an integral part of human life, and people often spend a lot of time on it. The text they post is prone to noise such as typos and grammatical mistakes. Thus, it is both challenging and necessary to uncover various characteristics of an author from such noisy social media text. Author profiling (AP) is essential in several areas, including marketing, forensic science, and security. For example, from a marketing perspective, it is useful to know details about the authors of blog posts and reviews, so that relevant recommendations can be provided to users. From a forensic linguistics viewpoint, the linguistic profile of the author of an abusive message would be helpful.</p><p>AP has gained importance as a research area over the last decade <ref type="bibr" target="#b16">[17]</ref>. Initially, AP was based only on text data generated by authors <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b24">25]</ref>. PAN-2018 started a new research trend in AP by making both text and image data available. PAN is a series of scientific events and shared tasks on digital text forensics <ref type="foot" target="#foot_0">1</ref> .</p><p>In this paper, we perform gender identification from multimodal Twitter data provided by the organizers of the AP task<ref type="foot" target="#foot_1">2</ref> at PAN-2018. The major focus is on social media text, as we are interested in how everyday language reflects social and personal choices. The organizers provided tweets and photos of users writing in one of three languages, namely Arabic, English, and Spanish. 
The training dataset contains data from 3000 users each for English and Spanish, but only 1500 users for Arabic.</p><p>For the English dataset, we identified several important textual features, including word embeddings and stylistic features. An image captioning system was used to extract captions from images, and the same textual features were then extracted from the captions. In contrast, a language-independent approach was used for the Arabic and Spanish datasets. We computed term frequency-inverse document frequency (TF-IDF) weights of unigrams, then applied singular value decomposition (SVD) to the TF-IDF vectors to reduce sparsity. Finally, latent semantic analysis (LSA) was used on the reduced vectors to get the final feature vectors. A support vector machine (SVM) was used for classification.</p><p>The rest of the paper is organized as follows. Section 2 briefly discusses related work. Section 3 provides an overview of the data, features, system architecture, and techniques used in the experiments. Section 4 presents a detailed analysis of the results. Finally, conclusions and future directions are given in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>AP focuses on the prediction of demographics and psychometric traits (age, gender, native language, personality, religion) of an author using stylistic and content-based features. AP has many applications in academic research, marketing, security, and forensic analysis. Initially, research on AP was conducted on English <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b24">25]</ref> and later gained popularity in other languages such as Dutch <ref type="bibr" target="#b12">[13]</ref>, Greek <ref type="bibr" target="#b7">[8]</ref>, Italian <ref type="bibr" target="#b20">[21]</ref>, Spanish <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b24">25]</ref>, and Vietnamese <ref type="bibr" target="#b18">[19]</ref>.</p><p>There has been much research on AP from blogs as well as social media texts. Bayot and Gonçalves <ref type="bibr" target="#b5">[6]</ref> performed age and gender classification on the PAN-2016 AP datasets using TF-IDF scores and word embeddings. Classification was performed using a support vector machine (SVM), and the results showed that TF-IDF worked better than word2vec for age classification, while word2vec performed better for gender classification. Akhtyamova et al. <ref type="bibr" target="#b0">[1]</ref> used word embeddings with logistic regression for the AP task at PAN-2017. On the other hand, Arroju et al. <ref type="bibr" target="#b3">[4]</ref>, Bartoli et al. <ref type="bibr" target="#b4">[5]</ref>, and Marquardt et al. <ref type="bibr" target="#b13">[14]</ref> used LIWC <ref type="bibr" target="#b26">[27]</ref> for AP at PAN-2017.</p><p>Schler et al. <ref type="bibr" target="#b10">[11]</ref> tried to identify age and gender from the writing style in blogs. The authors used non-dictionary words, parts of speech (POS), function words, and hyperlinks, combined with content features such as the unigrams with the highest information gain, for the AP task. 
Argamon et al. <ref type="bibr" target="#b2">[3]</ref> documented how variation in linguistic characteristics helps identify an author's age and gender. The authors mainly focused on function words with POS features for gender prediction. Holmes et al. <ref type="bibr" target="#b9">[10]</ref> and Burger et al. <ref type="bibr" target="#b6">[7]</ref> performed similar studies by focusing on the extraction of age and gender information from formal text.</p><p>Exhaustive studies performed by Rangel and Rosso <ref type="bibr" target="#b21">[22]</ref> show that age and gender depend on the use of language. They used stylistic features such as frequency, punctuation marks, POS, and emoticons, and obtained the best result with an SVM classifier on the PAN-2013 AP dataset. In another notable work, the same authors took emotions into account for the AP task on tweets <ref type="bibr" target="#b22">[23]</ref>. They used EmoGraph, a graph-based approach, for identifying gender and age on the PAN-2013 AP dataset. In another work, Weren et al. <ref type="bibr" target="#b27">[28]</ref> used information retrieval based features such as information gain and cosine similarity with each category for age and gender identification on the PAN-2013 AP dataset.</p><p>The above survey reveals that a variety of features can be used for AP. Many experiments in AP were performed using content-based features such as slang words, happy-emotion words, sad-emotion words, and sentiment words <ref type="bibr" target="#b22">[23]</ref>. In contrast, stylistic features such as frequency, punctuation, POS, and other statistics were used for AP in <ref type="bibr" target="#b8">[9]</ref>. 
More recently, word embeddings such as word2vec and document embeddings such as Doc2Vec were used as features for AP in addition to bag-of-words and TF-IDF <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>The AP task at PAN started in 2013 and focused on age and gender classification for two languages (English and Spanish). The AP task at PAN-2014 targeted the same languages with four different genres of corpora: social media, blogs, Twitter, and hotel reviews, though the hotel reviews dataset was only available for English. This task also focused on age and gender classification of authors. The AP task at PAN-2015 was extended to four languages (Dutch, English, Italian, and Spanish), and the datasets were collected only from Twitter. The AP task at PAN-2016 focused on gender and age classification, and the corpora contained tweets, reviews, blogs, and other social media for three languages (Dutch, English, and Spanish). The AP task at PAN-2017 focused on Twitter datasets in four languages (Arabic, English, Spanish, and Portuguese) with gender and language variety identification. The AP task at PAN-2018 covers three languages (English, Spanish, and Arabic), and the datasets contain tweets and images from Twitter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dataset</head><p>The AP task at PAN-2018 focused on detecting users' gender from the tweets and photos they shared on Twitter. The organizers provided a training dataset for each of the three languages (Arabic, English, and Spanish). The English and Spanish datasets each contain information on 3000 users (1500 male, 1500 female), while the Arabic dataset contains information on 1500 users (750 male, 750 female). For each user, there are 100 tweets and 10 images. The organizers also provided the test dataset; details can be found in the overview paper of the AP task <ref type="bibr" target="#b23">[24]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Features</head><p>This section describes the features used in our experiments. Feature selection plays an important role in any machine learning framework and depends upon the dataset used for experiments. The features are as follows:</p><p>Stylistic Features: Stylistic features are important and have been used extensively in AP tasks <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b21">22]</ref>. The numbers of stop words, punctuation marks, happy and sad smileys, tweets or retweets, hashtags, and slang words were considered in the present study. The stop word lists for all three languages were collected from the nltk corpus <ref type="foot" target="#foot_2">3</ref> . The slang word list was manually prepared only for English.</p><p>Word Embeddings based Features: Recently, word embeddings have gained popularity in text mining and information retrieval, and they have been used in several tasks including AP <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b14">15]</ref>. For the present study, word vector representations were obtained using the GloVe model <ref type="bibr" target="#b17">[18]</ref> (global vectors for word representation). The pre-trained GloVe model offers two advantages over the traditional word2vec model for this task. First, it is trained on 2 billion tweets, and second, it offers a choice of feature-space dimensions. GloVe delivers a single feature vector for each of the words in a tweet, and those word vectors were converted to tweet vectors ($\vec{t}_i$) using equation 1. 
Finally, the tweet vectors ($\vec{t}_i$) were added together to create a single user vector as in equation 2.</p><formula xml:id="formula_0">\vec{t}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \vec{w}_{ij}<label>(1)</label></formula><formula xml:id="formula_1">\vec{U}_k = \sum_{i=1}^{100} \vec{t}_i<label>(2)</label></formula><p>where $\vec{t}_i$ is the tweet vector for the $i$-th tweet, $\vec{w}_{ij}$ is the vector of the $j$-th word in the $i$-th tweet, $N_i$ is the number of words of the $i$-th tweet that are present in GloVe, and $\vec{U}_k$ is the $k$-th user vector. Word vectors of dimensions 100 and 200 were used for image captions and tweets. We used word embeddings only for the English dataset due to the availability of pre-trained models on tweets.</p><p>Latent Semantic Analysis: LSA is a technique for creating a vector representation of a document, similar to document embeddings or Doc2Vec. It has been successfully used for several applications in Natural Language Processing (NLP), including AP <ref type="bibr" target="#b1">[2]</ref>. The steps for implementing LSA on the set of tweets belonging to a user are as follows. Initially, we calculated the TF-IDF vector for the tweets of each user and then applied singular value decomposition (SVD) to the TF-IDF vectors to reduce dimensionality. Finally, we applied LSA to get the final vector for each user from the 100 tweets. This feature was used to convert tweets to feature vectors for Arabic and Spanish, and to convert hashtags to feature vectors for all three languages.</p><p>Topic Words: It is useful to collect topic words, which describe the whole document in a few words. We used Latent Dirichlet Allocation (LDA) to collect the important words for a single user; the LDA implementation in gensim <ref type="foot" target="#foot_3">4</ref> was used in the experiments. For a single user, we collected three topics containing 10 words each from the 100 tweets. Topic words were converted to feature vectors using either word embeddings (GloVe) or LSA. This feature was used for all three training datasets.</p><p>Hashtags: Hashtags are informative on microblogs such as Twitter. 
A total of 1777, 48292, and 35018 unique hashtags were present in the training datasets of Arabic, English, and Spanish, respectively. Thus, the extensive use of hashtags in the training datasets (except Arabic) motivated us to use them as a feature. We used LSA to get a feature vector from all hashtags used by a single user. We used this feature for all three languages.</p><p>Image Captions: Several state-of-the-art image captioning systems using deep learning are available nowadays. We used an existing image captioning system by Tsutsui and Crandall <ref type="bibr" target="#b25">[26]</ref> to extract the information present in images. This system provides a detailed caption for every image and can generate captions in Chinese, English, and Japanese. For the present task, we only considered captions in English. LDA was used to identify topic words from the image captions. The image captions and topic words were converted to feature vectors using either LSA or word embeddings (GloVe).</p></div>
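<div xmlns="http://www.tei-c.org/ns/1.0"><p>Equations 1 and 2 can be illustrated with a minimal Python sketch. It is not the code used in the experiments: the embedding table, the tokens, and the function names tweet_vector and user_vector are hypothetical and chosen only for illustration, and real GloVe vectors have 100 or 200 dimensions. The sketch assumes that a tweet with no token in the embedding table is mapped to a zero vector.</p>
<p><code lang="python"><![CDATA[
```python
from typing import Dict, List

Vector = List[float]


def tweet_vector(tokens: List[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Equation 1: average the vectors of the tokens found in the embedding table."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        # Assumption: no covered token means the tweet becomes a zero vector.
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def user_vector(tweets: List[List[str]], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Equation 2: sum the tweet vectors of one user."""
    total = [0.0] * dim
    for tokens in tweets:
        t_vec = tweet_vector(tokens, embeddings, dim)
        total = [a + b for a, b in zip(total, t_vec)]
    return total


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"good": [1.0, 0.0], "day": [0.0, 1.0], "match": [1.0, 1.0]}
toy_tweets = [["good", "day"], ["match", "#win"], ["#nofilter"]]
toy_user = user_vector(toy_tweets, toy_embeddings, dim=2)
print(toy_user)  # [1.5, 1.5]
```
]]></code></p>
<p>In this toy example, the third tweet contains only a hashtag that is missing from the embedding table, so it contributes nothing to the user vector.</p></div>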
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">System Architecture</head><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the detailed architecture of the text-based AP system for English. This system relied mainly on GloVe word embeddings, which were used to convert topic words and tweet tokens into separate feature vectors. We used different dimensions for different word embedding modules. We performed 10-fold cross-validation on the training dataset with different GloVe vector sizes, and the maximum accuracy was obtained with 200 dimensions for tweets and 100 dimensions for topic words. Thus, we used these settings for all experiments. Further, hashtags were converted to feature vectors by sequentially applying TF-IDF, SVD, and LSA, as described in Section 3.2. We generated a 25-dimensional feature vector from all hashtags of a single user.</p><p>Figure <ref type="figure" target="#fig_1">2</ref> describes the detailed architecture of the text-based AP systems for Arabic and Spanish. For these two languages, no word embeddings pre-trained on tweets were available, so word embeddings were not an option. We calculated TF-IDF for unigrams, reduced the vector size using SVD, and then used LSA to get the final feature vectors, as described in Section 3.2. We identified topic words from the tweets of a single user and applied the same method to get feature vectors from the topic words. The feature vector for hashtags was extracted using the same method as for English. We generated a 200-dimensional feature vector from tweets, a 100-dimensional feature vector from topic words, and a 25-dimensional feature vector from hashtags. The image captioning system generates captions in English. 
Figure <ref type="figure" target="#fig_2">3</ref> describes the architecture of the image-based AP system for English, while Figure <ref type="figure" target="#fig_3">4</ref> describes the architecture of the image-based AP systems for Arabic and Spanish. For the English AP system, we collected captions for all images of a single user. We converted all the words (except stop words) into word vectors using GloVe. We identified topic words using LDA and then converted each topic word into a word vector using GloVe. The word vectors were combined using equation 1 to get a single vector for a single user. For the Arabic and Spanish AP systems, we collected captions for all images of a single user and then collected all the words (except stop words). We also identified topic words from these captions. Finally, we converted the words and the topic words into two 100-dimensional feature vectors using LSA, as described in Section 3.2. Because the image captions are in English, we could have used GloVe for all three languages. However, we wanted the AP systems for Arabic and Spanish to be language-independent; thus, features were extracted from captions with a different method from that of the English AP system. For image captions, a 200-dimensional feature vector was generated from both captions and topic words. We also developed three multimodal AP systems, one for each dataset. For the multimodal systems, text and image features were combined.</p></div>
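<div xmlns="http://www.tei-c.org/ns/1.0"><p>The TF-IDF, SVD, and LSA steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code used in the experiments: it treats the projection of the TF-IDF rows onto the top-k right singular vectors as the LSA representation, it uses a smoothed IDF weighting that the text does not specify, and it approximates the singular vectors by power iteration so that no external library is needed. All function names and the toy documents are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math
from collections import Counter
from typing import List

Matrix = List[List[float]]


def tfidf_matrix(docs: List[List[str]]) -> Matrix:
    """Build a TF-IDF matrix: one row per document, one column per unigram."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times a smoothed IDF (an assumed weighting scheme).
        rows.append([counts[tok] * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
                     for tok in vocab])
    return rows


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_right_singular_vectors(matrix: Matrix, k: int, n_iter: int = 300) -> Matrix:
    """Approximate the top-k right singular vectors by power iteration on A^T A."""
    n_cols = len(matrix[0])
    basis: Matrix = []
    for c in range(k):
        v = [1.0 + ((j + c) % 3) for j in range(n_cols)]  # deterministic start
        found = True
        for _ in range(n_iter):
            av = [dot(row, v) for row in matrix]  # A v
            w = [sum(row[j] * a for row, a in zip(matrix, av))
                 for j in range(n_cols)]  # A^T (A v)
            for b in basis:  # Gram-Schmidt: remove directions already found
                proj = dot(w, b)
                w = [x - proj * y for x, y in zip(w, b)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # fewer than k independent directions remain
                found = False
                break
            v = [x / norm for x in w]
        if not found:
            break
        basis.append(v)
    return basis


def lsa_vectors(docs: List[List[str]], k: int) -> Matrix:
    """Project each TF-IDF row onto the top-k right singular vectors."""
    matrix = tfidf_matrix(docs)
    basis = top_right_singular_vectors(matrix, k)
    return [[dot(row, v) for v in basis] for row in matrix]


# Hypothetical documents: each list stands for all unigrams of one user.
toy_docs = [
    ["football", "match", "goal"],
    ["football", "goal", "team"],
    ["recipe", "cake", "sugar"],
    ["cake", "sugar", "recipe"],
]
toy_vectors = lsa_vectors(toy_docs, k=2)
for vec in toy_vectors:
    print([round(x, 3) for x in vec])
```
]]></code></p>
<p>Here k plays the role of the target dimension (200, 100, or 25 in the systems described above). In practice, a library routine for truncated SVD would replace the power iteration.</p></div>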
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results and Discussion</head><p>Initially, several classifiers implemented in scikit-learn <ref type="foot" target="#foot_4">5</ref> , such as Decision Tree, Random Forest, and SVM, were evaluated using 10-fold cross-validation on the training datasets. The SVM classifier outperformed all other classifiers. Thus, we used SVM for developing all AP systems using text, images, and the combination of both. We used the linear kernel for all the experiments. All AP systems were evaluated based on accuracy.</p><p>We submitted the trained models and the feature extraction code to the TIRA virtual machine<ref type="foot" target="#foot_5">6</ref>  <ref type="bibr" target="#b19">[20]</ref>. TIRA provides a means for evaluation as a service <ref type="foot" target="#foot_6">7</ref> . The system performances were calculated on the test dataset in TIRA, and the organizers provided the accuracies of the systems.</p></div>
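<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation protocol, a linear SVM assessed by 10-fold cross-validation, can be sketched as follows. The experiments used the SVM implementation in scikit-learn; the sketch below swaps in a dependency-free stand-in, a Pegasos-style stochastic subgradient solver for the hinge loss without a bias term, so that it runs with the standard library alone. The folds are not stratified, and the function names, hyperparameters, and toy data are assumptions made only for illustration.</p>
<p><code lang="python"><![CDATA[
```python
import random
from typing import List


def train_linear_svm(xs: List[List[float]], ys: List[int],
                     lam: float = 0.01, epochs: int = 20, seed: int = 0) -> List[float]:
    """Pegasos-style subgradient descent on the regularized hinge loss (no bias)."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = ys[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (regularization)
            if margin < 1.0:  # margin violated: step towards the example
                w = [wj + eta * ys[i] * xj for wj, xj in zip(w, xs[i])]
    return w


def predict(w: List[float], x: List[float]) -> int:
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


def k_fold_accuracy(xs: List[List[float]], ys: List[int], k: int = 10, seed: int = 0) -> float:
    """Mean accuracy over k folds; each fold is held out once for testing."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        w = train_linear_svm([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(1 for i in held_out if predict(w, xs[i]) == ys[i])
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical, linearly separable toy features (label +1 or -1).
toy_xs = ([[1.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(10)]
          + [[-1.0 - 0.1 * i, -2.0 + 0.1 * i] for i in range(10)])
toy_ys = [1] * 10 + [-1] * 10
cv_accuracy = k_fold_accuracy(toy_xs, toy_ys, k=10)
print(cv_accuracy)  # 1.0
```
]]></code></p>
<p>The toy data are separable through the origin, so every fold is classified perfectly; real feature vectors would of course yield lower accuracies.</p></div>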
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Results</head><p>Initially, text and image features were used separately for classification. Later, multimodal systems were developed using a combination of text and image features. The accuracies of the text-based, image-based, and multimodal AP systems for all three languages are presented in Table <ref type="table" target="#tab_0">1</ref>. Among the text-based AP systems, the highest accuracy (0.7586) was obtained for Spanish. Among the image-based AP systems, the highest accuracy (0.6918) was also obtained for Spanish. However, the multimodal AP system for Spanish improved on the text-based system by only 0.0123, the smallest gain among the three languages, and the main reason may be the curse of dimensionality. Among the multimodal AP systems, the highest accuracy (0.7737) was obtained for English.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Discussion</head><p>The multimodal systems for all three languages outperformed the unimodal systems developed using either text or images. This shows the advantage of multimodal data over traditional unimodal data. The Arabic dataset contains data from 1500 users, compared to 3000 users for the other languages, and this may be the main reason why the AP systems for Arabic have the lowest accuracy among all languages. For many tweets in the English dataset, no words were found in GloVe, which resulted in a zero vector for those tweets. These tweets mostly contain emoticons, hashtags, or misspelled words. This may be one of the reasons for the low accuracy of the text-based AP system for the English dataset.</p><p>Our system ranked 12th among 23 participants in the AP task at PAN, with an average accuracy of 0.7709 over the three multimodal AP systems. The highest average accuracy, 0.8198, was obtained by the takahashi18 team. Our AP systems for Arabic and Spanish were each ranked 10th, and our system for English was ranked 16th. The best multimodal accuracies per language were 0.8180, 0.8584, and 0.8200 for Arabic, English, and Spanish, obtained by the miranda18, takahashi18, and daneshvar18 teams, respectively.</p></div>
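<div xmlns="http://www.tei-c.org/ns/1.0"><p>The average accuracy quoted above and the gain of each multimodal system over its text-based counterpart can be recomputed from Table 1 with a few lines of Python. The variable names are hypothetical; the numbers are copied from the table.</p>
<p><code lang="python"><![CDATA[
```python
# Accuracies from Table 1 (text, image, and combined features).
table_1 = {
    "Arabic": {"text": 0.7430, "image": 0.6570, "combined": 0.7680},
    "English": {"text": 0.7558, "image": 0.6747, "combined": 0.7737},
    "Spanish": {"text": 0.7586, "image": 0.6918, "combined": 0.7709},
}

# Average accuracy of the three multimodal systems.
mean_combined = sum(row["combined"] for row in table_1.values()) / len(table_1)

# Gain of the multimodal system over the text-based system, per language.
gain_over_text = {lang: round(row["combined"] - row["text"], 4)
                  for lang, row in table_1.items()}

print(round(mean_combined, 4))  # 0.7709
print(gain_over_text)  # {'Arabic': 0.025, 'English': 0.0179, 'Spanish': 0.0123}
```
]]></code></p>
<p>The gain is largest for Arabic and smallest for Spanish, although Spanish has the strongest unimodal systems.</p></div>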
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We presented AP systems to identify the gender of users from Arabic, English, and Spanish multimodal datasets. Among the three languages, the multimodal AP system for English achieved the highest accuracy.</p><p>LSA worked well for the Arabic and Spanish AP systems; it will be interesting to apply LSA to the English dataset as well. In the future, we will perform experiments with different word and document embeddings on all datasets. Other language-independent approaches, such as n-grams, can also be implemented later. We are also planning to implement different deep learning models for gender detection.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. Architecture of text-based AP system for English dataset</figDesc><graphic coords="5,142.12,547.66,331.12,92.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. Architecture of text-based AP systems for Arabic and Spanish datasets</figDesc><graphic coords="6,137.99,291.32,339.37,90.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. Architecture of image-based AP system for English dataset</figDesc><graphic coords="6,134.77,422.47,348.98,90.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 .</head><label>4</label><figDesc>Figure 4. Architecture of image-based AP systems for Arabic and Spanish datasets</figDesc><graphic coords="6,134.77,554.14,348.50,85.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Accuracies of AP systems developed on English, Spanish and Arabic languages using different feature categories</figDesc><table><row><cell>Languages</cell><cell>Text Features</cell><cell>Image Features</cell><cell>Combined Features</cell></row><row><cell>Arabic</cell><cell>.7430</cell><cell>.6570</cell><cell>.7680</cell></row><row><cell>English</cell><cell>.7558</cell><cell>.6747</cell><cell>.7737</cell></row><row><cell>Spanish</cell><cell>.7586</cell><cell>.6918</cell><cell>.7709</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://pan.webis.de/index.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://pan.webis.de/clef18/pan18-web/author-identification.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.nltk.org/book/ch02.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://radimrehurek.com/gensim/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://scikit-learn.org/stable/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://www.tira.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://www.uni-weimar.de/en/media/chairs/computer-sciencedepartment/webis/research/activities-by-field/tira/#c41469</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Twitter author profiling using word embeddings and logistic regression -notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">L</forename><surname>Akhtyamova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cardiff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ignatov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2017 Conference</title>
				<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">INAOE&apos;s participation at PAN&apos;15: Author profiling task -notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Alvarez-Carmona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>López-Monroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-Y Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Villasenor-Pineda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Escalante</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Gender, genre, and writing style in formal written texts</title>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Shimoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Text -Interdisciplinary Journal for the Study of Discourse</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="321" to="346" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Age, gender and personality recognition using tweets in a multilingual setting -notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">M</forename><surname>Arroju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Farnadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2015 Conference</title>
				<meeting><address><addrLine>Toulouse, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An author profiling approach based on language-dependent content and stylometric features -notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bartoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>De Lorenzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Laderchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Medvet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tarlao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2015 Conference</title>
				<meeting><address><addrLine>Toulouse, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Multilingual author profiling using word embedding averages and SVMs</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bayot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gonçalves</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">10th International Conference on Software, Knowledge, Information Management &amp; Applications (SKIMA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="382" to="386" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Discriminating gender on Twitter</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Burger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zarrella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1301" to="1309" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Authorship attribution and gender identification in Greek blogs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dang Duc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Giang Binh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Son Bao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">8th International Conference on Quantitative Linguistics (QUALICO)</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Author profiling for English emails</title>
		<author>
			<persName><forename type="first">D</forename><surname>Estival</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gaustad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hutchinson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">10th Conference of the Pacific Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="263" to="272" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">The Handbook of Language and Gender</title>
		<author>
			<persName><forename type="first">J</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Meyerhoff</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>John Wiley &amp; Sons</publisher>
			<biblScope unit="volume">25</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Computational methods in authorship attribution</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Association for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="9" to="26" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Distributed representations of sentences and documents</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1188" to="1196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Adapting cross-genre author profiling to language and corpus - notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Gelbukh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2016 Conference</title>
				<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="947" to="955" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Age and gender identification in social media</title>
		<author>
			<persName><forename type="first">J</forename><surname>Marquardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Farnadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Moens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Davalos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Teredesai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Cock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF 2014 Evaluation Labs</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1129" to="1136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3111" to="3119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Feeling may separate two authors: Incorporating sentiment in authorship identification task</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">G</forename><surname>Patra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bandyopadhyay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Natural Language Processing</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="121" to="126" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Automatic author profiling based on linguistic and stylistic features - notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">G</forename><surname>Patra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Saikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bandyopadhyay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2013 Conference</title>
				<meeting><address><addrLine>Valencia, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Author profiling for Vietnamese blogs</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">B</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Pham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Asian Language Processing</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="190" to="194" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Improving the reproducibility of PAN's shared tasks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Topic models and n-gram language models for author profiling - notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">A</forename><surname>Poulston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stevenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2015 Conference</title>
				<meeting><address><addrLine>Toulouse, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Use of language and author profiling: Identification of gender and age</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">10th International Workshop on Natural Language Processing and Cognitive Science</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="177" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">On the impact of emotions on author profiling</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="73" to="92" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF 2018 Labs and Workshops</title>
		<title level="s">Notebook Papers. CEUR Workshop Proceedings</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Author profiling for English and Spanish text - notebook for PAN at CLEF</title>
		<author>
			<persName><forename type="first">U</forename><surname>Sapkota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ramírez-de-la-Rosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for CLEF 2013 Conference</title>
				<meeting><address><addrLine>Valencia, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tsutsui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Crandall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR Language and Vision Workshop</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">The psychological meaning of words: LIWC and computerized text analysis methods</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">R</forename><surname>Tausczik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Language and Social Psychology</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="24" to="54" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Examining multiple features for author profiling</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">R</forename><surname>Weren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">U</forename><surname>Kauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mizusaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Moreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P M</forename><surname>De Oliveira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Wives</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information and Data Management</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="266" to="279" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
