<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Celebrity Profiling on Twitter using Sociolinguistic Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Gabriel Moreno-Sandoval</string-name>
          <email>morenoluis@javeriana.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Puertas</string-name>
          <email>edwin.puertas@javeriana.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flor Miriam Plaza-del-Arco</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandra Pomares-Quimbaya</string-name>
          <email>pomares@javeriana.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Andres Alvarado-Valencia</string-name>
          <email>jorge.alavarado@javeriana.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center of Excellence and Appropriation in Big Data and Data Analytics</institution>
          ,
          <addr-line>CAOBA</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontificia Universidad Javeriana</institution>
          ,
          <addr-line>Bogotá</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Tecnológica de Bolívar</institution>
          ,
          <addr-line>Cartagena</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universidad de Jaén</institution>
          ,
          <addr-line>Jaén, Andalucía</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Social networks have been a revolutionary scenario for celebrities because they allow them to reach a wider audience, and far more frequently, than traditional means. These platforms enable celebrities to improve, or sometimes damage, their careers by building closer relationships with their fans and acquiring new ones. Indeed, networks have promoted the emergence of a new type of celebrity that exists only in the digital world. Being able to characterize the celebrities who are most active on social networks such as Twitter offers an enormous opportunity to identify their real level of fame and their relevance for an age group, gender, or occupation. These facts may enrich decision making, especially in advertising and marketing. To this end, this paper presents a novel strategy for characterizing celebrity profiles on Twitter, based on socio-linguistic features generated from their posts that serve as input to a set of classifiers. Specifically, we produced four classifiers that describe the level of fame, the gender, the birth date, and the possible occupation of a celebrity. We obtained the training and test data sets as part of our participation at PAN 2019 at CLEF. We report the results of each classifier, including an analysis of which features were most relevant, which classification techniques were most useful, and the final precision and recall.</p>
      </abstract>
      <kwd-group>
        <kwd>celebrity profiling</kwd>
        <kwd>socio-linguistic feature</kwd>
        <kwd>user profiling</kwd>
        <kwd>computational linguistic</kwd>
        <kwd>natural language processing</kwd>
        <kwd>author profiling</kwd>
        <kwd>twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Author profiling is a subtask of authorship analysis whose objective is to analyze
shared content in order to predict characteristics of authors such as gender,
age, personality, or native language [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Knowing the profile of an author can be of vital importance in multiple areas.
In marketing, it helps to understand what types of people like or dislike certain products
by analyzing their online reviews [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In safety, it helps to identify psychological traits that make it possible
to detect profiles with abnormal behaviors that may cause harm to other users [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or to
discover fake profiles (one person can hold multiple profiles for fraud and other
misdeeds) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        With the increasing usage of social media and the rapid expansion of user-generated
content, the author profiling task has gained a lot of interest in recent years. It
is a research topic in the natural language processing community for which various
shared tasks have been organized recently. Perhaps one of the best-known shared tasks
is the one organized at PAN [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] at the Conference and Labs of the Evaluation Forum
(CLEF, http://clef-initiative.eu) since 2013. Specifically, the focus has been on gender and age identification.
      </p>
      <p>
        Social media have meant a real revolution for famous people, that is, celebrities
such as artists and sportsmen, among others, who take advantage of Facebook, Twitter, or
other platforms to get closer to their fans and, in turn, gain a new way to earn income
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Studying the profile of each celebrity allows us to extract certain characteristics,
such as the vocabulary they use to refer to their profession, the way of writing, the way
of communicating with their fans, and their possible age or profession.
      </p>
      <p>
        In this paper, we describe our submission as part of our participation at PAN 2019
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] at CLEF. In particular, we have participated in the celebrity profiling task [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. This
is the first year that the task has been organized, and it consists of determining the degree of
fame, occupation, age, and gender of a celebrity, given his/her social media feed. Our
main contribution is to generate and analyze features specific to celebrities on digital
social networks and to incorporate them into different machine learning classifiers.
      </p>
      <p>The rest of the paper is structured as follows. In Section 2, we introduce the related
work. In Section 3, we explain the data set used in our strategy for celebrity
characterization. Section 4 presents the details of the proposed strategy. In Sections 5 and 6, we
discuss the analysis and evaluation results. We conclude in Section 7 with remarks and
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Early research on author profiling focused mainly on formal texts and blogs.
However, today’s researchers focus primarily on social media platforms such as Twitter or
Facebook, where the language is less formal and users post messages continuously [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
The contribution of the different researchers who used the PAN datasets is remarkable.
      </p>
      <p>
        Most of the strategies presented at PAN used combinations of style-based
features, such as frequency of punctuation marks and capitalization, together with
part-of-speech tags and content-based features such as bag of words, dictionary-based words,
topic-based words, entropy-based words, or term frequency-inverse document frequency
(TF-IDF) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In the Author Profiling Task at PAN 2017 and PAN 2018, more
participants employed deep learning techniques, which perform automatic feature selection.
However, in the gender and language variety subtasks, the best performances belonged
to a logistic regression classifier with combinations of character, word, and POS
n-grams, emojis, sentiments, and character flooding, and to an SVM trained with combinations of
character and TF-IDF n-grams [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Basile et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used word and character
n-grams: they extracted character 3- to 5-grams and word unigrams and bigrams
with TF-IDF weighting. Authors in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] combined part-of-speech (POS) tag n-grams
with syntactic dependencies to model the use of amplifiers, verbal constructions,
pronouns, subjects and objects, types of adverbials, as well as the use of interjections and
profanity. The authors in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used the counts of stopwords, punctuation marks,
emoticons, and slang words.
      </p>
      <p>
        Copland et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that by studying personal pronouns, especially
the use of "me" and "us", it is possible to identify important sociolinguistic variables.
These variables can be associated with the social status of a person and the school
the person comes from (e.g., Christian, Lutheran). Hence, some of the approaches we
applied to identify features of celebrities take into account the use of personal pronouns
in the texts.
      </p>
      <p>
        Regarding the machine learning approaches, the most commonly used classifiers
have been Logistic Regression [
        <xref ref-type="bibr" rid="ref12 ref8">12,8</xref>
        ], Support Vector Machines [
        <xref ref-type="bibr" rid="ref1 ref19">19,1</xref>
        ], Multilayer
Perceptron [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and distance-based methods.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data Description</title>
      <p>
        The training dataset of the celebrity profiling task at PAN 2019 [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] consists of English
tweets from 48335 user profiles, with an average of 2181 tweets per user, labeled with
the following attributes: degree of fame, occupation, age, and gender. Table 1 provides details
about some attributes of the dataset. The task is to predict these four traits of a celebrity from
their social media communication.
      </p>
    </sec>
    <sec id="sec-4">
      <title>System Description</title>
      <p>In this section, we describe the predictive model used in our submission. The model
used for the task of profiling celebrities at PAN 2019 was designed to identify four
types of classes: profession, gender, fame and year of birth. In accordance with the
characteristics of the data set and the goals of the task we defined four hypotheses,
which are described in detail in Table 2.</p>
      <p>In addition, for each of the hypotheses, two types of strategies were used. The first
strategy relies on the vocabulary, that is, the words of all tweets. For the second
strategy, tweet statistics were generated per user profile to determine the overall use of
words, hashtags, mentions, URLs, and emojis.</p>
      <p>On the basis of the proposed hypotheses and strategies, the "Training System" was
designed. Figure 1 shows the proposed system for profiling celebrities, which consists of
the following stages: preprocessing, normalization and transformation, feature
extraction, settings and classifiers, and testing.</p>
      <sec id="sec-4-1">
        <title>Preprocessing</title>
        <p>In the preprocessing stage, we concatenate the vocabulary of each user’s tweets
in order to have only one document per user profile. In addition, special tokens are
re-labeled: hashtags are replaced with the word "label_hashtag", mentions
with the word "label_mention", URLs with the word "label_url", and emojis (identified by
their UTF-8 encoding) with the word "label_emoji". Finally, the re-labeled words
are searched for and counted globally.</p>
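        <p>The re-labeling step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the regular expressions, the emoji code-point ranges, and the function names relabel and build_profile_document are assumptions, since the matching rules are not specified here.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

# Hypothetical patterns; the paper does not specify its matching rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji ranges (assumption): common pictograph and symbol blocks.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

LABELS = ("label_url", "label_mention", "label_hashtag", "label_emoji")


def relabel(tweet):
    """Replace URLs, mentions, hashtags and emojis with placeholder words."""
    # URLs go first so that '#' or '@' inside a link is not relabeled twice.
    tweet = URL_RE.sub(" label_url ", tweet)
    tweet = MENTION_RE.sub(" label_mention ", tweet)
    tweet = HASHTAG_RE.sub(" label_hashtag ", tweet)
    tweet = EMOJI_RE.sub(" label_emoji ", tweet)
    return " ".join(tweet.split())


def build_profile_document(tweets):
    """Concatenate a user's relabeled tweets and count each placeholder."""
    document = " ".join(relabel(t) for t in tweets)
    counts = Counter(w for w in document.split() if w in LABELS)
    return document, {label: counts[label] for label in LABELS}
```
]]></preformat>
        <p>Under these assumptions, each user profile yields one document plus one global count per placeholder word.</p>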
      </sec>
      <sec id="sec-4-2">
        <title>Normalization and Transformation</title>
        <p>The next stage comprises the normalization and transformation processes. The
normalization process balances the classes and generates random
samples for training and testing. In the transformation process,
the vector representation of words is built and the usage features for each user
profile are calculated. This process is configurable: the vector
representation of the words can use n-grams, and the global features related
to the tweets of the user profiles can also be parameterized.</p>
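        <p>As an illustration of a configurable n-gram vector representation, the sketch below builds word n-gram count vectors with a minimum-frequency cutoff in plain Python. The function names (word_ngrams, fit_vocabulary, vectorize) and the whitespace tokenization are assumptions made for the example; the library and tokenizer actually used are not stated.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of the text for n in [n_min, n_max]."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_vocabulary(documents, n_min=1, n_max=2, min_count=0):
    """Keep the n-grams whose total corpus count is higher than min_count."""
    totals = Counter()
    for doc in documents:
        totals.update(word_ngrams(doc, n_min, n_max))
    kept = sorted(g for g, c in totals.items() if c > min_count)
    return {gram: index for index, gram in enumerate(kept)}


def vectorize(document, vocabulary, n_min=1, n_max=2):
    """Map one document to a count vector ordered like the vocabulary."""
    counts = Counter(word_ngrams(document, n_min, n_max))
    return [counts.get(gram, 0) for gram in vocabulary]
```
]]></preformat>
        <p>Here min_count plays the role of a minimum word frequency: with min_count=3, only n-grams occurring more than three times in the corpus enter the counting matrix.</p>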
      </sec>
      <sec id="sec-4-3">
        <title>Feature Extraction</title>
        <p>The features based on the use of words, hashtags, mentions, URLs, and emojis, which are
calculated over the tweets of each profile in the celebrity system, are listed in Table 3.</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr>
                <th>Class</th>
                <th>Description - H0</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Profession</td>
                <td>The profession is mainly associated with the use of "specialized" vocabulary. Therefore, the classification process must be based on the vocabulary collected for each profession.</td>
              </tr>
              <tr>
                <td>Gender</td>
                <td>For gender, we want to establish features for the use of emojis, hashtags, mentions, RT and URLs. It is expected that the word-based features, added to those found in the user profiles, will improve the classifications.</td>
              </tr>
              <tr>
                <td>Fame</td>
                <td>Fame is perhaps the most important label for establishing features such as the use of emojis, hashtags, mentions, RT and URLs. In addition, we check whether the message is written in the first, second or third person. It is expected that the word-based features, added to those found in the user profiles, will improve the classifications.</td>
              </tr>
              <tr>
                <td>Birth years</td>
                <td>This label is perhaps the most difficult to classify because of the wide range of years, from 1940 to 2011. For this reason, groups were established in order to generate features for the use of emojis, hashtags, mentions, RT and URLs. We also considered whether the message was written in the first, second or third person. It is expected that the word-based features, added to those found in the user profiles, will improve the classifications.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>These metrics allow us to see the distribution of each feature in the profile, and for
some of them kurtosis and skewness (asymmetry) are calculated.</p>
        <p>These measures complement the averages of word length and of the number
of words per tweet. The main idea is to capture the actual shape of these two
distributions, given that the average alone may not show the
complete information. The remaining average features, which concern social-network
characteristics such as hashtags, mentions, emojis, URLs, and retweets, are also used
in the profile.</p>
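        <p>To make the role of the shape statistics concrete, the following sketch computes the mean, skewness, and kurtosis of the number of words per tweet for one profile. It assumes population (biased) moment formulas and excess kurtosis; which estimator the system used is not stated, and the helper names are hypothetical. The dictionary keys follow the feature names of Table 3.</p>
        <preformat><![CDATA[
```python
def moments(values):
    """Return mean, skewness and excess kurtosis (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # Constant sample: skewness and kurtosis are undefined, report 0.
        return mean, 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0
    return mean, skewness, kurtosis


def words_per_tweet_stats(tweets):
    """Average, skewness and kurtosis of the word count per tweet."""
    lengths = [len(t.split()) for t in tweets]
    mean, skewness, kurtosis = moments(lengths)
    return {
        "stats_label_word": mean,
        "skew_label_word": skewness,
        "kurtosis_label_word": kurtosis,
    }
```
]]></preformat>
        <p>Two profiles with the same average tweet length can differ strongly in skewness or kurtosis, which is the information the average alone does not carry.</p>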
        <p>
          Lexical diversity was represented using the Type-Token Ratio (TTR) feature [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
This measure relates the vocabulary used to all the words
included in the texts, which we think is very useful for detecting bots or specific kinds
of people. Finally, the use of the first, second, or third person, singular or plural, could
also reveal social characteristics.
        </p>
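        <p>A minimal sketch of these two feature families is given below: the type-token ratio over all tweets of a profile, and the number of tweets that use each grammatical person. The pronoun lexicons, the simple punctuation stripping, and the assignment of "you" to the second person singular are assumptions made for illustration; the actual word lists are not given in this paper.</p>
        <preformat><![CDATA[
```python
# Hypothetical English pronoun lists keyed by the feature names of Table 3.
PERSON_WORDS = {
    "stats_person_1_sing": {"i", "me", "my", "mine", "myself"},
    "stats_person_2_sing": {"you", "your", "yours", "yourself"},
    "stats_person_3_sing": {"he", "she", "it", "him", "her", "his", "hers", "its"},
    "stats_person_1_plu": {"we", "us", "our", "ours", "ourselves"},
    "stats_person_3_plu": {"they", "them", "their", "theirs"},
}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    return [w.strip(".,!?;:\"'()").lower() for w in text.split()]


def type_token_ratio(tweets):
    """Distinct words divided by total words over all tweets of a profile."""
    tokens = [w for t in tweets for w in tokenize(t) if w]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def person_counts(tweets):
    """Number of tweets containing at least one pronoun of each category."""
    counts = {name: 0 for name in PERSON_WORDS}
    for tweet in tweets:
        words = set(tokenize(tweet))
        for name, lexicon in PERSON_WORDS.items():
            if not words.isdisjoint(lexicon):
                counts[name] += 1
    return counts
```
]]></preformat>
        <p>A ratio close to 1 indicates that almost every word is new, while a low ratio indicates heavy repetition, as might be expected from automated accounts.</p>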
      </sec>
      <sec id="sec-4-4">
        <title>Settings and classifiers</title>
        <p>At the configuration stage, the system adjusts hardware parameters such as the number of
processors and threads. In addition, different scenarios can be configured for the use of
the classifiers. Finally, the system may be set to store the best-performing word
vectors and classifiers. Note that in all our experiments the data
set was divided into 60 % for training and 40 % for testing.</p>
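        <p>The 60/40 division can be illustrated with a simple shuffled split. This is only a sketch: whether the original split was stratified by class, and how it was seeded, is not stated, and the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
import random


def train_test_split(items, train_fraction=0.6, seed=0):
    """Shuffle a copy of items and split it into train and test parts."""
    shuffled = list(items)
    # A private Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```
]]></preformat>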
        <p>Based on previous PAN tasks, several
classifiers were examined: Naive Bayes (NB), Gaussian Naive Bayes (GNB),
Complement Naive Bayes (CNB), Logistic Regression (LR), and Random Forests (RF).
For the test stage, we developed a software component that performs the following activities.
First, the test data sets are read and the tweets are processed per user.
Afterwards, the usage features are calculated. As shown in Table 4, different models
were created in search of the best classifications. Subsequently, the vector representation is
built. The best classifiers for the Fame, Birthyear, Occupation, and Gender classes are then
determined. Finally, the best predictors are exported. Figure 2 shows the "System Test"
used by our models.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Analysis of Results</title>
      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>#</th><th>Feature</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>stats_avg_word</td><td>Average word size per tweet</td></tr>
            <tr><td>2</td><td>stats_kur_word</td><td>Kurtosis of the variable stats_avg_word</td></tr>
            <tr><td>3</td><td>stats_label_emoji</td><td>Number of emojis per tweet for the profile</td></tr>
            <tr><td>4</td><td>stats_label_hashtag</td><td>Number of hashtags per tweet for the profile</td></tr>
            <tr><td>5</td><td>stats_label_mention</td><td>Number of mentions per tweet for the profile</td></tr>
            <tr><td>6</td><td>stats_label_url</td><td>Number of URLs per tweet for the profile</td></tr>
            <tr><td>7</td><td>stats_label_retweets</td><td>Number of retweets per tweet for the profile</td></tr>
            <tr><td>8</td><td>stats_lexical_diversity</td><td>Lexical diversity for all tweets by profile</td></tr>
            <tr><td>9</td><td>stats_label_word</td><td>Number of words per tweet for the profile</td></tr>
            <tr><td>10</td><td>kurtosis_avg_word</td><td>Kurtosis of the variable stats_kur_word</td></tr>
            <tr><td>11</td><td>kurtosis_label_word</td><td>Kurtosis of the variable stats_label_word</td></tr>
            <tr><td>12</td><td>skew_avg_word</td><td>Statistical asymmetry of the variable stats_avg_word</td></tr>
            <tr><td>13</td><td>skew_label_word</td><td>Statistical asymmetry of the variable stats_label_word</td></tr>
            <tr><td>14</td><td>stats_person_1_sing</td><td>Number of tweets written in the first person singular</td></tr>
            <tr><td>15</td><td>stats_person_2_sing</td><td>Number of tweets written in the second person singular</td></tr>
            <tr><td>16</td><td>stats_person_3_sing</td><td>Number of tweets written in the third person singular</td></tr>
            <tr><td>17</td><td>stats_person_1_plu</td><td>Number of tweets written in the first and second person plural</td></tr>
            <tr><td>18</td><td>stats_person_3_plu</td><td>Number of tweets written in the third person plural</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>During the pre-evaluation phase, we carried out different experiments, and the best ones
were taken into account for the evaluation phase. The system has been evaluated using
the usual competition metrics, including Accuracy (Acc), Precision (P), Recall (R) and
F1-score (F1). The best systems in the pre-evaluation phase are explained in detail
in the following sections.</p>
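        <p>For reference, the four metrics can be computed for a multi-class problem as sketched below. The sketch assumes macro-averaging of precision, recall, and F1 over the classes; the averaging scheme used in the evaluation is not stated here, and the function name evaluate is hypothetical.</p>
        <preformat><![CDATA[
```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true).union(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    return {
        "Acc": accuracy,
        "P": sum(precisions) / k,
        "R": sum(recalls) / k,
        "F1": sum(f1s) / k,
    }
```
]]></preformat>
        <p>With strongly imbalanced classes, accuracy can remain high while macro-averaged recall drops, which is why the metrics are reported together.</p>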
      <p>Table 5 summarizes the performance for each label
in the challenge. For each label, it shows the best classification model, the
accuracy obtained with it, and the features that worked best for the classification. The
classifiers that obtained the best performance were Logistic Regression and Multinomial
Naive Bayes. Finally, the table describes the pre-processing performed: whether the dataset was
cleaned, whether the 18 features were used in the classification,
and the minimum word frequency in the word vectors.</p>
      <p>
        Note that the system presented was trained and tested with the celebrity
dataset provided by the official site of PAN 2019 [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The submissions were
made on the TIRA [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] platform, on which we configured a virtual server with ten
processors and set up the environment to perform the tests [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-5-1">
        <title>Fame classification</title>
        <p>The fame variable is perhaps the most important for the competition. As stated in
the hypothesis, the 18 proposed features have an impact on its classification. The
model evaluated with the 18
proposed features has an accuracy of 0.65, while the model evaluated only
with the traditional bag-of-words has an accuracy of 0.51. The results show that the
features used in our model describe this variable more accurately.</p>
        <p>As can be seen in Table 6, this variable achieved the best performance with the
logistic regression classifier where the proposed features were used. Moreover, we
performed the following steps: pre-processing, normalization, cleaning, and re-labeling.
Finally, only words with frequencies higher than three were taken into account in the
vocabulary counting matrix.</p>
      </sec>
      <sec>
        <title>Gender classification</title>
        <p>The gender variable has an additional variation because it includes an extra non-binary
value. Including the non-binary value posed a significant
challenge because it was severely under-represented. Based on the
above, we hypothesized that the proposed features would have an
impact on the classification. Subsequently, we corroborated that the addition of the
non-binary value has a significant effect on the model, which obtained
an accuracy of 0.36; with the addition of the 18 features, the
model reached an accuracy of 0.88.</p>
        <p>As can be seen in Table 7, this variable achieved the best performance with the
logistic regression classifier where the features were used. Moreover, we performed
the following steps: pre-processing, normalization, cleaning, and re-labeling. Finally,
only words with frequencies greater than 9 were taken into account in the vocabulary
counting matrix.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Birth year classification</title>
        <p>As proposed in the hypothesis, the birth year model has better accuracy when
using the new features. Adding the 18 features to the word-vector model
increases accuracy to 0.37, a
gain of 0.29.</p>
        <p>For the birth year classification, we discretized the variable using a window of size
m that depends on the birth year. The value of m increases linearly from about 2 years for 2012 to
about 9 years for 1940.</p>
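        <p>One possible reading of this discretization is sketched below. It assumes a linear interpolation between the two stated endpoints, m(y) = 2 + 7 (2012 - y) / 72, and an illustrative binning that walks from the newest year to the oldest; the exact binning procedure is not given in this paper, and all function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def window_size(year, newest=2012, oldest=1940, m_newest=2.0, m_oldest=9.0):
    """Linearly interpolated window width (in years) for a birth year."""
    fraction = (newest - year) / (newest - oldest)
    return m_newest + (m_oldest - m_newest) * fraction


def build_bins(newest=2012, oldest=1940):
    """Walk from the newest year to the oldest, cutting bins of width m."""
    bins = []
    upper = newest
    while upper >= oldest:
        width = max(1, round(window_size(upper, newest, oldest)))
        lower = max(oldest, upper - width + 1)
        bins.append((lower, upper))
        upper = lower - 1
    return bins


def year_to_bin(year, bins):
    """Return the index of the bin that contains the given birth year."""
    for index, (lower, upper) in enumerate(bins):
        if lower <= year <= upper:
            return index
    raise ValueError("year outside the covered range")
```
]]></preformat>
        <p>Under this assumption, recent birth years fall into narrow classes and older ones into progressively wider classes, which reduces the number of target values for the classifier.</p>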
        <p>As can be seen in Table 8, this variable achieved the best performance with the
logistic regression classifier where the features were used. Moreover, we performed the
following steps: pre-processing, normalization, cleaning, and re-labeling. Finally, in
the vocabulary counting matrix, only words with frequencies higher than six were taken
into account, and the significant imbalance of this variable also led us to apply
oversampling.</p>
        <p>For this variable, we evaluated the accuracy with the data initially delivered by
the competition, and we made no additional changes for the final evaluation.</p>
      </sec>
      <sec>
        <title>Occupation classification</title>
        <p>The occupation variable, as stated in the initial hypothesis, is based primarily on
specialized vocabulary. However, we did not use newer approaches such as embeddings,
ontologies, or other technologies. The results showed that using the user-profile features
in the occupation classification did not significantly affect it. After applying
the vocabulary and the 18 features to the models used, an accuracy of 0.57 was
obtained.</p>
        <p>As can be seen in Table 9, this variable achieved the best performance with the
Multinomial Naive Bayes classifier where only the vocabulary was used, without any
cleaning. Finally, only words with frequencies higher than three were taken into account
in the vocabulary counting matrix.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Test Results</title>
      <p>As shown in Table 10, the models were tested using the training dataset, the test1 dataset,
and the test2 dataset. In the ranking of the task, we placed second.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>The CLEF-PAN 2019 celebrity profiling task posed several challenges that are worth
highlighting. First, there are four classes to predict for each celebrity, and the number of
values each of them can take was a problem. The most critical was the birth year class,
whose dimensionality was reduced by grouping profiles every ten years for
better classification accuracy.</p>
      <p>In addition, the training dataset of celebrities had an evident imbalance in
some of the classes. For example, the birth year was imbalanced, gender had only 32
samples of the non-binary type, and occupation values such as religion had only 35
samples. Some of these challenges were addressed by balancing the examples
through oversampling.</p>
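        <p>A minimal sketch of random oversampling is shown below. The text states only that oversampling was performed; the variant illustrated here (duplicating minority-class samples at random until every class matches the largest one) and the function name are assumptions.</p>
        <preformat><![CDATA[
```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```
]]></preformat>
        <p>Oversampling is normally applied to the training portion only, so that duplicated samples do not leak into the test data.</p>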
      <p>The volume of data was another important challenge: it was necessary to process
more than 53 million tweets associated with the profiles analyzed. To deal with that, we
worked on a cluster of 10 servers.</p>
      <p>The novelty of the analysis presented in this paper lies in analyzing features specific
to digital social networks for each profile. The use of sociolinguistic features in the
user profile has revealed many peculiarities in social, cultural, and gender topics. These
characteristics describe the sociolect of the celebrities in this study; we also find it
essential to know whether the text was written in the first, second, or third person, and
the lexical diversity of each profile.</p>
      <p>As future work, we plan to analyze the models with real samples with a similar
or greater volume of messages. Finally, we want to review the posts and context data
to have models that respond socially to variables that represent real phenomena in the
network.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank the Center for Excellence and Appropriation in Big Data and Data
Analytics (CAOBA), Pontificia Universidad Javeriana, and the Ministry of Information
Technologies and Telecommunications of the Republic of Colombia (MinTIC). The models
and results presented in this challenge contribute to the construction of the research
capabilities of CAOBA. We also thank Fondo Europeo de Desarrollo Regional (FEDER), the
REDES project (TIN2015-65136-C2-1-R), and the LIVING-LANG project
(RTI2018-094653-B-C21) from the Spanish Government. The author Edwin Puertas thanks
Universidad Tecnológica de Bolívar. Finally, we thank the organizing
committee of PAN, especially Paolo Rosso, Francisco Rangel, Matti Wiegmann, and Martin
Potthast, for their encouragement and kind support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aragón</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.:</given-names>
          </string-name>
          <article-title>A straightforward multimodal approach for author profiling</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwyer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medvedeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rawee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haagsma</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>N-GrAM: New Groningen Author-profiling Model</article-title>
          .
          <source>arXiv preprint arXiv:1707.03764</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Copland</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snell</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Linguistic ethnography: interdisciplinary explorations</article-title>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In:
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braschler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heinatz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019)</source>
          . Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fatima</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nawab</surname>
            ,
            <given-names>R.M.A.</given-names>
          </string-name>
          :
          <article-title>Multilingual author profiling on Facebook</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>53</volume>
          (
          <issue>4</issue>
          ),
          <fpage>886</fpage>
          -
          <lpage>904</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Building accurate HAV exploiting user profiling and sentiment analysis</article-title>
          .
          <source>arXiv preprint arXiv:1609.07302</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoppe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>TIRA: Configuring, executing, and disseminating information retrieval experiments</article-title>
          . In:
          <source>2012 23rd International Workshop on Database and Expert Systems Applications</source>
          . pp.
          <fpage>151</fpage>
          -
          <lpage>155</lpage>
          . IEEE (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>HaCohen-Kerner</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yigal</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elyashiv Shayovitz</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breckon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Author profiling: Gender prediction from tweets and images</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Karlgren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esposito</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gratton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanerva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship profiling without using topical information: Notebook for PAN at CLEF 2018</article-title>
          .
          In:
          <source>19th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2018</source>
          , Avignon, France, 10 September 2018 through 14 September 2018. vol.
          <volume>2125</volume>
          .
          <publisher-name>CEUR-WS</publisher-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Khamis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Self-branding, 'micro-celebrity' and the rise of social media influencers</article-title>
          .
          <source>Celebrity Studies</source>
          <volume>8</volume>
          (
          <issue>2</issue>
          ),
          <fpage>191</fpage>
          -
          <lpage>208</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>McCarthy</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>vocd: A theoretical and empirical evaluation</article-title>
          .
          <source>Language Testing</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <fpage>459</fpage>
          -
          <lpage>488</lpage>
          (
          <year>2007</year>
          ), https://doi.org/10.1177/0265532207080767
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nieuwenhuis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Twitter text and image gender classification with a logistic regression n-gram model</article-title>
          .
          In:
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Use of language and author profiling: Identification of gender and age</article-title>
          .
          <source>Natural Language Processing and Cognitive Science</source>
          <volume>177</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farías</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cagnina</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaghouani</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charfi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A survey on author profiling, deception, and irony detection for the Arabic language</article-title>
          .
          <source>Language and Linguistics Compass</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <elocation-id>e12275</elocation-id>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN 2018: Author identification, author profiling, and author obfuscation</article-title>
          . In: Bellot, P., et al. (eds.)
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Association, CLEF 2018</source>
          . Avignon, France, September 10-14. pp.
          <fpage>267</fpage>
          -
          <lpage>285</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jiménez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salgado</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortiz-Bejar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Gender identification through multi-modal tweet analysis using MicroTC and bag of visual words</article-title>
          .
          In:
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Celebrity Profiling Task at PAN 2019</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          . CEUR-WS.org (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>