<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word Distance Approach for Celebrity Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Usman Asif</string-name>
          <email>usmanasifweb@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Naeem</string-name>
          <email>naeemshahzad7075@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zeeshan Ramzan</string-name>
          <email>zramzan@uet.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fahad Najib</string-name>
          <email>fahad.najib@uet.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Engineering and Technology, Lahore, KSK Campus</institution>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes and evaluates a model for the Celebrity Profiling 2019 dataset. The training dataset contains the texts of 33,836 celebrities in 50 different languages. The task was to create a model for this complex textual dataset that predicts a celebrity's gender (male, female, nonbinary), degree of fame (star, superstar, rising), occupation (sports, performer, creator, professional, manager, science, politics, religious) and birthyear (1940-2011). We use word distance features as input to different classifiers for the different aspects (gender, fame, occupation and birthyear) of a celebrity to create models. Results show that the word distance-based features outperform the PAN baseline results.</p>
      </abstract>
      <kwd-group>
        <kwd>Celebrity Profiling</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The celebrity profiling task [14] offered by PAN'19 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is to predict a celebrity's gender (male, female, nonbinary), degree of fame (star, superstar, rising), occupation (sports, performer, creator, professional, manager, science, politics, religious) and birthyear (1940-2011) from the celebrity's tweets, written in 50 different languages. The dataset [13] for both training and testing of models was provided by PAN. The complete dataset contains tweets of 48,335 celebrity users. The training dataset consists of the tweets of 33,836 users; the remaining users' tweets form the test dataset. The prediction of properties with many labels, e.g., birthyear with 71 label classes and occupation with 8 label classes, makes the task more challenging.
      </p>
      <p>
        Almost all celebrities use Twitter and tweet there. The task is important for social media and the celebrity industry, where celebrity properties such as gender, birthyear, occupation and fame are predicted from tweets; measuring these properties is significant for celebrity fans, social media platforms and the industry. Knowing users' demographics from their written text also has applications in marketing, as brands could increase the reach of their message to a more relevant audience [10,12]. The problem of predicting celebrity traits also has applications in forensics
        [
        <xref ref-type="bibr" rid="ref4">8,4</xref>
        ] because of increasing cases of cybercrime, including sexual harassment, threatening, identity theft, etc.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The problem of predicting personality traits from text has gained a lot of attention from the community due to its applications in various other problems. Previously, the doc2vec document embedding technique was used to train SVM and logistic regression classifiers
        [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">2,5,1</xref>
        ]. RUSProfiling (Cross-Genre Gender Identification in Russian texts) used character n-grams, word n-grams and gender-specific Russian grammatical features to train multinomial Naive Bayes, logistic regression, random forest and ensemble classifiers for gender identification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The problem of identifying an author's traits from his/her written text has also been addressed by using stylistic features to train different machine learning classifiers, e.g., J48, logistic regression, random forest and Naive Bayes [11]. Different feature representations, including raw frequency, binary, normalized frequency, tf-idf and second order attributes (SOA), have been used in combination with machine learning algorithms including multinomial Naive Bayes, Support Vector Machines (SVM) and logistic regression [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Corpus</title>
      <p>
        The PAN'19 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] Celebrity Profiling [14] dataset [13] contains Twitter data of 48,335 user profiles in total. The tweets belong to 50 different languages. A subset of this dataset, the tweets of 33,836 users, is used for training the models, whereas the remaining 14,499 user profiles are used for testing the trained models. The complete training dataset consists of a single ndjson file containing the tweets of all 33,836 user profiles / celebrities.
      </p>
      <p>The corpus contains tweets, grouped by user/celebrity and labeled with gender
(male, female, nonbinary), degree of fame (star, superstar, rising), occupation (sports,
performer, creator, professional, manager, science, politics, religious) and birthyear
(1940-2011).</p>
      <p>The corpus is not balanced (See Figure 1). In the case of gender, more than 50% of the profiles belong to male celebrities, whereas only 32 users are non-binary. Similarly, a huge proportion of the user profiles are stars, whereas the frequencies of rising and superstars are very low. The same holds for occupation, where there are sufficient instances of sports, performer and creator, whereas the remaining categories are in the minority. The corpus is also unbalanced with respect to birthyear (See Figure 2).</p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>We use the word distance approach to train models that predict different personality traits of celebrities. We built 200 (4 * 50) models, as the corpus contains tweets in 50 different languages and we have to predict four aspects of each user profile. Each model predicts a specific class / personality trait for a specific language; for example, the English gender model predicts the gender of users whose tweets are in English. Each model has been trained using a different set of features and a different classifier.</p>
      <sec id="sec-4-1">
        <title>Pre-processing</title>
        <p>As the corpus contains tweets written in 50 different languages, we grouped tweets of the same language into the same file using the langdetect module of Python. In this way, the whole corpus was divided into 50 ndjson files, such as en.ndjson and ar.ndjson for English and Arabic. After separation we observed that almost 93% of the tweets are in English and the remaining 7% are non-English, so we made two categories: an English and a non-English corpus.</p>
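        <p>The language-separation step can be sketched in Python as follows. This is a minimal illustration, not the exact pipeline: the ndjson field names and the stand-in detector are assumptions; in practice the detector would be langdetect's detect function, as used in the paper.</p>

```python
import json
from collections import defaultdict

def split_by_language(lines, detect):
    """Group ndjson user records by the detected language of their tweets.

    `detect` is any callable mapping text to a language code; the paper
    uses detect from the third-party langdetect package.
    """
    by_lang = defaultdict(list)
    for line in lines:
        user = json.loads(line)
        # Detect on a sample of concatenated tweets; a single short
        # tweet is often too little text for reliable detection.
        sample = " ".join(user["text"][:20])
        try:
            code = detect(sample)
        except Exception:
            code = "unknown"
        by_lang[code].append(user)
    return by_lang

# Stand-in detector so the sketch runs without langdetect installed;
# in practice: from langdetect import detect
def toy_detect(text):
    return "es" if "zorro" in text else "en"

records = [
    json.dumps({"id": 1, "text": ["The quick brown fox jumps over the lazy dog."]}),
    json.dumps({"id": 2, "text": ["El rapido zorro marron salta sobre el perro."]}),
]
groups = split_by_language(records, toy_detect)
```

        <p>Each group can then be written to its own file (en.ndjson, ar.ndjson, etc.) as described above.</p>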
        <p>After separating the tweets of different languages, we applied different techniques for data cleaning and feature extraction. The tweets contained many emojis, hashtag words, stop words, punctuation marks, numbers, alphanumeric words, links / URLs, short-form words, words with repeating characters, and escape characters. First of all, we removed all links / URLs from the corpus using a regular expression. Then, we extracted words by tokenizing the text with word tokenizers, using language-specific tokenizers where available; if we could not find a word tokenizer for a language, we tokenized the text on whitespace. We made a set of unique tokens. After that, we excluded all punctuation marks, stop words, numbers, alphanumeric words and URLs. Then we removed all escape characters as well as hash tags (#), @ signs, spaces, brackets, etc., from the word strings. That is how we obtained the cleaned set of words.</p>
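        <p>The cleaning steps above can be sketched as follows. This is a simplified illustration with a toy stop-word list and a whitespace fallback tokenizer; the actual pipeline uses language-specific tokenizers and full stop-word lists.</p>

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "at"}  # toy list

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_tokens(text):
    """URL removal, tokenization, deduplication, then filtering."""
    text = URL_RE.sub(" ", text)
    # Fallback whitespace tokenizer (used when no language-specific
    # tokenizer is available).
    tokens = set(text.lower().split())
    cleaned = set()
    for tok in tokens:
        tok = tok.strip(string.punctuation)        # strip #, @, brackets, etc.
        if not tok or tok in STOP_WORDS:
            continue
        if any(ch.isdigit() for ch in tok):        # drop numbers / alphanumerics
            continue
        cleaned.add(tok)
    return cleaned

tweet = "Loved the show!! #blessed @fans https://t.co/abc123 see u at 8pm"
tokens = clean_tokens(tweet)
```
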
      </sec>
      <sec id="sec-4-2">
        <title>Features Selection</title>
        <p>After data cleaning and pre-processing, we made a dictionary for each personality trait and each language, which yielded 200 (4 * 50) dictionaries. The key of each dictionary is a word and its value is a list. The length of the list depends upon the number of labels in the class (gender, fame, occupation or birthyear) to predict. For example, if we want to make a model to predict gender, then the list's first index (index 0) gives the count of male users in the corpus who used this word (the key of the dictionary) in their tweets; likewise, the second index is for female and the third for non-binary users (See Figure 3).
– n1: Number of males in the corpus who use this word in their tweets
– n2: Number of females in the corpus who use this word in their tweets
– n3: Number of non-binaries in the corpus who use this word in their tweets
The same process was followed to create dictionaries for the fame, occupation and birthyear classes. For example, the range of birthyear is from 1940 to 2011; it contains 71 possibilities, so the list length would be 71.</p>
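        <p>Building such a dictionary can be sketched as follows; the record field names are assumptions for illustration, and only the gender trait is shown.</p>

```python
from collections import defaultdict

LABELS = ["male", "female", "nonbinary"]  # index 0, 1, 2 as in Figure 3

def build_word_dict(users):
    """Map each word to per-label user counts: word -> [n1, n2, n3]."""
    counts = defaultdict(lambda: [0] * len(LABELS))
    for user in users:
        label_idx = LABELS.index(user["gender"])
        # Count each user at most once per word (unique words per user).
        for word in set(user["words"]):
            counts[word][label_idx] += 1
    return counts

users = [
    {"gender": "male", "words": ["goal", "match", "goal"]},
    {"gender": "female", "words": ["match", "fashion"]},
    {"gender": "male", "words": ["match"]},
]
d = build_word_dict(users)
```

        <p>For fame, occupation and birthyear the same structure applies, only with a longer label list (e.g., 71 entries for birthyear).</p>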
        <p>The most important and tricky part of feature selection was to filter the most distinguishing features from the dictionaries created in the last step. For each word, we checked which label / class (male, female, non-binary) uses this word the most. If all classes use a word with almost comparable frequency, then it is a common word and we do not choose it as a feature to train the model. But if one label / class uses this word much more than the others, then we choose this word as a feature. In other words, we choose words for which one label class has a large distance in count from the other classes. For this we designed a strategy to calculate the maximum distance a word creates. Since the total corpus contains, say, 60% male, 30% female and 10% non-binary tweets, male counts will always dominate the female and non-binary counts. To compensate, we multiplied each count (n1, n2, n3) in the list by the corresponding ratio. Equations 1, 2 and 3 show the formulas to calculate the ratios.</p>
        <p>ratio_male = total number of tweets in all corpora / total number of tweets of male users (1)</p>
        <p>ratio_female = total number of tweets in all corpora / total number of tweets of female users (2)</p>
        <p>ratio_nonbinary = total number of tweets in all corpora / total number of tweets of non-binary users (3)</p>
        <p>After calculating the ratios, we multiplied each word's count list (n1, n2, n3) by the corresponding ratio. The new structure of each key-value pair of the dictionary is:
word : [n1 * ratio_male, n2 * ratio_female, n3 * ratio_nonbinary]</p>
        <p>Multiplying by the ratios partly solves the problem caused by the unbalanced dataset and the dominating class. After this we calculated the difference between the highest count value in the list and each of the other counts, and summed all these differences; the resulting number is the distance of that word. For example, if n1 has the highest value in the list (n1, n2, n3), the word distance is calculated using Equation 4.</p>
        <p>Word_Distance = (n1 - n2) + (n1 - n3) (4)</p>
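        <p>Equations 1-4 together can be sketched as follows; the example numbers (60% / 30% / 10% class shares) mirror the illustration in the text and are not from the actual corpus.</p>

```python
def class_ratios(tweets_per_class, total_tweets):
    """ratio_c = total tweets in all corpora / tweets of class c (Eqs. 1-3)."""
    return [total_tweets / n for n in tweets_per_class]

def word_distance(counts, ratios):
    """Weight counts by class ratios, then sum the gaps between the
    highest weighted count and the others (Eq. 4, generalized)."""
    weighted = [n * r for n, r in zip(counts, ratios)]
    top = max(weighted)
    return sum(top - w for w in weighted)

# 60% male, 30% female, 10% non-binary tweets out of 1000 total:
ratios = class_ratios([600, 300, 100], 1000)
# A word used by 30 male, 30 female and 0 non-binary users; after
# weighting, the female count no longer drowns under the male majority.
d = word_distance([30, 30, 0], ratios)
```
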
        <p>After calculating the distance of each word, the dictionary contains words as keys and their respective distances as values:</p>
        <p>Dictionary = {word1 : word1_distance, word2 : word2_distance}</p>
        <p>Now, we sorted this dictionary in descending order of distance. The large size of the dictionary made it challenging to sort. Therefore, we took the list of all values from the dictionary, sorted it in descending order and deleted the low-distance values. After sorting, we picked the top-scoring values, looked up the corresponding words in the dictionary and selected them as features.</p>
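        <p>A compact version of this selection step follows; for small dictionaries, sorting the (word, distance) pairs directly is equivalent to the values-list workaround described above.</p>

```python
def select_features(distances, k):
    """Pick the k highest-distance words as features."""
    # Descending sort by distance; for very large dictionaries the
    # values list can be sorted separately, as described in the text.
    ranked = sorted(distances.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

distances = {"goal": 150.0, "the": 0.5, "fashion": 90.0, "match": 12.0}
features = select_features(distances, 2)
```
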
        <p>After extracting features, we created CSV files to pass to the machine learning algorithms for model training. The sklearn implementations of various machine learning algorithms were applied to the CSV files created in the last step. We used 80% of the data for training the models and 20% for testing. We applied six different algorithms (See Table 1) to train models, then tested them on the 20% testing data. We selected the highest-scoring algorithm to train the final model on 100% of the available data. The performance of our proposed approach for individual traits was judged by the F1 measure (See Equation 5), whereas the overall performance of the system was judged by a combined metric, cRank, which is the harmonic mean of the per-trait F1 scores (See Equation 6).</p>
        <p>F1 = (2 * precision * recall) / (precision + recall) (5)</p>
        <p>cRank = 4 / (1/F1,fame + 1/F1,occupation + 1/F1,gender + 1/F1,age) (6)</p>
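        <p>Equations 5 and 6 can be computed directly; the scores in the example call are made-up values for illustration only.</p>

```python
def f1(precision, recall):
    """Equation 5: the F1 measure."""
    return 2 * precision * recall / (precision + recall)

def c_rank(f1_fame, f1_occupation, f1_gender, f1_age):
    """Equation 6: the harmonic mean of the four per-trait F1 scores."""
    scores = [f1_fame, f1_occupation, f1_gender, f1_age]
    return len(scores) / sum(1 / s for s in scores)

# Illustrative per-trait F1 values (not actual results):
score = c_rank(0.7, 0.7, 0.6, 0.5)
```

        <p>As a harmonic mean, cRank is pulled toward the weakest trait, so a single poorly predicted trait (such as birthyear) lowers the overall score noticeably.</p>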
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Analysis</title>
      <p>The results of our proposed approach on the training dataset are presented in Table 2, which shows the F-measure for all traits. The F-measure for occupation and fame is much higher than for the other two traits; the training dataset being somewhat more balanced for occupation and fame could be the reason for this. Moreover, the birthyear range is 1940-2011 and there are not enough non-English user profiles to cover all these birthyears. These problems with the training dataset made it very challenging to correctly predict birthyear. The cRank (See Equation 6) on the training dataset, the combined score of all traits, is 0.604653.</p>
      <p>Table 3 presents the results obtained by applying our proposed technique on the test dataset using TIRA [9]. These results show that our technique did not perform as well on the test dataset as on the training dataset. The features, i.e., the list of words used for training, were extracted from the training dataset and were not necessarily present in the test dataset with comparable frequency. This limitation of the approach resulted in over-fitting: very promising results on the training dataset but not on the test dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we have explained a technique for the prediction of celebrities' gender, fame, occupation and birthyear from their tweets. It has applications in various fields such as forensics, marketing and security. We trained models on the training data provided by the PAN organizers and achieved good results. In future, better performance can be achieved by making the training dataset more balanced and more representative of the population. Moreover, more sophisticated features, which are not specific to the training dataset, can also improve results.</p>
      <p>8. Peng, J., Choo, K.K.R., Ashman, H.: Bit-level n-gram based forensic authorship analysis on social media: Identifying individuals from linguistic profiles. Journal of Network and Computer Applications 70, 171-182 (2016)</p>
      <p>9. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)</p>
      <p>10. Rambocas, M., Gama, J., et al.: Marketing research: The role of sentiment analysis. Tech. rep., Universidade do Porto, Faculdade de Economia do Porto (2013)</p>
      <p>11. Sittar, A., Ameer, I.: Multi-lingual author profiling using stylistic features. In: FIRE (2018)</p>
      <p>12. Ting, T.C., Davis, J., Pettit, F.A.: Online marketing research utilizing sentiment analysis and tunable demographics analysis (2014), US Patent 8,694,357</p>
      <p>13. Wiegmann, M., Stein, B., Potthast, M.: Celebrity Profiling. In: Proceedings of ACL 2019 (to appear) (2019)</p>
      <p>14. Wiegmann, M., Stein, B., Potthast, M.: Overview of the Celebrity Profiling Task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akhtyamova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardiff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Twitter author profiling using word embeddings and logistic regression</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bayot</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Author profiling using svms and word embedding averages</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          . pp.
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavancas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          : Overview of PAN 2019:
          <article-title>Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In:
          <string-name><surname>Crestani</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Braschler</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Savoy</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Rauber</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Müller</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Losada</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Heinatz</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Cappellato</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Ferro</surname>, <given-names>N.</given-names></string-name>
          (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Grant</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Quantifying evidence in forensic authorship analysis</article-title>
          .
          <source>International Journal of Speech, Language &amp; the Law</source>
          <volume>14</volume>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Durán</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Author profiling with doc2vec neural network-based document embeddings</article-title>
          .
          <source>In: Mexican International Conference on Artificial Intelligence</source>
          . pp.
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>Adapting cross-genre author profiling to language and corpus</article-title>
          . In: CLEF (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>The winning approach to crossgenre gender identification in russian at rusprofiling 2017</article-title>
          . In: FIRE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>