<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Know your Neighbors: Efficient Author Profiling via Follower Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boško Koloski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Senja Pollak</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Blaž Škrlj</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Science - University of Ljubljana</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>User profiling based on social media data is becoming an increasingly relevant task, with applications in advertising, forensics, literary studies and sociolinguistic research. Even though profiling of users based on their own textual data is possible, social media such as Twitter also offer insight into the data of a given user's followers. The purpose of this work was to explore how such follower data can be used for profiling a given user, what its limitations are, and whether performance similar to that observed when considering a given user's data directly can be achieved. We present an approach that extracts various feature types and, via sparse matrix factorization, learns dense, low-dimensional representations of individual persons solely from their followers' tweet streams. The proposed approach scored second in the PAN 2020 Celebrity Profiling shared task and is computationally non-demanding.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and Related Work</title>
      <p>
        User profiling on social media is becoming an increasingly relevant task, for example when detecting
problematic users or bots. In the era of social media, text-based representations of such
users need to be learned, which has become a lively research area [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Online social
media, such as Twitter, offer a unique opportunity to test to what extent properties of
users can be predicted, and what the potential implications of such learning endeavours are
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This paper discusses the challenge of predicting a given user’s properties based solely
on the information captured from that user’s followers’ texts. The paper explores to
what extent follower data offers profiling capabilities and what its limitations are.
A schematic overview of the scenario considered in this work is shown in Figure 1.
The remainder of this work is structured as follows. We first present the related work,
followed by a description of the proposed system, the experimental evaluation, and the
concluding remarks.
One of the first author profiling tasks was gender prediction by Koppel et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
who conducted experiments on a subset of the British National Corpus and found that
women have a more relational and men a more informational writing
style. While deep learning approaches have recently been prevailing in many natural
language processing and text mining tasks, the state-of-the-art research on gender
classification mostly relies on extensive feature engineering and traditional classifiers.
      </p>
      <p>
        Examples of previous PAN competition winners include [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (who used support
vector machines); however, the second-ranked solution [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was even simpler, employing
only a logistic regression classifier with features that also included emoji information and
similar cues. In PAN 2016, the best gender classification performance was achieved by [8],
who employed a logistic regression classifier and used word unigram, word bigram
and character four-gram features.
      </p>
      <p>
        The PAN 2016 AP shared task also dealt with age classification. The winners of this
task [12] used a linear SVM model and employed a variety of features: word, character
and POS tag n-grams, capitalization (of words and sentences), punctuation (final and
per sentence), word and text length, vocabulary richness, emoticons and topic-related
words. We also acknowledge the research of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], who among other classification tasks
also dealt with predicting a text author’s occupation on Spanish tweets. They
evaluated several classification approaches (bag of terms, second-order attribute
representation, a convolutional neural network and an ensemble of n-grams at the word and
character level) and showed that the highest performance can be achieved with an ensemble of
word and character n-grams. Finally, the modeling task addressed in this work is
similar to last year’s PAN Celebrity Profiling challenge, which aimed at predicting age,
gender, fame and occupation [13], and from which we also sourced some of the ideas used
in the final models. The winning approach last year used tf-idf features with logistic
regression and SVM classifiers [10].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset Description and Preprocessing</title>
      <p>The training set for the PAN 2020 Celebrity Profiling shared task is composed of
English tweets from the follower feeds of 1,920 celebrities, labeled in three categories: gender,
occupation and birthyear. The dataset is balanced with respect to gender and occupation, while
the birthyear labels are not balanced. The distribution of the gender and occupation data is
shown in Table 1, and the birthyear data is presented in Figure 2, containing the original
distribution and the augmented one, as described in the Experiments section.</p>
      <p>To prepare the data, we first select 20 tweets from each of 10 followers per
celebrity, i.e., 200 tweets in total per celebrity. Next, the tweet
data is concatenated and preprocessed, as discussed next.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Construction and Classification Model</title>
      <p>The following section describes the proposed method and its intermediate
steps.</p>
      <p>
        Before feature construction, dimensionality reduction and classifier application, in
the initial step we construct multiple representations of a given user, which we denote as a
collection C. The space of constructed features, similarly to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], is based on:
– the original text;
– a punctuation-free version: punctuation is removed from the original text;
– a stop-word-free version: stop words are removed from the punctuation-free version.
      </p>
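      <p>The following minimal Python sketch builds these three views for one document; the helper name and the use of scikit-learn’s built-in English stop-word list are assumptions for illustration:</p>
      <preformat>
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def build_collection(text):
    """Return the three representations of one document in collection C."""
    # Punctuation-free view: strip all ASCII punctuation characters.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Stop-word-free view: filter stop words out of the punctuation-free text.
    no_stop = " ".join(
        tok for tok in no_punct.split() if tok.lower() not in ENGLISH_STOP_WORDS
    )
    return {"original": text, "no_punctuation": no_punct, "no_stopwords": no_stop}
      </preformat>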
    </sec>
    <sec id="sec-4">
      <title>Automatic Feature Selection</title>
      <p>The collection C consists of multiple representations of each author, offering a large
space of potential features. We focused on character- and word-level features to capture
potentially interesting semantics. For this step, we used scikit-learn’s [9] word
tokenizer. The generated features are described as follows:
– character-based: from each part of the collection C we generate character n-grams
(of length 1 or 2), with up to n/2 of the maximum allowed features;
– word-based: from each part of the collection C we generate word n-grams (of length
1, 2 or 3), with up to n/2 of the maximum allowed features.
At the conclusion of the pipeline execution, we have prepared word and character
features from each celebrity’s collection of tweets, ready to be used in the feature selection
step; these features are finally joined via scikit-learn’s FeatureUnion.</p>
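      <p>A hedged sketch of this joint word/character feature space, using scikit-learn’s TfidfVectorizer and FeatureUnion; the tf-idf weighting and the even n/2 split between the two analyzers are assumptions, not the authors’ verified configuration:</p>
      <preformat>
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

n = 20000  # total feature budget, as selected in the experiments

# Character (length 1-2) and word (length 1-3) n-grams, each capped at n/2.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 2),
                             max_features=n // 2)),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3),
                             max_features=n // 2)),
])

documents = ["toy follower feed one", "another toy follower feed"]
X = features.fit_transform(documents)  # sparse matrix, one row per celebrity
      </preformat>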
      <sec id="sec-4-1">
        <title>Dimensionality reduction via matrix factorization</title>
        <p>Finally, we perform sparse singular value decomposition (SVD)<sup>1</sup>, which can be
summarized via the following expression:
          <disp-formula><tex-math>M = U \Sigma V^{T}.</tex-math></disp-formula>
The final representation (embedding) E is obtained by multiplying back only a portion
of the diagonal matrix <inline-formula><tex-math>\Sigma</tex-math></inline-formula> with U, giving a low-dimensional, compact representation
of the initial high-dimensional matrix. Note that <inline-formula><tex-math>E \in \mathbb{R}^{|D| \times d}</tex-math></inline-formula>, where d is the number
of diagonal entries considered. The obtained E is suitable for a given down-stream
learning task, such as classification (considered in this work). Note that performing
SVD in the text mining domain is also commonly associated with the notion of latent
semantic analysis.</p>
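        <p>A minimal sketch of this reduction step with scikit-learn’s TruncatedSVD, using a toy sparse matrix in place of the n-gram features:</p>
        <preformat>
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the sparse n-gram matrix M (100 authors, 5000 features).
M = scipy.sparse.random(100, 5000, density=0.01, random_state=42)

d = 64  # number of singular values kept; the grid used here goes up to 2048
svd = TruncatedSVD(n_components=d, random_state=42)
E = svd.fit_transform(M)  # shape (100, d): one dense row per author
        </preformat>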
      </sec>
      <sec id="sec-4-2">
        <title>Classifier selection</title>
        <p>For each subtask we performed an extensive grid search using scikit-learn’s [9] GridSearchCV and
identified the classifiers best suited to each task. To this end we conducted a series of
experiments with different settings and linear models, as presented
in the Experiments section. Among the models we used were (scikit-learn’s [9]) Support Vector
Machines, Random Forests and Logistic Regression.
1 https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html</p>
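        <p>An illustrative grid-search sketch for one subtask; the toy data and the parameter grid shown here are assumptions, while the actual search space is given in the Experiments section:</p>
        <preformat>
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

E = np.random.rand(20, 8)   # toy embeddings: one row per celebrity
y = np.tile([0, 1], 10)     # toy binary labels for one subtask

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},  # illustrative values only
    cv=10,                           # 10-fold cross-validation
    scoring="f1_macro",
)
grid.fit(E, y)
best_clf = grid.best_estimator_
        </preformat>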
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>A series of experiments was executed in order to find the best embedding space and
model. We explored various ways of modeling the birthyear variable:
– R (regression): we applied linear regression and an XGBoost regressor [9]
to derive a simple model predicting the years, where the predicted birthyear was clipped
to the interval
        <disp-formula><tex-math>\max(1949, \min(\mathit{predicted\_year}, 1999));</tex-math></disp-formula>
– FC (full classification): we applied a classification learner to the task of discriminating
between each of the 60 classes (one year = one class);
– AC (altered classification): we applied classification to an altered label space in which
we reduced the number of labels to more balanced intervals, finally obtaining 8 of
them: 1949-1958, 1959-1966, 1967-1973, 1974-1980, 1981-1986,
1987-1991, 1992-1995 and 1996-1999. For the final reverse prediction into the
interval we used the following estimates (sketched at the end of this section):
1. predicting the middle of the interval;
2. predicting a random year from the interval.</p>
      <p>For all tasks we used GridSearchCV over the parameter space to find the best
hyperparameter configuration, the embedding dimension k and the number of generated features n.
Using 10-fold cross-validation, the grid covered the embedding dimensions
        <disp-formula><tex-math>k \in \{128, 256, 512, 640, 768, 1024, 2048\}</tex-math></disp-formula>
and the numbers of generated features
        <disp-formula><tex-math>n \in \{2500, 5000, 10000, 20000, 30000, 50000\}.</tex-math></disp-formula>
The initial dataset was split into training (90%) and evaluation (10%) sets, from which
we obtain <inline-formula><tex-math>C_{training}</tex-math></inline-formula> and <inline-formula><tex-math>C_{evaluation}</tex-math></inline-formula>. Once constructed, the feature space was considered
for learning. We experimented with XGBoost, logistic regression and linear SVMs, whose
hyperparameters we optimized with 5-fold cross-validation. Finally, we tested the
performance on the <inline-formula><tex-math>C_{evaluation}</tex-math></inline-formula> set.</p>
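      <p>The following sketch illustrates the year clipping used in the R variant and the interval pooling with both reverse-prediction strategies used in the AC variant; the helper names are illustrative:</p>
      <preformat>
import random

BINS = [(1949, 1958), (1959, 1966), (1967, 1973), (1974, 1980),
        (1981, 1986), (1987, 1991), (1992, 1995), (1996, 1999)]

def clip_year(pred):
    """R variant: clamp a regression output to the valid birthyear range."""
    return max(1949, min(int(round(pred)), 1999))

def year_to_bin(year):
    """AC variant: map a birthyear to the index of its interval (0-7)."""
    return next(i for i, (lo, hi) in enumerate(BINS) if year in range(lo, hi + 1))

def bin_to_year(idx, strategy="middle", rng=random.Random(0)):
    """Reverse prediction: middle of the interval or a random year from it."""
    lo, hi = BINS[idx]
    return (lo + hi) // 2 if strategy == "middle" else rng.randint(lo, hi)

# e.g. bin_to_year(year_to_bin(1984)) == 1983 (middle of 1981-1986)
      </preformat>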
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>This section includes the results of the empirical evaluation, used to select the final
model. The obtained results are shown in Table 2.</p>
      <p>[Table 2: evaluation scores for model-AC-2, model-AC-1, model-FC-2, model-FC-1,
model-R and the baseline-ngrams baseline.]</p>
      <p>The best scoring model is model-AC-2, which we chose for the final test evaluation.
Its hyperparameters were: n = 20000 features reduced to k = 512 dimensions; the logistic
regression (occupation and age) regularization parameter was set to C = 1. For gender, an
SVM with C = 1, gamma = scale and a polynomial kernel
was used.</p>
      <p>The best performing model from the experiments yielded the
following results on the test set on the TIRA site. We next present the official ranking of
the proposed solution on the final TIRA test set.</p>
      <p>The proposed system scored the second highest (the first listed in Figure 3 is the
baseline based solely on a given author’s tweet stream). It outperforms the generic
baselines, whilst maintaining a lower dimension of the representation.</p>
    </sec>
    <sec id="sec-7">
      <title>Discussion and conclusions</title>
      <p>As not a single competing submission (Figure 3) achieved performance above the
baseline trained on a given person’s own tweets, this task demonstrates that this type of
classification is exceptionally hard and needs to be fundamentally re-thought to overcome the
full-information models that are aware of a given person’s tweets. A significant improvement
was achieved by thresholding the years and reducing the number of age classes below the
initially given number, since the f1-score for age was based on hitting the right year interval.
This motivated us to vary the interval pooling strategies, of which we used two: the first
generated the middle year of our predefined year interval, and the second guessed a random
year from the interval. The celebrity’s own tweets and the tweets of their followers gave
competitive f1-scores while using relatively simple features (no emojis or similar) and
computationally efficient representation construction methods. Finally, the score was
calculated as the harmonic mean of the three subtask f1-scores:
        <disp-formula><tex-math>cRank = \frac{3}{\frac{1}{f1\_occupation} + \frac{1}{f1\_birthyear} + \frac{1}{f1\_gender}}.</tex-math></disp-formula>
        As seen in the results, we believe that improving the score on one subtask will
only benefit the whole model if we keep or improve the scores on the other subtasks.</p>
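      <p>A small sketch of the cRank computation, with hypothetical scores, illustrates why the weakest subtask dominates the overall ranking:</p>
      <preformat>
def c_rank(f1_occupation, f1_birthyear, f1_gender):
    """Harmonic mean of the three subtask f1-scores."""
    return 3.0 / (1.0 / f1_occupation + 1.0 / f1_birthyear + 1.0 / f1_gender)

# c_rank(0.5, 0.3, 0.6) is about 0.429: raising the weakest score (birthyear)
# improves cRank far more than raising an already strong one.
      </preformat>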
      <p>Further work will include trying out different divisions of the birthyear values by
testing different thresholds, possibly injecting more semantically enriched
vectorization features [11] of tweets, or improving the way the data is sampled to build the
representation of a single celebrity.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>The work of the last author was funded by the Slovenian Research Agency through
a young researcher grant. The work was also supported by the Slovenian Research
Agency (ARRS) core research programme Knowledge Technologies (P2-0103), an ARRS
funded research project Semantic Data Mining for Linked Open Data (financed under
the ERC Complementary Scheme, N2-0078) and the EU Horizon 2020 research and
innovation programme under grant agreement No 825153, project EMBEDDIA
(Cross-Lingual Embeddings for Less-Represented Languages in European News Media).</p>
      <p>CEUR-WS.org (Sep 2019), http://ceur-ws.org/Vol-2380/
11. Škrlj, B., Martinc, M., Kralj, J., Lavracˇ, N., Pollak, S.: tax2vec: Constructing interpretable
features from taxonomies for short text classification. arXiv preprint arXiv:1902.00438
(2019)
12. Busger op Vollenbroek, M., Carlotto, T., Kreutz, T., Medvedeva, M., Pool, C., Bjerva, J.,
Haagsma, H., Nissim, M.: Gronup: Groningen user profiling notebook for PAN at clef
2016. CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers (2016)
13. Wiegmann, M., Stein, B., Potthast, M.: Celebrity profiling. In: Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics. pp. 2611–2618 (2019)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aragón</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          :
          <article-title>Author profiling and aggressiveness detection in spanish tweets: Mex-a3t 2018</article-title>
          . In:
          <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval</source>
          <year>2018</year>
          ),
          <source>CEUR WS Proceedings</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwyer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medvedeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rawee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haagsma</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>N-gram: New groningen author-profiling model</article-title>
          .
          <source>In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Batool</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khattak</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maqbool</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Precise tweet classification and sentiment analysis</article-title>
          .
          <source>In: 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</source>
          . pp.
          <fpage>461</fpage>
          -
          <lpage>466</lpage>
          . IEEE (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoni</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Automatically categorizing written texts by author gender</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Durán</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Author profiling with doc2vec neural network-based document embeddings</article-title>
          .
          <source>In: Mexican International Conference on Artificial Intelligence</source>
          . pp.
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Martinc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Škrlj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Fake or not: Distinguishing between bots, males and females</article-title>
          .
          <source>CLEF 2019 Evaluation Labs and Workshop - Working Notes Papers</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Martinc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Škrjanec</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>PAN 2017: Author profiling - gender and language variety prediction</article-title>
          .
          <source>CLEF 2017 Evaluation Labs and Workshop - Working Notes Papers</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Modaresi</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Liebeck</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Conrad</surname>, <given-names>S.</given-names></string-name>:
          <article-title>Exploring the effects of cross-genre machine learning for author profiling in PAN 2016</article-title>.
          <source>CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers</source>
          (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Pedregosa</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Varoquaux</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Gramfort</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Michel</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Thirion</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Grisel</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Blondel</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Prettenhofer</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Weiss</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Dubourg</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Vanderplas</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Passos</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Cournapeau</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Brucher</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Perrot</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Duchesnay</surname>, <given-names>E.</given-names></string-name>:
          <article-title>Scikit-learn: Machine learning in Python</article-title>.
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>, <fpage>2825</fpage>-<lpage>2830</lpage> (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Radivchev</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Nikolov</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Lambova</surname>, <given-names>A.</given-names></string-name>:
          <article-title>Celebrity profiling using TF-IDF, logistic regression, and SVM - Notebook for PAN at CLEF 2019</article-title>.
          In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>.
          CEUR-WS.org (<year>2019</year>), http://ceur-ws.org/Vol-2380/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Škrlj</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Martinc</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Kralj</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Lavrač</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Pollak</surname>, <given-names>S.</given-names></string-name>:
          <article-title>tax2vec: Constructing interpretable features from taxonomies for short text classification</article-title>.
          <source>arXiv preprint arXiv:1902.00438</source>
          (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Busger op Vollenbroek</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Carlotto</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Kreutz</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Medvedeva</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Pool</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Bjerva</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Haagsma</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Nissim</surname>, <given-names>M.</given-names></string-name>:
          <article-title>GronUP: Groningen user profiling notebook for PAN at CLEF 2016</article-title>.
          <source>CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers</source>
          (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name><surname>Wiegmann</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Stein</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Potthast</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Celebrity profiling</article-title>.
          In:
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>.
          pp. <fpage>2611</fpage>-<lpage>2618</lpage> (<year>2019</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>