    UniNE at CLEF 2017: TF-IDF and Deep-Learning for
                    Author Profiling
                        Notebook for PAN at CLEF 2017

                                        Nils Schaetti

                                   University of Neuchâtel
                                    rue Emile Argand 11
                                 2000 Neuchâtel, Switzerland
                                   nils.schaetti@unine.ch



       Abstract. This paper describes and evaluates a strategy for author profiling using
       TF-IDF and a Deep-Learning model based on Convolutional Neural Networks.
       We applied this strategy to the author profiling task of the PAN17 challenge and
       show that it can be applied to different languages (English, Spanish, Portuguese
       and Arabic). For both models, we suggest a simple cleaning method, and for the
       Deep-Learning model, a matrix of 2-grams of letters with punctuation marks and
       word-beginning and word-ending 2-grams as features. Applying this strategy, we
       determine that the TF-IDF-based model is the best one for language variety
       classification and that the Deep-Learning model achieves the highest accuracy
       on gender classification. The evaluations are based on four tweet collections
       (PAN Author Profiling task at CLEF 2017).


1    Introduction

   Today, a large amount of data is produced by web applications based on social con-
texts, like social networks and blogs, where a variety of contents (e.g. pictures, videos,
articles, links, texts) are shared directly from web sites and smartphones. Social net-
works like Facebook and Twitter allow a new kind of communication based on fast
interactions which generate multimedia contents with their own characteristics, that are
difficult to compare with traditional texts like essays and articles.
   This raises new questions: can we detect differences in writing styles between men
and women, language varieties, age groups or psychological profiles? These questions
are appealing because they address new problems created by the age of social networks
and blogs, such as fake news, plagiarism and identity theft. The problem of author
profiling is therefore of particular interest.
   Moreover, author profiling is becoming more and more important for applications
in marketing, security and forensics. For example, in forensic linguistics, one would
like to know certain characteristics (gender, age group, socio-cultural background) of
the author of harassing messages from their linguistic profile. In marketing, companies
and resellers would like to know the characteristics of people liking or disliking their
products, based on the analysis of blogs and product reviews.
   This paper is organised as follows. Section 2 introduces the dataset used for training
and testing, as well as the methodology used to evaluate our approach. Section 3 de-
scribes the cleaning and tokenization process. Section 4 explains the proposed TF-IDF-
based model. Section 5 describes the Deep-Learning based classifier. In Section 6, we
evaluate the strategy we created and compare results on the four different test collec-
tions. In the last section, we draw conclusions on the main findings and possible future
improvements.


2   Tweet collections and methodology
   To carry out experiments on the author profiling task with different algorithms, we
need a common ground composed of the same datasets and evaluation measures. To
create this common ground, and to enable large-scale studies in the domain of author
profiling, the PAN CLEF evaluation campaign was launched [7]. Multiple research
groups with different backgrounds from around the world have proposed profiling
algorithms to be evaluated in the PAN CLEF 2017 campaign with the same methodology
[5].
   All teams have used the TIRA platform to evaluate their strategy. This platform can be
used to automatically deploy and evaluate a software submission [4]. The algorithms are
evaluated on a common test dataset and with the same measures, but also on the basis
of the time needed to produce the response. Access to this test dataset is restricted so
that there is no data leakage to the participants during a software run.
   For the PAN CLEF 2017 evaluation campaign, four test collections of tweets were
created, one for each of the following languages : English, Spanish, Portuguese and
Arabic. Based on these collections, the problem to address was to predict the author’s
language variety (varieties of the main language) and its gender [6].
   The training data was collected from Twitter. For each collection, the texts come
from the same language and are composed of tweets from authors, 100 tweets per
author. For each author, there are two labels we can predict:
 1. the author's gender (male, female);
 2. the author's language variety, specific to the language.
   The test sets are also texts collected from Twitter and the task is therefore to predict
the gender and language variety for each Twitter author in the test data. There is one
collection per language (English, Spanish, Portuguese and Arabic). The English collec-
tion is composed of 3600 authors coming from six different countries (United States,
Great Britain, Ireland, New Zealand, Australia and Canada), 600 for each variety and
1800 for each gender, for a total of 360'000 tweets.
   The Spanish collection is composed of 4200 authors coming from seven different
countries (Colombia, Argentina, Spain, Venezuela, Peru, Chile and Mexico), 600 for
each variety and 2100 for each gender, for a total of 420'000 tweets.
   The Portuguese collection is composed of 1200 authors coming from Brazil and Por-
tugal, 600 for each gender, for a total of 120’000 tweets.
   Finally, the Arabic collection is composed of 2400 authors from the Gulf, Levantine,
Maghrebi and Egyptian varieties, 1200 for each gender, for a total of 240'000 tweets.
   An overview of these collections is depicted in table 1. The number of authors from
                                        Training sets
      Corpus       Authors   Tweets   Language varieties                    Genders
      English      3600      360k     US, GB, Ireland, New Zealand,         1800; 1800
                                      Australia, Canada
      Spanish      4200      420k     Colombia, Argentina, Spain,           2100; 2100
                                      Venezuela, Peru, Chile, Mexico
      Portuguese   1200      120k     Portugal, Brazil                      600; 600
      Arabic       2400      240k     Gulf, Levantine, Maghrebi, Egypt      1200; 1200

                   Table 1: PAN CLEF 2017 training corpora statistics


the training set is given under the label "Authors" and the total number of tweets in
the collection is indicated under the label "Tweets". The label "Language varieties"
shows the varieties for each collection, and the label "Genders" indicates the number
of authors for each gender.
   The training data set is well balanced: for each collection, there is the same number
of authors for each language variety and gender. The Spanish collection is the biggest
with 4'200 authors, and the smallest is the Portuguese collection with 1'200 authors.
All the collections have the same number of authors for each language variety (600),
but the number per gender varies.
   A similar test set is used to compare the participants' strategies in the PAN CLEF
2017 campaign; we have no information about its size due to the restricted TIRA
setup.
   For the PAN CLEF 2017 campaign, the software must provide its answer to each
problem as an XML file. The response for the gender is a binary choice (male / fe-
male), and the language variety is one of the possible varieties for the language of the
collection.
   The overall performance of the system is the joint accuracy of the gender and lan-
guage variety predictions. This joint accuracy is the number of authors for whom both
the gender and the language variety are correctly predicted, divided by the number of
authors in the collection.
   The accuracies for language variety and gender are also computed separately as the
number of correct answers divided by the total number of authors.
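   As a concrete illustration, these measures can be computed as follows (a minimal
sketch with invented author ids and labels, not the official PAN evaluation code):

```python
def accuracies(gold, pred):
    """Variety, gender and joint accuracy over a set of authors.

    gold and pred map an author id to a (variety, gender) pair.
    """
    n = len(gold)
    variety = sum(pred[a][0] == gold[a][0] for a in gold) / n
    gender = sum(pred[a][1] == gold[a][1] for a in gold) / n
    joint = sum(pred[a] == gold[a] for a in gold) / n   # both labels correct
    return variety, gender, joint

gold = {"a1": ("spain", "female"), "a2": ("mexico", "male"),
        "a3": ("chile", "female"), "a4": ("peru", "male")}
pred = {"a1": ("spain", "female"), "a2": ("mexico", "female"),
        "a3": ("peru", "female"), "a4": ("peru", "male")}
print(accuracies(gold, pred))  # (0.75, 0.75, 0.5)
```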


3   Cleaning and Tokenization

   Before selecting the features from the texts, we need to clean the text and extract
tokens. This section explains these two steps.
   To carry them out, we apply a series of rules to the tweet's text in the
following order:
 1. Remove URLs (http://...);
 2. Remove Twitter mentions (@username);
 3. Remove the special characters #, $ and S;
 4. Clean numbers (100.00 → 100 S 00, 100,00 → 100 $ 00, 100'000 → 100 # 000);
 5. Tokenize punctuation (???? → " ? ? ? ? ");
 6. Remove new line characters;
 7. Replace useless characters by a space (-, ..., *, /, +, \);
 8. Remove multiple spaces.
   Step 4 introduces three tokens (S, $ and #) that indicate how the user represents
numbers (point or comma for decimals, comma or apostrophe for thousands).
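   The rules above can be sketched with regular expressions as follows (our own
approximate re-implementation, not the submitted code; in particular, we assume that
rule 3 targets the stand-alone symbol S, since dropping every capital S would damage
words):

```python
import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+", " ", text)        # 1. remove URLs
    text = re.sub(r"@\w+", " ", text)                # 2. remove @username mentions
    # 3. remove #, $ and stand-alone S before step 4 introduces them as tokens
    text = re.sub(r"[#$]|(?<!\w)S(?!\w)", " ", text)
    text = re.sub(r"(\d)\.(\d)", r"\1 S \2", text)   # 4. 100.00  -> 100 S 00
    text = re.sub(r"(\d),(\d)", r"\1 $ \2", text)    #    100,00  -> 100 $ 00
    text = re.sub(r"(\d)'(\d)", r"\1 # \2", text)    #    100'000 -> 100 # 000
    text = re.sub(r"([?.!,;:])", r" \1 ", text)      # 5. tokenize punctuation
    text = text.replace("\n", " ")                   # 6. remove new lines
    text = re.sub(r"[-*/+\\]", " ", text)            # 7. useless chars -> space
    return re.sub(r"\s+", " ", text).strip()         # 8. collapse multiple spaces

print(clean_tweet("Check https://t.co/abc @bob it costs 100.00 now!!!"))
# Check it costs 100 S 00 now ! ! !
```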
   Each collection is in a different language and, therefore, may use a different
alphabet. For each collection, we keep in the text only the letters and punctuation cor-
responding to the language's alphabet. The alphabets for each language are depicted in
table 2.
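   This per-language filtering can be sketched as follows (a hypothetical helper; the
alphabet below is an abbreviated stand-in for the full English row of table 2):

```python
# abbreviated English alphabet (accented letters partly omitted for brevity)
ENGLISH = set("abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "àâéèêçÀÂÉÈÊÇ?.!,;: ")

def keep_alphabet(text, alphabet):
    """Drop every character that does not belong to the language's alphabet."""
    return "".join(c for c in text if c in alphabet)

# the emoji is dropped; the extra space is removed by rule 8 of the cleaning step
print(keep_alphabet("café ☕ ok!", ENGLISH))
```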
   We keep accented letters that are not often used in the language in order to capture
their usage by the authors. For example, we keep accents in the English texts to capture
the usage of French words by English authors, which could provide useful information
for profiling the author. We make the hypothesis that authors use more or fewer French
or Spanish words depending on the country they come from.
   Concerning the hashtags, we keep them and treat them as normal words. At the
end of the cleaning process, the words are split on spaces and used as tokens for
feature selection.


4     TFIDF-based model
   The TF-IDF (Term-Frequency-Inverse Document Frequency) is a weighting method
often used in information retrieval and mostly in text mining. This statistical measure



 Language    Alphabet                                                      Punctuation

 English     AaÀàÁáÂâÃãBbCcÇçDdEeÉéÈèÊêFfGgHhIiÍíÏïÎîJjKkLlMmNn
             OoÓóÔôÕõPpQqRrSsTtUuÚúÜüÛûVvWwXxYyZz                          ?.!,;:

 Spanish     AaÀàÁáÂâÃãBbCcÇçDdEeÉéÈèÊêFfGgHhIiÍíÏïÎîJjKkLlMmNn
             ÑñOoÓóÔôÕõPpQqRrSsTtUuÚúÜüÛûVvWwXxYyZz                        ?¿.!¡,;:

 Portuguese  AaÀàÁáÂâÃãBbCcÇçDdEeÉéÈèÊêFfGgHhIiÍíÏïÎîJjKkLlMmNn
             OoÓóÔôÕõPpQqRrSsTtUuÚúÜüÛûVvWwXxYyZz                          ?.!,;:

 Arabic

                 Table 2: Language alphabet for each language collection
makes it possible to assess the importance of a term in a document, relative to a
collection or corpus. The raw frequency of a term is simply the number of occurrences
of this term in a specific document; this frequency is also called the term frequency.
   The inverse document frequency is a measure of the importance of a term in the whole
collection. In the TF-IDF model, it gives more weight to less frequent terms, considered
to be more discriminatory. It is computed as the base-10 logarithm of the inverse of
the proportion of documents in the corpus that contain the term.
   To describe our problem more formally, a document 𝑑 in our collection is the set of
all tweets belonging to a class to predict (gender or language variety). 𝐷 is the set of
all documents in the collection and |𝐷| is the number of documents in the collection
(|𝐷| = 2 if the problem is to predict gender).
   The term frequency for a term 𝑡 and a document 𝑑 is therefore defined by

                                  𝑡𝑓𝑑,𝑡 = 𝑛𝑑,𝑡 / |𝑑|
   where 𝑛𝑑,𝑡 is the number of occurrences of the term 𝑡 in the document 𝑑. The term
frequency 𝑡𝑓𝑑,𝑡 is then the number of occurrence of the term 𝑡 in document 𝑑 divided
by the total number of tokens in the document.
   The inverse document frequency of a term 𝑡 in the whole collection is

                             𝑖𝑑𝑓𝑡 = 𝑙𝑜𝑔 (|𝐷| / |{𝑑 : 𝑡 ∈ 𝑑}|)
   where |𝐷| is the number of classes in the classification problem and |{𝑑 : 𝑡 ∈ 𝑑}|
is the number of document(s) where the term 𝑡 appears. The final tfidf value for a
document 𝑑 and term 𝑡 is defined by
              𝑡𝑓 𝑖𝑑𝑓𝑑,𝑡 = 𝑡𝑓𝑑,𝑡 * 𝑖𝑑𝑓𝑡 = (𝑛𝑑,𝑡 / |𝑑|) * 𝑙𝑜𝑔 (|𝐷| / |{𝑑𝑗 : 𝑡 ∈ 𝑑𝑗 }|)
Fig. 1: Structure of the input features matrix. From left to right, the matrix represents
the ratios of 2-grams of letters and of single-letter tokens, the ratio of word ending 2-
grams, the ratio of word beginning letters, the ratio of punctuation marks, and the ratio
of word beginning 2-grams.

   For each document 𝑑, we compute a vector 𝑡𝑓 𝑖𝑑𝑓𝑑 with the 𝑡𝑓 𝑖𝑑𝑓 values for each
term in the collection. If a term does not appear in the document, its value is set to
zero. When we want to predict the class (gender or language variety) of a previously
unseen author from the collection, we consider it as a query 𝑞 and compute the cosine
similarity between the 𝑡𝑓 𝑖𝑑𝑓𝑑 vector and the vector 𝑡𝑓𝑞 of term frequencies in 𝑞 :

                     𝑠𝑖𝑚(𝑑, 𝑞) = (𝑡𝑓 𝑖𝑑𝑓𝑑 · 𝑡𝑓𝑞 ) / (‖𝑡𝑓 𝑖𝑑𝑓𝑑 ‖ * ‖𝑡𝑓𝑞 ‖)

where

                                  𝑡𝑓𝑞,𝑡 = 𝑛𝑞,𝑡 / |𝑞|

   and finally, we choose as predicted class 𝑐ˆ𝑞 for the query 𝑞 the class with the
highest similarity. For example, in the case of gender, we choose 𝑐ˆ𝑞 as

                            𝑐ˆ𝑞 =     arg max       𝑠𝑖𝑚(𝑑, 𝑞)
                                  𝑑∈{𝑚𝑎𝑙𝑒,𝑓 𝑒𝑚𝑎𝑙𝑒}
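   Putting the formulas together, the whole classifier can be sketched as follows (a
simplified re-implementation; the class names and example tokens are invented for
illustration):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps a class name to the list of all its tokens (one document per class)."""
    df = Counter()                      # in how many documents each term appears
    for tokens in docs.values():
        df.update(set(tokens))
    n = len(docs)                       # |D|: number of classes
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        size = len(tokens)              # |d|: number of tokens in the document
        # note: with |D| = 2, a term present in both classes gets idf = 0
        vectors[name] = {t: (c / size) * math.log10(n / df[t]) for t, c in tf.items()}
    return vectors

def cosine(v, w):
    dot = sum(x * w.get(t, 0.0) for t, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

def predict(vectors, query_tokens):
    """Class whose tfidf vector is most similar to the query's tf vector."""
    tf_q = {t: c / len(query_tokens) for t, c in Counter(query_tokens).items()}
    return max(vectors, key=lambda name: cosine(vectors[name], tf_q))

docs = {"gb": "queue lorry queue flat".split(),
        "us": "truck apartment truck line".split()}
print(predict(tfidf_vectors(docs), "queue flat".split()))  # gb
```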



5   Two-grams of letters-based Convolutional Neural Networks

   In machine learning, a Convolutional Neural Network (or CNN) is a kind of feed-
forward artificial neural network [1,2,3] in which the patterns of connection between
the neurons are inspired by the visual cortex.
   In our system, we applied a CNN to a matrix representing the 2-grams of letters for
an author in a collection. Figure 1 shows the structure of the 2-gram matrix.
   There is a row for each letter in the alphabet. From left to right, the matrix is com-
posed of a first part giving the ratio of each 2-gram of letters in the author's tweets,
with the upper line representing the ratio of single letters. The second part is composed
of the ratios of word-ending letters, and the next part of the ratios of word-ending
2-grams of letters, computed from the tokens obtained after the cleaning process of
section 3.
   The fourth part is the ratio of first letters in tokens, and the fifth the ratio of punctua-
tion marks. Finally, the last part is the ratio of 2-grams found at the beginning of each
token. This matrix representing an author is the input for the Convolutional Neural
Network shown in figure 2.

Fig. 2: Structure of the Convolutional Neural Network on 2-grams of letters with the
following layers : 10x5x5 kernel, 20x5x5 kernel, drop-out, two linear layers (ReLU),
softmax.
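   The first part of this matrix (ratios of letter 2-grams) can be sketched as follows (a
simplified version using a plain lowercase alphabet; the real system uses the per-language
alphabets of table 2 and appends the other parts as extra columns):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # placeholder; see table 2 for the real ones
IDX = {c: i for i, c in enumerate(ALPHABET)}

def bigram_part(tokens):
    """One row and column per letter; entry (a, b) is the ratio of the 2-gram ab."""
    counts = Counter()
    for tok in tokens:
        tok = tok.lower()
        for a, b in zip(tok, tok[1:]):
            if a in IDX and b in IDX:
                counts[a, b] += 1
    total = sum(counts.values()) or 1
    matrix = [[0.0] * len(ALPHABET) for _ in ALPHABET]
    for (a, b), c in counts.items():
        matrix[IDX[a]][IDX[b]] = c / total
    return matrix

m = bigram_part("the author writes the tweet".split())
print(m[IDX["t"]][IDX["h"]])              # ratio of the 2-gram "th"
```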
   The first layer is a convolution layer of 10 kernels of size 5 × 5. Its output is the
input of a second convolution layer of 20 kernels of size 5 × 5, followed by a drop-out
layer. This serves as input for two linear layers with ReLU activations. Finally, the
outputs are obtained from a softmax function and give the predicted class of the author:
the class with the highest corresponding softmax output.
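   The shapes flowing through these layers can be checked with simple arithmetic (a
sketch assuming 5 × 5 "valid" convolutions without padding or pooling; the input height
and width below are placeholders, since the real size depends on the alphabet and the
number of matrix parts):

```python
def conv_out(size, kernel=5, stride=1):
    """Output size of a 'valid' convolution along one dimension."""
    return (size - kernel) // stride + 1

h, w = 60, 120                       # hypothetical input matrix size
h, w = conv_out(h), conv_out(w)      # after the 10-kernel 5x5 layer: 10 x 56 x 116
h, w = conv_out(h), conv_out(w)      # after the 20-kernel 5x5 layer: 20 x 52 x 112
flat = 20 * h * w                    # input size of the first linear layer
print(h, w, flat)                    # 52 112 116480
```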
    The training phase consists of using 90% of the training set for training and 10% to
evaluate the performances at each iteration. The figure 3 shows the evolution of training
and test losses after each iteration of the training phase for each language collection.
Vertical lines show the lowest test loss for each collection.
   For English, the lowest loss is attained after 64 iterations, and after 66 for Spanish.
For Portuguese and Arabic, the lowest losses are attained after 87 and 38 iterations
respectively. We can see that our CNN model quickly overfits, especially on the Arabic
collection. The main challenge with this model is therefore to fight overfitting effectively.
   At the end of the training phase, we keep the CNN obtained at the iteration with
the lowest test loss.
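   This selection rule amounts to a simple argmin over the logged held-out losses (a
sketch with hypothetical loss values; in practice, a snapshot of the weights is also stored
whenever the loss improves, and the best snapshot is restored at the end):

```python
def best_iteration(test_losses):
    """Index of the iteration with the lowest held-out loss."""
    return min(range(len(test_losses)), key=test_losses.__getitem__)

# hypothetical per-iteration losses on the 10% held-out split
losses = [0.69, 0.55, 0.48, 0.45, 0.47, 0.52]
print(best_iteration(losses))  # 3
```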


6      Evaluation

   To evaluate our two models we tested their accuracy on each language training col-
lection. Table 3 shows the results of a 10-fold cross validation for each combination of
model and collection, with the random baseline as comparison.

Fig. 3: Training and test loss after 𝑛 iterations for the gender task on each collection
(training and test curves for English, Spanish, Portuguese and Arabic; vertical lines
mark the lowest test loss for each collection).
   For English, the TF-IDF model attains 83% and 68% accuracy on language variety
and gender respectively, against 16% and 50% for the random classifier. The CNN
model obtains respectively 65% and 78% accuracy for language variety and gender.
The combination of the two models achieves a final joint accuracy of 65%.
   For Spanish, the TF-IDF model obtains 93% and 64% accuracy on language variety
and gender respectively, against 14% and 50% for the random classifier. The CNN
model achieves respectively 78% and 72% accuracy for language variety and gender.
The combination of the two models reaches a final joint accuracy of 67%, against 7%
for the random classifier.
   For Portuguese, the TF-IDF model achieved an accuracy of 99% and 73% on the
language variety and gender classification task respectively. The CNN model got an ac-
curacy of respectively 98% and 85% for the language variety and gender tasks against
50% for a random classifier.
   For Arabic, the TF-IDF model achieved an accuracy of 86% and 68% for the lan-
guage variety and gender classification problem compared to 25% and 50% for a ran-
dom classifier. The CNN model got an accuracy of respectively 67.5% and 75% for
language variety and gender. The combination of both models gives a final accuracy of
64%.
   We can see that the no-free-lunch principle applies here: the TF-IDF model achieved
better results on language variety profiling, while the CNN model achieved better
results when it comes to profiling gender.
   The TF-IDF model achieves an impressive result for the language variety task on
the Spanish collection, with 93% accuracy while there are 7 different possible varieties
(Argentina, Spain, Chile, Mexico, Colombia, Venezuela, Peru). Its best accuracy is at-
tained on the Portuguese collection with 99% accuracy, and its lowest on the gender
classification on the Spanish collection. The CNN model achieves its best performance
on the Portuguese collection for language variety classification with 98.3% accuracy
and its lowest on the English collection for language variety classification with 65.6%.

           Corpus                  TF-IDF      CNN        Final     Random
           English varieties       0.8333     0.6563                0.1666
           English genders         0.6805     0.7803                0.5000
           English both            0.4724     0.5228     0.6502     0.0833
           Spanish varieties       0.9323     0.7804                0.1428
           Spanish genders         0.6491     0.7238                0.5000
           Spanish both            0.6051     0.5648     0.6747     0.0714
           Portuguese varieties    0.9925     0.9833                0.5000
           Portuguese genders      0.7317     0.8500                0.5000
           Portuguese both         0.5313     0.8358     0.8436     0.2500
           Arabic varieties        0.8609     0.6750                0.2500
           Arabic genders          0.6888     0.7500                0.5000
           Arabic both             0.5929     0.5028     0.6456     0.1250
           Overall                 0.6228                0.7035     0.3824

           Table 3: 10-fold cross validation on the four training collections
   Table 4 shows the results on the four training collections obtained on the TIRA
platform. The results show the same pattern as the previous 10-fold cross validation.
Portuguese-speaking authors are the easiest to profile, with an overall accuracy of 83%,
a little lower than with the 10-fold CV (84%). The Arabic authors are the hardest to
profile, with an overall accuracy of 69% against 64% for the 10-fold CV. The language
variety accuracy is much higher than the gender accuracy, as for the 10-fold CV.

           Corpus                   Variety    Gender      Both      Random
           English                  0.9717     0.7903     0.7681     0.0833
           Spanish                  0.9895     0.7674     0.7595     0.0714
           Portuguese               0.9992     0.8367     0.8358     0.2500
           Arabic                   0.8967     0.7521     0.6904     0.1250
           Overall                  0.9642     0.7866     0.7634     0.3824

                  Table 4: Evaluation for the four training collections

   The difference between these training results and the previous 10-fold cross valida-
tion on the gender problem is 11.8 percentage points for English, 8.4 for Spanish, -0.78
for Portuguese and 4.5 for Arabic. We have clear differences between the training and
cross-validation accuracies, except for Portuguese.
   Table 5 shows the results obtained on the four test collections, thanks to the TIRA
platform. For the English collection, the accuracy goes from 83% with the 10-fold CV
to 81.5% (-1.83) for language variety, and from 78% to 75% (-2.86) for gender clas-
sification.
   The accuracy on the Spanish collection goes from 93.23% to 93.36% (+0.13) and
from 72.3% to 71.07% (-1.31) for language variety and gender respectively. For the
Portuguese collection, the accuracy goes from 99.25% to 98.38% (-0.87) and from 85%
to 72% (-12.5) for language variety and gender respectively. Finally, for the Arabic
collection, the accuracy goes from 86% to 81% (-4.78) and from 75% to 71% (-3.56)
for language variety and gender respectively.


7   Conclusion
   This paper proposes a combination of a TF-IDF-based model and a Deep-Learning
Convolutional Neural Network to predict the language variety and gender of Twitter
authors. Based on the hypothesis that an author's writing style can be used to extract
his or her country of origin and gender, we introduced classifiers that can effectively
predict these two characteristics. The TF-IDF-based model shows a good performance
on language variety classification; the CNN model, on the other hand, is effective at
classifying authors into gender classes. For both models and for all languages, we
proposed a simple cleaning process used by both classifiers, and we selected as features
for the CNN classifier a matrix of ratios of various 2-grams.

            Corpus                     Variety     Gender       Both      Random
            English                    0.8150      0.7517      0.6133      0.0833
            Spanish                    0.9336      0.7107      0.6657      0.0714
            Portuguese                 0.9838      0.7250      0.7138      0.2500
            Arabic                     0.8131      0.7144      0.5863      0.1250
            Overall                    0.8863      0.7254      0.6447      0.3824

                      Table 5: Evaluation for the four test collections
   The TF-IDF-based model performs well on language variety classification and
achieves its best performance on the test dataset on the Portuguese collection with 98%
accuracy, with an interesting accuracy of 93% on the Spanish collection. The perfor-
mances on the English and Arabic collections stay behind, with respectively 81.5% and
81.3%. The CNN model achieves its best performance on the English collection with
75.1% accuracy. Furthermore, we see two ways to improve this strategy in the future.
First, the CNN classifier shows signs of overfitting, and a large difference appears
between the 10-fold CV and the final test results; improvements could probably be
made to the training phase. Secondly, the matrix of 2-grams could be improved by
determining which features are useful and which are not, which could significantly
lower the computational complexity of the model. The biggest challenge for the CNN
model is the small size of the training collections, and more work could be done on
this point to improve the overall performance.
   The biggest challenge of this year's PAN author profiling task was the gender clas-
sification problem, where our model achieves an average of 72.5% accuracy, compared
with 88.6% for language variety classification.


References

1. Ciresan, D.C., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image
   classification. CoRR abs/1202.2745 (2012), http://arxiv.org/abs/1202.2745
2. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of
   pattern recognition unaffected by shift in position. Biological cybernetics 36(4), 193–202
   (1980)
3. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
   recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
4. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Re-
   producibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author
   Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms,
   E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualiza-
   tion. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer,
   Berlin Heidelberg New York (Sep 2014)
5. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of
   PAN’17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G., Law-
   less, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experi-
   mental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference
   of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings.
   Springer, Berlin Heidelberg New York (Sep 2017)
6. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at
   PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato, L., Ferro,
   N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Labs Working Notes
7. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the
   4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. Working Notes Papers of
   the CLEF (2016)