<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UniNE at CLEF 2017: TF-IDF and Deep-Learning for Author Profiling Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Nils</forename><surname>Schaetti</surname></persName>
							<email>nils.schaetti@unine.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Neuchâtel</orgName>
								<address>
									<addrLine>rue Emile-Argand 11</addrLine>
									<postCode>2000</postCode>
									<settlement>Neuchâtel</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UniNE at CLEF 2017: TF-IDF and Deep-Learning for Author Profiling Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">ACCE3BC3250059B7B566887750E8E6BD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes and evaluates a strategy for author profiling using TF-IDF and a Deep-Learning model based on Convolutional Neural Networks. We applied this strategy to the author profiling task of the PAN17 challenge and show that it can be applied to different languages (English, Spanish, Portuguese and Arabic). We suggest a simple cleaning method for both models and, as features for the Deep-Learning model, a matrix of 2-grams of letters with punctuation marks and word-beginning and word-ending 2-grams. Applying this strategy, we determine that the TFIDF-based model is the best one for language variety classification and that the Deep-Learning model achieves the highest accuracy on gender classification. The evaluations are based on four tweet collections (PAN AUTHOR PROFILING task at CLEF 2017).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Today, a large amount of data is produced by web applications based on social contexts, like social networks and blogs, where a variety of contents (e.g. pictures, videos, articles, links, texts) are shared directly from web sites and smartphones. Social networks like Facebook and Twitter allow a new kind of communication based on fast interactions which generates multimedia contents with their own characteristics, difficult to compare with traditional texts like essays and articles.</p><p>This raises new questions: can we detect differences in writing styles between men and women, language varieties, age groups or psychological profiles? These questions are appealing because their answers bear on problems created by the age of social networks and blogs, such as fake news, plagiarism and identity theft. The problem of author profiling is therefore of particular interest.</p><p>Moreover, author profiling is becoming more and more important for applications in marketing, security and forensics. For example, in forensic linguistics, one would like to know certain characteristics (gender, age group, socio-cultural background) of the author of harassing messages from their linguistic profile. In marketing, companies and resellers would like to know the characteristics of people liking or disliking their products based on the analysis of blogs and product reviews.</p><p>This paper is organised as follows. Section 2 introduces the dataset used for training and testing, as well as the methodology used to evaluate our approach. Section 3 describes the cleaning and tokenization process. Section 4 explains the proposed TFIDF-based model. Section 5 describes the Deep-Learning based classifier. In section 6, we evaluate the proposed strategy and compare results on the four different test collections. In the last section, we draw conclusions on the main findings and possible future improvements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Tweet collections and methodology</head><p>To carry out experiments on the author profiling task with different algorithms, we need a common ground composed of the same datasets and evaluation measures. To create this common ground, and to enable large-scale studies in the domain of author profiling, the PAN CLEF evaluation campaign was launched (<ref type="bibr" target="#b6">[7]</ref>). Multiple research groups with different backgrounds from around the world have proposed a profiling algorithm to be evaluated in the PAN CLEF 2017 campaign with the same methodology <ref type="bibr" target="#b4">[5]</ref>.</p><p>All teams have used the TIRA platform to evaluate their strategy. This platform can be used to automatically deploy and evaluate software <ref type="bibr" target="#b3">[4]</ref>. The algorithms are evaluated on a common test dataset and with the same measures, but also on the basis of the time needed to produce the response. Access to this test dataset is restricted so that there is no data leakage to the participants during a software run.</p><p>For the PAN CLEF 2017 evaluation campaign, four test collections of tweets were created, one for each of the following languages: English, Spanish, Portuguese and Arabic. Based on these collections, the problem to address was to predict the author's language variety (varieties of the main language) and their gender <ref type="bibr" target="#b5">[6]</ref>.</p><p>The training data was collected from Twitter. Each tweet collection contains texts from a single language, composed of 100 tweets per author. For each author, there are two labels to predict:</p><p>1. The author's gender (male, female); 2. The author's language variety, specific to the language.</p><p>The test sets are also texts collected from Twitter and the task is therefore to predict the gender and language variety of each Twitter author in the test data. 
There is one collection per language (English, Spanish, Portuguese and Arabic). The English collection is composed of 3600 authors coming from six different countries (United States, Great Britain, Ireland, New Zealand, Australia and Canada), with 600 authors for each variety and 1800 for each gender, for a total of 360'000 tweets.</p><p>The Spanish collection is composed of 4200 authors coming from seven different countries (Colombia, Argentina, Spain, Venezuela, Peru, Chile and Mexico), with 2100 authors for each gender, for a total of 420'000 tweets.</p><p>The Portuguese collection is composed of 1200 authors coming from Brazil and Portugal, 600 for each gender, for a total of 120'000 tweets.</p><p>Finally, the Arabic collection is composed of 2400 authors of the Gulf, Levantine, Maghrebi and Egyptian varieties, 1200 for each gender, for a total of 240'000 tweets.</p><p>An overview of these collections is given in table <ref type="table" target="#tab_1">1</ref>. The number of authors from the training set is given under the label "Authors" and the total number of tweets in the collection is indicated under the label "Tweets". The label "Language varieties" shows the varieties for each collection, and the label "Genders" indicates the number of authors for each gender.</p><p>The training data set is well balanced: for each collection, there is the same number of authors for each language variety and gender. The Spanish collection is the biggest with 4'200 authors, and the smallest is the Portuguese collection with 1'200 authors. All the collections have the same number of authors for each language variety (600), but the number of authors per gender varies between collections.</p><p>A similar test set is used to compare the participants' strategies in the PAN CLEF 2017 campaign; we have no information about its size because of the TIRA system.</p><p>For the PAN CLEF 2017 campaign, the software must provide its answer to each problem as XML data. 
The response for the gender is a binary choice (male / female), and the language variety is one of the possible outputs for the language of the collection.</p><p>The overall performance of the system is the joint accuracy for gender and language variety: the number of authors for which both the gender and the language variety are correctly predicted, divided by the number of authors in the collection.</p><p>The accuracies for language variety and gender are also computed separately as the number of correct answers divided by the total number of authors.</p></div>
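The joint measure described above can be sketched in a few lines of Python; this is our own illustration of the computation (the helper name `accuracies` and the sample labels are invented), not the official PAN evaluation code.

```python
# Illustration of the PAN 2017 measures: per-task accuracy and joint accuracy,
# where an author only counts as correct if BOTH gender and variety are right.

def accuracies(gold, pred):
    """gold, pred: lists of (gender, variety) pairs, one per author."""
    n = len(gold)
    gender_acc = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    variety_acc = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    joint_acc = sum(g == p for g, p in zip(gold, pred)) / n
    return gender_acc, variety_acc, joint_acc

# Four fictitious authors: 3/4 genders and 3/4 varieties are correct,
# but both labels are correct for only 2/4 authors.
gold = [("male", "ie"), ("female", "us"), ("female", "gb"), ("male", "au")]
pred = [("male", "ie"), ("male", "us"), ("female", "ca"), ("male", "au")]
g_acc, v_acc, j_acc = accuracies(gold, pred)
```

Note that the joint accuracy (0.5 here) is necessarily at most the minimum of the two per-task accuracies.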
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Cleaning and Tokenization</head><p>Before selecting features from the texts, we need to clean the text and extract tokens. This section explains these two steps.</p><p>To carry out these two steps, we apply a series of rules to the tweet's text in the following order:</p><p>1. Remove URLs (http://.../ ); 2. Remove Twitter quotes (@username); 3. Remove the special characters #, $ and S; 4. Clean numbers (100.00 → 100 S 00, 100,00 → 100 $ 00, 100'000 → 100 # 000); 5. Tokenize punctuation (???? → " ? ? ? ? "); 6. Remove new line characters; 7. Replace useless characters by a space (-, ..., *, /, +, ∖); 8. Remove multiple spaces.</p><p>Step 4 introduces three tokens (S, $ and #) that indicate the way the user represents numbers (point or comma for floats, comma or apostrophe for thousands).</p><p>Each collection is from a different language and, therefore, may use a different alphabet. For each collection, we keep in the text only the letters and punctuation corresponding to the language's alphabet. The alphabets for each language are depicted in table <ref type="table" target="#tab_2">2</ref>.</p><p>We keep accented letters that are not often used in the language to capture their usage by the authors. For example, we keep accents in the English texts to represent the usage of French words by English authors, which could serve as information for profiling the author. We make the hypothesis that authors use more or fewer French or Spanish words depending on the country they come from.</p><p>Concerning hashtags, we keep them and treat them as normal words. At the end of the cleaning process, the words are split on spaces and used as tokens for the feature selection.</p></div>
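The ordered rule list above can be sketched as a regular-expression pipeline. The expressions below are our reconstruction, not the authors' code, and we take the special character "S" of rules 3-4 literally as printed in the paper (it is presumably a special glyph in the original).

```python
import re

def clean_tweet(text):
    """Sketch of the cleaning rules of Section 3, applied in the paper's order."""
    text = re.sub(r"https?://\S+", "", text)        # 1. remove URLs
    text = re.sub(r"@\w+", "", text)                # 2. remove Twitter quotes
    text = re.sub(r"[#$S]", "", text)               # 3. remove #, $ and S
    text = re.sub(r"(\d)\.(\d)", r"\1 S \2", text)  # 4. 100.00  -> 100 S 00
    text = re.sub(r"(\d),(\d)", r"\1 $ \2", text)   #    100,00  -> 100 $ 00
    text = re.sub(r"(\d)'(\d)", r"\1 # \2", text)   #    100'000 -> 100 # 000
    text = re.sub(r"([?.!,;:])", r" \1 ", text)     # 5. tokenize punctuation
    text = text.replace("\n", " ")                  # 6. remove newlines
    text = re.sub(r"[-*/+\\]", " ", text)           # 7. useless chars -> space
    return re.sub(r"\s+", " ", text).strip()        # 8. collapse spaces

tokens = clean_tweet("Check 100'000 views!!! http://t.co/x @bob").split()
```

On this example the number-format marker `#` (introduced after the literal `#` has been removed) and the tokenized exclamation marks survive as separate tokens.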
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">TFIDF-based model</head><p>TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting method often used in information retrieval and in text mining. This statistical measure assesses the importance of a term in a document, in relation to a collection or corpus. The raw frequency of a term is simply the number of occurrences of this term in a specific document; this frequency is also called the term frequency.</p><p>The inverse document frequency is a measure of the importance of a term in the whole collection. In the TF-IDF model, it gives more weight to less frequent terms, considered to be more discriminatory. It consists in calculating the base-10 logarithm of the inverse of the proportion of documents in the corpus that contain the term.</p><p>To describe our problem more formally, a document 𝑑 in our collection is the set of all tweets belonging to a class to predict (gender or language variety). 𝐷 is the set of all documents in the collection and |𝐷| is the number of documents, i.e. the number of classes in the classification problem (|𝐷| = 2 if the problem is to predict gender).</p><p>The term frequency for a term 𝑡 and a document 𝑑 is defined by</p><formula xml:id="formula_0">𝑡𝑓 𝑑,𝑡 = 𝑛 𝑑,𝑡 |𝑑|</formula><p>where 𝑛 𝑑,𝑡 is the number of occurrences of the term 𝑡 in the document 𝑑. The term frequency 𝑡𝑓 𝑑,𝑡 is thus the number of occurrences of the term 𝑡 in document 𝑑 divided by the total number of tokens in the document.</p><p>The inverse document frequency of a term 𝑡 in the whole collection is</p><formula xml:id="formula_1">𝑖𝑑𝑓 𝑡 = 𝑙𝑜𝑔 |𝐷| |{𝑑 : 𝑡 ∈ 𝑑}|</formula><p>where |{𝑑 : 𝑡 ∈ 𝑑}| is the number of documents in which the term 𝑡 appears. The final tfidf value for a document 𝑑 and a term 𝑡 is defined by 𝑡𝑓 𝑖𝑑𝑓 𝑑,𝑡 = 𝑡𝑓 𝑑,𝑡 * 𝑖𝑑𝑓 𝑡 . For each document 𝑑, we compute a vector 𝑡𝑓 𝑖𝑑𝑓 𝑑 with the 𝑡𝑓 𝑖𝑑𝑓 values of all terms in the collection. If a term does not appear in the document, its value is set to zero. When we want to predict the class (gender or language variety) of a previously unseen author, we consider the author as a query 𝑞 and compute the cosine similarity between each 𝑡𝑓 𝑖𝑑𝑓 𝑑 vector and the vector 𝑡𝑓 𝑞 of term frequencies in 𝑞:</p><formula xml:id="formula_2">𝑠𝑖𝑚(𝑑, 𝑞) = 𝑡𝑓 𝑖𝑑𝑓 𝑑 • 𝑡𝑓 𝑞 ‖𝑡𝑓 𝑖𝑑𝑓 𝑑 ‖ * ‖𝑡𝑓 𝑞 ‖ where 𝑡𝑓 𝑞,𝑡 = 𝑛 𝑞,𝑡 |𝑞|</formula><p>Finally, we choose as predicted class ĉ𝑞 for the query 𝑞 the class whose document has the highest similarity. For example, in the case of gender, we choose ĉ𝑞 as ĉ𝑞 = arg max 𝑑∈{𝑚𝑎𝑙𝑒,𝑓 𝑒𝑚𝑎𝑙𝑒} 𝑠𝑖𝑚(𝑑, 𝑞)</p><p>Fig. <ref type="figure">1</ref>: Structure of the input feature matrix. From left to right, the matrix represents the ratios of 2-grams of letters and of single-letter tokens, the ratio of word-ending 2-grams, the ratio of word-beginning letters, the ratio of punctuation marks, and the ratio of word-beginning 2-grams.</p></div>
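The classifier just described can be sketched directly from the formulas: build one tf-idf vector per class document and pick the class with the highest cosine similarity to the query's term-frequency vector. The function names are ours; this is a minimal reconstruction, not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {class: token list}, one 'document' per class. -> {class: {term: tfidf}}"""
    df = Counter()                               # document frequency of each term
    tfs = {}
    for c, tokens in docs.items():
        counts = Counter(tokens)
        tfs[c] = {t: n / len(tokens) for t, n in counts.items()}   # tf_{d,t}
        df.update(counts.keys())
    n_docs = len(docs)                           # |D| = number of classes
    return {c: {t: tf_t * math.log10(n_docs / df[t]) for t, tf_t in tf.items()}
            for c, tf in tfs.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(docs, query_tokens):
    """Assign the query author to the class with the most similar tf-idf vector."""
    vecs = tfidf_vectors(docs)
    tf_q = {t: n / len(query_tokens) for t, n in Counter(query_tokens).items()}
    return max(vecs, key=lambda c: cosine(vecs[c], tf_q))
```

Note that with |𝐷| = 2 a term occurring in both class documents gets idf = log10(1) = 0, so only class-specific terms contribute to the similarity.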
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Two-grams of letters-based Convolutional Neural Networks</head><p>In machine learning, a Convolutional Neural Network (or CNN) is a kind of feedforward artificial neural network <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref> <ref type="bibr" target="#b2">[3]</ref> in which the pattern of connections between the neurons is inspired by the visual cortex.</p><p>In our system, we applied a CNN to a matrix representing the 2-grams of letters for an author in a collection. Figure <ref type="figure">1</ref> shows the structure of the 2-gram matrix.</p><p>There is a row for each letter in the alphabet. From left to right, the matrix is composed of a first part giving the ratio of each 2-gram of letters in the author's tweets, with the top row representing the ratio of single letters. The second part is composed of the ratios of word-ending letters. The next part is the ratio of word-ending 2-grams of letters computed from the tokens obtained after the cleaning process of section 3.</p><p>The fourth part is the ratio of first letters in tokens, and the fifth the ratio of punctuation marks. Finally, the last part is the ratio of 2-grams found at the beginning of each token. This matrix representing an author is the input for the Convolutional Neural Network shown in figure <ref type="figure" target="#fig_1">2</ref>.</p><p>The first layer is a convolution layer of 10 kernels of size 5 × 5. Its output is the input for a second convolution layer of 20 kernels of size 5 × 5, followed by a drop-out layer. This serves as input for two linear layers with ReLU activations. Finally, the outputs are obtained from a softmax function and give the predicted class of the author: the class with the highest corresponding output of the softmax function.</p><p>The training phase consists of using 90% of the training set for training and 10% to evaluate the performance at each iteration. 
Figure <ref type="figure" target="#fig_2">3</ref> shows the evolution of the training and test losses after each iteration of the training phase for each language collection. Vertical lines show the lowest test loss for each collection.</p><p>For English, the lowest loss is reached after 64 iterations, and after 66 for Spanish. For Portuguese and Arabic, the lowest losses are reached after 87 and 38 iterations respectively. We can see that our CNN model quickly overfits, especially on the Arabic language collection. The main challenge with this model is therefore to fight overfitting effectively.</p><p>At the end of the training phase, we keep the CNN obtained at the iteration with the lowest test loss.</p></div>
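The 2-gram feature matrix of this section can be sketched as follows. This is our reconstruction of the feature extraction restricted to the plain 2-gram block; the real input matrix also appends columns for single letters, word-beginning and word-ending letters and 2-grams, and punctuation marks, and uses the per-language alphabets of table 2 rather than the simplified lowercase alphabet assumed here.

```python
# Sketch of the 2-gram block: one row and one column per alphabet symbol,
# entry [i][j] = ratio of the 2-gram alphabet[i]+alphabet[j] among all
# 2-grams observed in the author's cleaned tokens.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def bigram_ratios(tokens):
    n = len(ALPHABET)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for tok in tokens:
        for a, b in zip(tok, tok[1:]):          # consecutive letter pairs
            if a in IDX and b in IDX:
                counts[IDX[a]][IDX[b]] += 1
                total += 1
    if total == 0:
        return counts
    return [[c / total for c in row] for row in counts]

# "hello" + "help" yield the 2-grams he, el, ll, lo, he, el, lp (7 in total)
m = bigram_ratios(["hello", "help"])
```

One such matrix is built per author and serves as the input image consumed by the CNN of figure 2.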
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Evaluation</head><p>To evaluate our two models, we tested their accuracy on each language's training collection. Table <ref type="table" target="#tab_3">3</ref> shows the results of 10-fold cross validation for each combination of model and task. For English, the TF-IDF model attains 83% and 68% accuracy on language variety and gender respectively, against 16% and 50% for the random classifier. The CNN model obtains respectively 65% and 78% accuracy for language variety and gender. The combination of the two models achieves a final accuracy of 65%.</p><p>For Spanish, the TF-IDF model obtains 93% and 64% accuracy on language variety and gender respectively, against 14% and 50% for the random classifier. The CNN model achieves respectively 78% and 72% accuracy for language variety and gender. The combination of the two models reaches a final accuracy of 67%, against 7% for the random classifier.</p><p>For Portuguese, the TF-IDF model achieves an accuracy of 99% and 73% on the language variety and gender classification tasks respectively. The CNN model obtains an accuracy of respectively 98% and 85% for the language variety and gender tasks, against 50% for a random classifier.</p><p>For Arabic, the TF-IDF model achieves an accuracy of 86% and 68% for the language variety and gender classification problems, compared to 25% and 50% for a random classifier. The CNN model obtains an accuracy of respectively 67.5% and 75% for language variety and gender. The combination of both models gives a final accuracy of 64%.</p><p>We can see that the no-free-lunch principle applies here, as the TF-IDF model achieves better results on language variety profiling while the CNN model achieves better results when it comes to profiling gender.</p><p>The TF-IDF model achieves an impressive result for the language variety task on the Spanish collection with 93% accuracy, while there are seven different possible varieties (Argentina, Spain, Chile, Mexico, Colombia, Venezuela, Peru). 
Its best accuracy is attained on the Portuguese collection with 99% accuracy and its lowest on gender classification on the Spanish collection. The CNN model achieves its best performance on the Portuguese collection for language variety classification with 98.3% accuracy and its lowest on the English collection for language variety classification with 65.6%.</p><p>Table <ref type="table" target="#tab_4">4</ref> shows the results on the four training collections obtained on the TIRA platform. The results show the same pattern as the previous 10-fold cross validation. Portuguese-speaking authors are the easiest to profile, with an overall accuracy of 83%, a little lower than with the 10-fold CV (84%). The Arabic authors are the hardest to profile, with an overall accuracy of 69% against 64% for the 10-fold CV. The language variety accuracy is much higher than the gender accuracy, as for the 10-fold CV.</p><p>The difference between the training results and the previous 10-fold cross validation on the gender problem is 11.8 for English, 8.4 for Spanish, -0.78 for Portuguese and 4.5 for Arabic. We see clear differences between the test and training accuracies, except for Portuguese. Table <ref type="table" target="#tab_5">5</ref> shows the results obtained on the four test collections, thanks to the TIRA platform. For the English language collection, the accuracy goes from 83% with the 10-fold CV to 81.5% (-1.83) for language variety, and from 78% to 75% (-2.86) for gender classification.</p><p>The accuracy on the Spanish language collection goes from 93.23% to 93.36% (+0.13) and from 72.3% to 71.07% (-1.31) for language variety and gender respectively. For the Portuguese collection, the accuracy goes from 99.25% to 98.38% (-0.87) and from 85% to 72.5% (-12.5) for language variety and gender respectively. Finally, for the Arabic collection, the accuracy goes respectively from 86% to 81% (-4.78) and from 75% to 71% (-3.56) for language variety and gender.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>This paper proposes a combination of a TFIDF-based model and a Deep-Learning Convolutional Neural Network to predict the language variety and gender of Twitter authors. Based on the hypothesis that an author's writing style can be used to extract their country of origin and their gender, we introduced classifiers that can effectively predict these two characteristics. For both models and all languages, we proposed a simple cleaning process used by both classifiers, and for the CNN classifier we selected as features a matrix of ratios of various 2-grams.</p><p>The TFIDF-based model performs well on language variety classification and achieves its best performance, on the test dataset, on the Portuguese collection with 98% accuracy, and an interesting accuracy of 93% on the Spanish collection. The performances on the English and Arabic collections stay behind, with respectively 81.5% and 81.3%. The CNN model, on the other hand, is effective at classifying authors into gender classes and achieves its best performance on the English collection with 75.1% accuracy. Furthermore, we see two ways to improve this strategy in the future. First, the CNN classifier shows signs of overfitting and a great difference appears between the 10-fold CV and the final test results; some improvements could probably be made to the training phase. Secondly, the matrix of 2-grams could be improved by determining which features are useful or not, which could significantly lower the computational complexity of the model. 
The biggest challenge for the CNN model is the small size of the training collections, and more work could be done on this point to improve the overall performance.</p><p>The biggest challenge of this year's PAN author profiling task was the gender classification problem, where our model achieves an average accuracy of 72.5%, compared with 88.6% for language variety classification.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Structure of the Convolutional Neural Network on 2-grams of letters with the following layers : 10x5x5 kernel, 20x5x5 kernel, drop-out, two linear layer (ReLU), softmax.</figDesc><graphic coords="6,141.56,455.06,332.23,162.04" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Training and test loss after 𝑛 iterations for the gender task on each collection</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>PAN CLEF 2017 training corpora statistics</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Language alphabet for each language collection</figDesc><table><row><cell>Language Alphabet</cell></row></table><note>English Aa Àà Áá Ââ ÃãBbCcÇçDdEe Éé Èè ÊêFfGgHhIi Íí Ïï ÎîJjKkLlMmNn Oo Óó Ôô ÕõPpQqRrSsTtUu Úú Üü ÛûVvWwXxYyZz ?.!,;: Spanish Aa Àà Áá Ââ ÃãBbCcÇçDdEe Éé Èè ÊêFfGgHhIi Íí Ïï ÎîJjKkLlMmNn ÑñOo Óó Ôô ÕõPpQqRrSsTtUu Úú Üü ÛûVvWwXxYyZz ?¿.!¡,;: Portuguese Aa Àà Áá Ââ ÃãBbCcÇçDdEe Éé Èè ÊêFfGgHhIi Íí Ïï ÎîJjKkLlMmNn Oo Óó Ôô ÕõPpQqRrSsTtUu Úú Üü ÛûVvWwXxYyZz ?.!,;: Arabic</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>10-fold cross validation on the four training collections</figDesc><table><row><cell>Corpus</cell><cell>TF-IDF</cell><cell>CNN</cell><cell>Final</cell><cell>Random</cell></row><row><cell>English varieties</cell><cell>0.8333</cell><cell>0.6563</cell><cell></cell><cell>0.1666</cell></row><row><cell>English genders</cell><cell>0.6805</cell><cell>0.7803</cell><cell></cell><cell>0.5000</cell></row><row><cell></cell><cell>0.4724</cell><cell>0.5228</cell><cell>0.6502</cell><cell>0.0833</cell></row><row><cell>Spanish varieties</cell><cell>0.9323</cell><cell>0.7804</cell><cell></cell><cell>0.1428</cell></row><row><cell>Spanish genders</cell><cell>0.6491</cell><cell>0.7238</cell><cell></cell><cell>0.5000</cell></row><row><cell></cell><cell>0.6051</cell><cell>0.5648</cell><cell>0.6747</cell><cell>0.0714</cell></row><row><cell>Portuguese varieties</cell><cell>0.9925</cell><cell>0.9833</cell><cell></cell><cell>0.5000</cell></row><row><cell>Portuguese genders</cell><cell>0.7317</cell><cell>0.8500</cell><cell></cell><cell>0.5000</cell></row><row><cell></cell><cell>0.5313</cell><cell>0.8358</cell><cell>0.8436</cell><cell>0.2500</cell></row><row><cell>Arabic varieties</cell><cell>0.8609</cell><cell>0.6750</cell><cell></cell><cell>0.2500</cell></row><row><cell>Arabic
genders</cell><cell>0.6888</cell><cell>0.7500</cell><cell></cell><cell>0.5000</cell></row><row><cell></cell><cell>0.5929</cell><cell>0.5028</cell><cell>0.6456</cell><cell>0.1250</cell></row><row><cell>Overall</cell><cell>0.6228</cell><cell></cell><cell>0.7035</cell><cell>0.3824</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 :</head><label>4</label><figDesc>Evaluation for the four training collections</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5 :</head><label>5</label><figDesc>Evaluation for the four test collections</figDesc><table><row><cell>Corpus</cell><cell>Variety</cell><cell>Gender</cell><cell>Both</cell><cell>Random</cell></row><row><cell>English</cell><cell>0.8150</cell><cell>0.7517</cell><cell>0.6133</cell><cell>0.0833</cell></row><row><cell>Spanish</cell><cell>0.9336</cell><cell>0.7107</cell><cell>0.6657</cell><cell>0.0714</cell></row><row><cell>Portuguese</cell><cell>0.9838</cell><cell>0.7250</cell><cell>0.7138</cell><cell>0.2500</cell></row><row><cell>Arabic</cell><cell>0.8131</cell><cell>0.7144</cell><cell>0.5863</cell><cell>0.1250</cell></row><row><cell>Overall</cell><cell>0.8863</cell><cell>0.7254</cell><cell>0.6447</cell><cell>0.3824</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Multi-column deep neural networks for image classification</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">C</forename><surname>Ciresan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Meier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
		<idno>CoRR abs/1202.2745</idno>
		<ptr target="http://arxiv.org/abs/1202.2745" />
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</title>
		<author>
			<persName><forename type="first">K</forename><surname>Fukushima</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biological cybernetics</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="193" to="202" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Gradient-based learning applied to document recognition</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haffner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE</title>
				<meeting>the IEEE</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="2278" to="2324" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;17: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Association, CLEF 2017</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Dublin, Ireland; Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017-09-14">September 11-14, 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<title level="m">CLEF 2017 Labs Working Notes</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of the 4th author profiling task at pan 2016: cross-genre evaluations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhoeven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
