Language Variety and Gender Classification for Author Profiling in PAN 2017 Notebook for PAN at CLEF 2017

Alexander Ogaltsov (ogaltsov@ap-team.ru), Alexey Romanov (alexey.romanov@phystech.edu)

Antiplagiat CJSC; Higher School of Economics; Moscow Institute of Physics and Technology

We describe our method for the PAN 2017 Author Profiling task, which studies profile aspects such as gender and language variety. We use high-order character n-grams as features and logistic regression as the classifier for all subtasks. This approach proves simple and effective for the task. We also investigate feature importances and low-dimensional embeddings of the data.

Introduction

The author profiling task considers different profile dimensions of the author of a text. This year's shared task [12] [11] focuses on gender and language variety. Previous competitions explored properties such as gender, age group [13] and personality traits [8]. The task is interesting from both industrial and scientific points of view. Applications such as accurate advertising targeting, security and forensics make it highly relevant in practice. It can also serve as a tool for filling in missing information about a person in political or demographic research.

The research community also pays attention to the task: a dedicated track of the PAN [7] shared task has been held since 2013. Each year has contributed a new language or a new profile dimension to classify, and gender identification has been part of every edition. The first task used blog data in Spanish and English [10]. The 2014 competition concentrated on different sources such as reviews and tweets [9]. The 2015 task was extended with additional languages and real-valued personality traits [8]. The main characteristic of the most recent shared task was its cross-genre setting: the goal was to develop a model that is robust to the domain of the data [13].

Since gender identification was present in all previous competitions, many approaches have been tested. The main features were n-grams and various text statistics [4]. The language variety task appears at PAN for the first time in 2017, but there have been other language variety detection competitions, such as Discriminating between Similar Languages and national language varieties (DSL) 2016 [1]. The winning approach of that contest used character n-grams over a wide range (1-7) with a linear classifier [3]. We used this method not only for the language variety task but also for gender classification.

Language variety is the new feature of the current shared task. Each language has several variants; for instance, there are two variants of Portuguese, Brazilian and European, and the task is to distinguish one from the other. The languages and their varieties are listed in Table 1. Our approach tries to extract features automatically for each variety of Portuguese, English, Spanish and Arabic without any linguistic knowledge. We use character n-grams as features and logistic regression as a classifier. The evaluation metric is accuracy for both subtasks.

Methodology

This section describes our approach to the current PAN Author Profiling task. First, we briefly discuss preprocessing. Then we describe how we construct the feature space. Finally, we explain our choice of logistic regression as the classifier.

Preprocessing

We did not perform any preprocessing such as removing hashtags, HTML tags or URLs, because we considered these elements potentially informative features.

Classification

Our main assumption was to treat all short texts written by a single author as one object in the machine learning formulation of the task. We formulated the problem as a classification task with two or more classes, depending on the language (Table 1). If a language has more than two varieties, we used a "one versus other" scheme. Let the dataset

$$D = \{(x_i, y_i)\}, \quad i = 1, \ldots, m,$$

consist of "object-class" pairs with $x_i \in \mathbb{R}^d$. Each object $x_i$ has one of $Z$ class labels $y_i \in Y = \{1, \ldots, Z\}$. We have to find a mapping $f \in F: \mathbb{R}^d \to Y$ that minimizes the empirical risk on the dataset $D$:

$$\hat{f} = \arg\min_{f \in F} \sum_{(x_i, y_i) \in D} [f(x_i) \neq y_i],$$

where $F$ is the family of models and $[\cdot]$ equals 1 if the condition inside the brackets holds and 0 otherwise.

The feature space was constructed by counting, for each language corpus, character-level n-grams over a fixed range of n; these counts were used as features. The number of authors and features for the different tasks can be found in Table 2. One can see that the data is quite sparse. The density distribution of non-zero n-grams for Portuguese is shown in Figure 1. We did not use higher-order n-grams because of RAM restrictions, although [3] reported that quality increases up to the 7-gram level. We performed classification with a logistic regression model with regularization parameter C = 1. We chose logistic regression because it has high bias and low variance.
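The feature construction and classifier described above could be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the exact code used for the submission: the toy texts and labels are invented, the n-gram range (1, 4) is a placeholder because the range actually used is not stated here, `lowercase=False` is our reading of "no preprocessing", and `OneVsRestClassifier` is one possible realization of the "one versus other" scheme.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration: each string stands for all
# short texts of one author concatenated into a single object.
authors = [
    "eu estou a ver o autocarro no pequeno-almoço #lisboa",
    "o comboio chegou e eu estou a trabalhar http://t.co/x",
    "tu estás a falar com o miúdo no telemóvel",
    "eu estou vendo o ônibus no café da manhã #rio",
    "o trem chegou e eu estou trabalhando http://t.co/y",
    "você está falando com o garoto no celular",
]
varieties = ["portugal", "portugal", "portugal", "brazil", "brazil", "brazil"]


def build_model(ngram_range=(1, 4)):
    """Character n-gram counts followed by logistic regression with C = 1.

    The default n-gram range is an assumption made for this sketch.
    """
    vectorizer = CountVectorizer(
        analyzer="char",          # character-level n-grams
        ngram_range=ngram_range,  # assumed range, see the note above
        lowercase=False,          # keep the raw text untouched
    )
    classifier = OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000))
    return make_pipeline(vectorizer, classifier)


model = build_model()
model.fit(authors, varieties)
print(model.predict(["eu estou a ver o comboio"]))
```

Hashtags and URLs are deliberately left in the toy texts, so that they contribute n-grams like any other characters.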

Results and Data Visualization

In this section we describe our results under cross-validation and on the test set. Next, we present an embedding of the data in a low-dimensional space. Finally, we discuss the feature importances of our classifier.

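The evaluation metric for this task is accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$

where $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives.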
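As a small worked illustration of the accuracy formula, the sketch below computes it in two equivalent ways for a binary case: as the fraction of correctly labelled objects, and from the four confusion counts. The labels and predictions are invented, and the function names are hypothetical helpers introduced only for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of objects whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy expressed through the confusion counts."""
    return (tp + tn) / (tp + fp + tn + fn)


# Invented example with "male" treated as the positive class.
y_true = ["female", "male", "male", "female", "male"]
y_pred = ["female", "male", "female", "female", "female"]

# Confusion counts for the example above.
tp = sum(t == "male" and p == "male" for t, p in zip(y_true, y_pred))
tn = sum(t == "female" and p == "female" for t, p in zip(y_true, y_pred))
fp = sum(t == "female" and p == "male" for t, p in zip(y_true, y_pred))
fn = sum(t == "male" and p == "female" for t, p in zip(y_true, y_pred))

print(accuracy(y_true, y_pred), accuracy_from_counts(tp, tn, fp, fn))
```

For subtasks with more than two classes, the first form (fraction of correct predictions) applies unchanged.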

We evaluated the quality of the gender and language variety subtasks separately using five-fold cross-validation. The results can be found in Table 3. It was also interesting to see how the data is arranged in the feature space. To do so, we used modern dimensionality reduction and data visualization techniques. We chose t-SNE [2], since it is reported to be fast when the number of objects is small and tends to preserve the local structure of the data well. In addition, the Python scikit-learn [5] implementation of the algorithm supports sparse matrices as input. An example for Portuguese authors is shown in Figure 3. Unfortunately, the axes of the resulting embedding have no clear interpretation.
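The evaluation and visualization steps above might look roughly like this in scikit-learn. It is only a sketch: the toy corpus is synthetic, the n-gram range and the t-SNE settings (`perplexity=3`, `method="exact"`, `init="random"`) are assumptions chosen so that the example runs on ten objects with a sparse input matrix, and none of them are taken from the experiments reported here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic toy corpus: five "authors" per variety.
base_pt = "estou a ver o autocarro e o comboio"
base_br = "estou vendo o ônibus e o trem"
texts = [f"{base_pt} {i}" for i in range(5)] + [f"{base_br} {i}" for i in range(5)]
labels = ["portugal"] * 5 + ["brazil"] * 5

# Five-fold cross-validated accuracy. The vectorizer sits inside the
# pipeline, so the vocabulary is built on the training folds only.
pipeline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False),
    LogisticRegression(C=1.0, max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print("mean CV accuracy:", scores.mean())

# Two-dimensional t-SNE embedding of the sparse n-gram count matrix.
# PCA initialization is not available for sparse input in scikit-learn,
# so random initialization is used here.
counts = CountVectorizer(
    analyzer="char", ngram_range=(1, 3), lowercase=False
).fit_transform(texts)
embedding = TSNE(
    n_components=2,
    perplexity=3,        # must be smaller than the number of objects
    method="exact",
    init="random",
    random_state=0,
).fit_transform(counts.astype(np.float64))
print(embedding.shape)
```

The rows of `embedding` can then be drawn as a scatter plot coloured by class label.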

Feature importances

We investigated the absolute values of the coefficients of our model for Portuguese language variety. These values can be considered feature importances (Figure 4). The x axis is the position in the array of logistic regression coefficients sorted in descending order of absolute value; the y axis is the absolute value of the coefficient. One can see that, on the one hand, most feature coefficients have a fairly low magnitude, but on the other hand there is a group of features with relatively high importance.
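A ranking of this kind can be obtained from a fitted scikit-learn model as sketched below. The toy data and the n-gram range are invented for illustration; only the idea of sorting absolute coefficient values in descending order comes from the text above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration.
texts = [
    "eu estou a ver o autocarro no pequeno-almoço",
    "o comboio chegou e eu estou a trabalhar",
    "tu estás a falar com o miúdo no telemóvel",
    "eu estou vendo o ônibus no café da manhã",
    "o trem chegou e eu estou trabalhando",
    "você está falando com o garoto no celular",
]
labels = ["portugal", "portugal", "portugal", "brazil", "brazil", "brazil"]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False)
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, labels)

# For a binary problem coef_ has shape (1, n_features).
abs_coef = np.abs(clf.coef_[0])

# Indices of the features sorted by descending absolute coefficient.
order = np.argsort(abs_coef)[::-1]
sorted_importances = abs_coef[order]  # the y values of a plot like Figure 4

# Recover the n-gram string for every column index from the vocabulary.
ngrams = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked_ngrams = [ngrams[i] for i in order]

for ngram, weight in zip(ranked_ngrams[:5], sorted_importances[:5]):
    print(repr(ngram), round(float(weight), 4))
```

Printing the top-ranked n-grams gives at least a rough view of what the classifier relies on.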

Conclusion and Future Work

We explored a simple and robust method for gender and language variety classification in the PAN17 Author Profiling task. It turned out that high-order character n-grams are good features that are easy to generate and require no handcrafting or expert linguistic knowledge. The main disadvantage of such features is that error analysis is almost impossible. We trained a logistic regression classifier for both subtasks and evaluated it by accuracy. In future work, we will explore how adding even more n-grams affects quality.

Figure 1. Density distribution for Portuguese.

An example ROC curve for language variety classification of Portuguese is shown in Figure 2. FPR and TPR are the false positive rate and true positive rate, respectively, at various classification thresholds. We evaluated test scores via TIRA [6].
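To make the relation between thresholds, FPR and TPR concrete, the following sketch computes the points of a ROC curve with scikit-learn's `roc_curve`. The labels and scores are a tiny invented example; in practice the scores would be the classifier's predicted probabilities for the positive variety.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Invented example: 1 marks the positive variety, 0 the other one.
y_true = np.array([0, 0, 1, 1])
# Hypothetical classifier scores for the positive class.
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (FPR, TPR) point of the curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)

for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

Plotting `tpr` against `fpr` gives a curve of the kind shown in Figure 2.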

Figure 2. ROC curve for Portuguese language variety.

Figure 3. t-SNE data visualization for Portuguese.

Figure 4. Feature coefficients for Portuguese language variety.

Table 1. Languages and Varieties

| Language | Variety |
|---|---|
| Portuguese | Portugal, Brazil |
| English | Australia, Canada, Great Britain, Ireland, New Zealand, United States |
| Spanish | Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela |
| Arabic | Gulf, Levantine, Maghrebi, Egypt |

Table 3. Evaluation

| Language | CV gender acc. | CV variety acc. | Test gender acc. | Test variety acc. |
|---|---|---|---|---|
| Portuguese | 0.8025 | 0.9850 | 0.7988 | 0.9725 |
| English | 0.7918 | 0.7913 | 0.7875 | 0.8092 |
| Spanish | 0.7456 | 0.8892 | 0.7600 | 0.8989 |
| Arabic | 0.7263 | 0.7739 | 0.7213 | 0.7556 |

References

[1] DSL Shared Task 2016. 2016.

[2] L. van der Maaten, G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9, Nov 2008.

[3] S. Malmasi, M. Zampieri, N. Ljubešić, P. Nakov, A. Ali, J. Tiedemann. Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3). The COLING 2016 Organizing Committee, Osaka, Japan, December 2016.

[4] P. Modaresi, M. Liebeck, S. Conrad. Exploring the Effects of Cross-Genre Machine Learning for Author Profiling in PAN 2016 - Notebook for PAN at CLEF. In: K. Balog, L. Cappellato, N. Ferro, C. Macdonald (eds.) CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers. Évora, Portugal, Sep 2016.

[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2011.

[6] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, B. Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, E. Toms (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York, Sep 2014.

[7] M. Potthast, F. Rangel, M. Tschuggnall, E. Stamatatos, P. Rosso, B. Stein. Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation. In: G. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, N. Ferro (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17). Springer, Berlin Heidelberg New York, Sep 2017.

[8] F. Rangel, F. Celli, P. Rosso, M. Potthast, B. Stein, W. Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: L. Cappellato, N. Ferro, G. Jones, E. San Juan (eds.) CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers.

[9] F. Rangel, P. Rosso, I. Chugur, M. Potthast, M. Trenkmann, B. Stein, B. Verhoeven, W. Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: L. Cappellato, N. Ferro, M. Halvey, W. Kraaij (eds.) CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers. Sheffield, UK, Sep 2014.

[10] F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, G. Inches. Overview of the Author Profiling Task at PAN 2013. In: P. Forner, R. Navigli, D. Tufis (eds.) CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers. Valencia, Spain, Sep 2013.

[11] F. Rangel, P. Rosso, M. Potthast, B. Stein. In: L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs.

[12] F. Rangel, P. Rosso, M. Potthast, B. Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CLEF and CEUR-WS, Sep 2017.

[13] F. Rangel Pardo, P. Rosso, B. Verhoeven, W. Daelemans, M. Potthast, B. Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS, Sep 2016.