CIC-GIL Approach to Author Profiling in Spanish Tweets: Location and Occupation Ilia Markov1 , Helena Gómez-Adorno2 , Mónica Jasso-Rosales2 , and Grigori Sidorov1 1 Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC), Mexico City, Mexico imarkov@nlp.cic.ipn.mx, sidorov@cic.ipn.mx, 2 Universidad Nacional Autónoma de México (UNAM), Engineering Institute (II), Mexico City, Mexico hgomeza@iingen.unam.mx, mjassor@iingen.unam.mx Abstract. We present the CIC-GIL approach to the author profiling (AP) task at MEX-A3T 2018. The task consists of two subtasks: identi- fication of authors’ location (6-way) and occupation (8-way) in a corpus of Mexican Spanish tweets. We used the logistic regression algorithm trained on typed character n-grams, function-word n-grams, and region- alisms for location identification, and typed character n-grams with sev- eral modifications for occupation identification. Our best run showed F1-macro score of 73.63% for location and 48.94% for occupation iden- tification. The results are competitive with other participating teams; in particular, our best run was ranked fourth in the shared task. Keywords: author profiling, location identification, occupation identi- fication, social media, n-grams, Spanish, machine learning 1 Introduction Author profiling (AP) is the task of identifying the author’s demographics, such as age, gender, personality traits, native language, place of residence, and oc- cupation based on a sample of his or her writing. AP is useful for a variety of purposes, including security, marketing, and forensic applications. The interest in the AP task is maintained through the annual organization of the PAN evaluation campaign3 – one of the main fora regarding tasks related with authorship analysis. The author profiling task at MEX-A3T [1] focuses on Mexican Spanish and covers two aspects of author profiling, which have not been previously explored in any related competitions: place of residence (location) and occupation of the author. We approach the task from a machine-learning perspective, as a multi-class classification problem. Further, we briefly describe the dataset used in the com- petition, and then focus on the applied pre-processing steps, features, and the configuration of our system. 3 http://pan.webis.de Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) 2 Data The training corpus provided by the organizers consists of 3,470 documents4 , and is quite imbalanced in terms of both location and occupation. The corpus statistics is provided in Table 1. It shows the number of documents (No. of docs) and the percentage (%) for each class, as well as the average (Avg.), minimum (Min.), and maximum (Max.) document length measured in terms of characters (after applying pre-processing steps described below). A more detailed description of the corpus can be found in [1]. Table 1. Training corpus statistics. Class No. of docs (%) Avg. length Min. length Max. length Location Center 1,252 (36%) 114,088 15 583,086 Northeast 911 (26%) 60,163 25 372,015 Northwest 575 (17%) 53,580 68 268,186 West 314 (9%) 114,127 72 544,112 Southeast 312 (9%) 106,647 995 363,068 North 106 (3%) 92,592 999 310,205 Occupation Student 1,637 (47%) 51,889 15 296,659 Administrative 624 (18%) 128,477 109 360,812 Social 563 (16%) 121,248 121 583,086 Arts 239 (8%) 116,794 346 301,275 Sciences 182 (5%) 115,148 560 321,545 Health 105 (3%) 103,072 495 310,205 Others 75 (2%) 131,580 999 359,889 Sports 45 (1%) 98,753 999 197,539 3 Methodology 3.1 Pre-processing steps Pre-processing has proved to be a useful strategy for the AP task [2]. We applied several pre-processing steps in order to aid typed character n-gram features to capture relevant information: (i) We performed lowercasing. (ii) We replaced all digits by the same symbol (e.g., 2,123 → 0,000), since we are not interested in the actual number, while the frequency of digits and their length may provide useful information to the classifier. (iii) We replaced user mentions (@user), user hashtag mentions (#tag), picture links, and URL mentions by different symbols (@user → 1, #tag → 2, picture link → 3, URL → 4) in order to keep information about their occurrence and remove information about the exact user/tag/link/URL. For location, we reduced user mentions, hashtag mentions, picture links, and URL mentions to the same symbol (@user → 1, #tag → 1, picture link → 1, URL → 1). (iv) We replaced slang words by their standardized version from the Spanish social media lexicon, as proposed in [2]. 4 We removed 30 empty documents from the dataset. 98 Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) 3.2 Features Typed character n-grams Typed character n-grams, i.e., character n-grams classified into 10 categories based on affixes, words, and punctuation [3] have proved to be indicative features for other subtasks of author profiling, e.g., na- tive language identification [4], identification of gender and language variety, including when different varieties of Spanish are concerned [5, 6]. For location identification, we used all categories of typed character n-grams (n = 4). For occupation, we conducted an ablation study in order to identify the most indicative typed character n-gram categories. We found that the middle- punctuation n-grams, which capture the frequency of punctuation marks, did not contribute to the result, and therefore were excluded. Additional weight, on the contrary, was assigned to the middle-word n-gram features (the most indica- tive category based on the ablation study), that is, we triplicated middle-word n-gram features: we used three different features, e.g., ‘eatu-1’, ‘eatu-2’, ‘eatu-3’, instead of one feature ‘eatu’. Function-word n-grams Function words (FWs) belong to a set of closed- class words and represent relations rather than propositional content. Examples of function words include articles, prepositions, determiners, conjunctions, and auxiliary verbs. FW n-grams are composed of n consecutive FWs omitting all the tokens in between. We used the set of 313 Spanish function words from the Natural Language Toolkit (NLTK)5 and built FW n-grams (n = 2). FW n-gram features were used only for location identification. Regionalisms We developed a lexicon of regionalisms (words commonly used in a particular geographic area) used in three regions of Mexico: North, Center, and South. The lexicon contains 614 regionalisms obtained from the following websites:6 – http://lexiquetos.org/chilanguismos/ – http://www.multimedios.com/telediario/tendencias/diccionario-basico-regio.html – http://www.chicaregia.com/2006/06/diccionario-vocabulario-regio/ – http://www.sobrino.net/Dzidzantun/d_yuc.htm We used the regionalisms from the lexicon as features for location identifica- tion. 3.3 Experimental setup Classifier We used the scikit-learn [7] implementation of the logistic regression (LR) algorithm, which showed higher results than other machine-learning algo- rithms we examined: SVM and multinomial Naive Bayes. 5 http://www.nltk.org 6 The lexicon is available upon request. 99 Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) Weighting We used term frequency (tf) weighting scheme, i.e., the number of times a term occurs in a document. Other weighting schemes we examined – binary, tf–idf, and log-entropy – deteriorated our results. Threshold In our primary run, we considered only those features that occur in at least 15 documents in the entire corpus and that occur in at least two doc- uments in the corpus. In our secondary run, we did not use any threshold and considered only those features that occur in at least two documents in the corpus. Evaluation For the evaluation of our system, we conducted experiments un- der 5-fold cross-validation on the training corpus measuring the results in terms of F1-macro score (the official metric). 4 Results The 5-fold cross-validation results (F1-macro score, %) for our two runs on the training data, as well as the official results on the test set for our team and the best results in the shared task are shown in Table 2. The number of features (No.) for each run, as well as the majority and bag-of-words (BoW) baselines are also provided. Our secondary run (without threshold) showed higher results on the test set for both location and occupation: 73.63% and 48.94%, respectively. The obtained results are much higher than the baselines. Concerning the contribution of the regionalisms to location identification, in our experiments on the training data these features contributed about 1.5% to the overall 5-fold cross-validation F1-score. Table 2. 5-fold cross-validation results on the training corpus (Train) and the official results on the test set (Test) in terms of F1-macro (%) for identification of location and occupation. The best results in the competition and our highest results are in bold typeface. Location Occupation Average Train Test No. Train Test No. Train Test Shared task best – 83.88 – – 51.22 – – 67.12 Majority baseline 8.84 – – 8.01 – – 8.43 – BoW baseline 57.85 – 2,180,163 35.96 – 2,180,163 46.91 – Primary run 72.90 73.10 175,088 45.46 47.27 306,647 59.18 60.19 Secondary run 71.84 73.63 666,078 44.21 48.94 1,382,696 58.03 61.29 5 Conclusions We presented the CIC-GIL approach for identification of location and occupa- tion in a corpus composed of Twitter messages in Mexican Spanish. Our simple 100 Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) approach achieved higher results than the baseline: 73.63% F1-macro score for location and 48.94% for occupation on the test set, and was placed fourth in the shared task. One of the directions for future work would be to use doc2vec document embeddings. This strategy has proved to be useful for the AP task [8]. We will also examine other approaches on this dataset, such as the statistical-based approach described in [9]. References 1. Álvarez-Carmona, M.Á., Guzmán-Falcón, E., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L., Reyes-Meza, V., Rico-Sulayes, A.: Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets. In: Notebook Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL), Seville, Spain, September. (2018) 2. Gómez-Adorno, H., Markov, I., Sidorov, G., Posadas-Durán, J., Sanchez-Perez, M.A., Chanona-Hernandez, L.: Improving feature representation based on a neural network for author profiling in social media texts. Computational Intelligence and Neuroscience 2016 (2016) 13 pages 3. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n- grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies, Denver, CO, USA, ACL (2015) 93–102 4. Markov, I., Chen, L., Strapparava, C., Sidorov, G.: CIC-FBK approach to native language identification. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, ACL (2017) 374–381 5. Markov, I., Gómez-Adorno, H., Sidorov, G.: Language- and subtask-dependent feature selection and classifier parameter tuning for author profiling. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. Volume 1866 of CEUR Workshop Proceedings., Dublin, Ireland, CLEF and CEUR-WS.org (2017) 6. Gómez-Adorno, H., Markov, I., Baptista, J., Sidorov, G., Pinto, D.: Discriminating between similar languages using a combination of typed and untyped character n-grams and words. In: Proceedings of the 4th Workshop on NLP for Similar Languages, Varieties and Dialects, ACL (2017) 137–145 7. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011) 2825–2830 8. Markov, I., Gómez-Adorno, H., Posadas-Durán, J., Sidorov, G., Gelbukh, A.: Author profiling with doc2vec neural network-based document embeddings. In: Proceedings of the 15th Mexican International Conference on Artificial Intelligence, MICAI 2016. Volume 10062., Cancún, Mexico, Part II, LNAI, Springer (2017) 117–131 9. Markov, I., Gómez-Adorno, H., Sidorov, G., Gelbukh, A.: The winning approach to cross-genre gender identification in Russian at RUSProfiling 2017. In: FIRE 2017 Working Notes. Volume 2036 of FIRE’17., Bangalore, India, CEUR-WS.org (2017) 20–24 101