UO_4to @ TAG-it 2020: Ensemble of Machine Learning Methods

Maria Fernanda Artigas Herold
Computer Science Department, Universidad de Oriente, Santiago de Cuba, Cuba
nanda.ah@nauta.cu

Daniel Castro Castro
Computer Science Department, Universidad de Oriente, Santiago de Cuba, Cuba
danielcc@uo.edu.cu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper describes the proposal presented in the TAG-it author profiling task at EVALITA 2020 for sub-task 1. The main objective is to predict the gender and age of blog users from their posts, as well as the topic they wrote about. Our proposal uses an ensemble of machine learning algorithms with three widely used classifiers and a character n-gram language model represented as a Bag of Words. To face this task we presented two different strategies aimed at finding the best possible results.

1 Introduction

With the growing development of technology and the frequent use of new forms of interaction and communication, Internet users spend more time sharing their ideas, thoughts, feelings and interests through social networks for diverse purposes: personal business, self-expression, socialization, science, commerce, etc. On social media people often share their personal data, contact information, jobs, opinions and, in general, very useful information that can be used for research on people's behavior, for developing marketing strategies and political campaigns, for various forensic applications, and for determining certain demographic attributes of a person such as age, sex, personality traits, geographic origin and even occupation.

Precisely, one of the purposes of Natural Language Processing (NLP) research is to analyze the information obtained from users in order to create systems capable of extracting significant characteristics and improving the automatic understanding of written text.

Author Profiling (AP) is the branch of NLP that analyzes information to determine demographic aspects of an author, such as age and gender, given a set of documents presumably written by that author; more recently, aspects such as personality and occupation have also been included. The increasing integration of social media into people's daily lives has made it a rich source of textual data for author profiling, since data can be mined from the web, including emails and blogs. There are still limitations in using social media as a data source, however, because the data obtained may not always be reliable or accurate: users often provide false information about themselves, which hinders the task.

Document classification, also known as text tagging, is currently one of the most important subtasks of Text Mining and NLP. The general idea is to automatically assign to a document one or more classes or categories from a set of predefined tags, using machine learning algorithms based on its content. Documents may be classified according to subject, author or any other class of interest to the research, such as age and gender.

There is an evaluation framework recognized by the community, known as PAN (https://pan.webis.de/), which encompasses authorship detection, author profiling and sentiment analysis, among others. On this platform, people can present and share their work, find out about the topics covered in previous works and participate in the tasks that are proposed each year for the community.

In 2019, the PAN@CLEF evaluation forum (Rangel and Rosso, 2019) presented the Bots and Gender author profiling task, whose objective was to determine whether a Twitter feed, in Spanish or English, had been written by a robot or a human and, in the case of a human, to determine the author's gender as well. For this task, the organizers proposed a set of baselines based on character and word n-gram representations with vocabulary reduction, varying the parameters over a few configurations.

Another forum where author profiling has been addressed is MexA3T (https://sites.google.com/view/mex-a3t/), a forum separate from PAN and devoted to Spanish variants, which generally works with the analysis of Mexican tweets. In 2019, the MexA3T task for Author Profiling and Aggressiveness Analysis on Mexican tweets (Aragón, 2019) was proposed as a follow-up to the task proposed in 2018 (Álvarez, 2018). The AP task comprises the detection of the Place of Residence, Occupation and Gender of a user based on the set of tweets written by that user. User profiles were distributed with not only the text of the tweets but also images.

Several authors base their approaches on feature engineering and traditional machine learning classifiers. Previous works have proposed methods based on content features (bag of words, word n-grams, term vectors, dictionary words); feature reduction (Castro, 2019), where the most used technique has been selecting a subset of the most frequent features; stylistic features (frequencies, punctuation, POS, Twitter-specific elements, slang words); and neural networks (CNN, LSTM) (Valdez, 2019).

2 TAG-it 2020

Although Text Mining and NLP tasks focus largely on the most widely used languages, such as English and Spanish, other languages are also widely covered in several important forums. EVALITA (http://www.evalita.it/), active since 2007, is a platform that promotes NLP tasks specifically for the Italian language, providing a shared framework in which different systems and approaches can be evaluated in a consistent manner.

This year, TAG-it: Topic, Age and Gender Prediction for Italian at EVALITA (Cimino, 2020) proposes three different AP sub-tasks. The first (subtask 1) aims to predict gender, age (as an age range, e.g. 30-39) and the topic addressed by the author, all three at once, given a collection of documents written by him/her in a blog. The second (subtask 2a) predicts gender only, and the third (subtask 2b) predicts age only.

For this task, a training corpus composed of texts written by blog users was provided, where each user has multiple posts. The information per user varies in length and quantity, and the data is unbalanced for each class, which makes training the classification models harder.

2.1 Our method

Given the corpus provided, our proposal focuses on classifying documents using a Bag of Words representation of character n-grams, feature reduction to a predefined number of features, and an ensemble of machine learning algorithms: Random Forest, Support Vector Machine (SVM) and Nearest Centroid classifiers (see Figure 1). We also consider TF or TF-IDF as the feature weight.

We participated in subtask 1, where we presented two different strategies. In the first, we tune the parameters (n for the n-gram size, k for feature reduction, and whether or not TF-IDF is computed) for the classification of each profile dimension (gender, age and topic) independently, using for each one the configuration that gave the best results in the individual classification. In the second, we tune a general set of parameters and use the same configuration for the classification of all three dimensions.

Figure 1. Ensemble architecture representation.

To represent the documents in a Bag of Words (BoW) model, we segment and preprocess the corpus and construct, for each document, a vector of character n-grams ordered from highest to lowest frequency in the text. The parameters that we set for each configuration were the character n-gram size n, from 1 to 5 characters, and the number of features kept after feature reduction: 100, 500 or 1000. For weighting the features we considered TF or TF-IDF, depending on the case. TF is defined as follows:

tf_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}

and the TF-IDF value is defined as:

w_{ij} = tf_{ij} \times \log\left(\frac{N}{df_i}\right)

where n_{ij} is the number of occurrences of token i in document j, tf_{ij} is the frequency of token i in document j, df_i is the number of documents that contain token i, and N is the total number of documents per user.

For the machine learning algorithms we used the implementations available in the Python sklearn library: RandomForestClassifier, NearestCentroid and OneVsOneClassifier are the three classifiers used in the ensemble.

To determine the final class of a set of documents with the ensemble of classifiers, we use majority voting, which assigns the class predicted by the largest number of classifiers.

For validation we use StratifiedKFold from the sklearn.model_selection module to perform 5-fold stratified cross-validation on the training corpus, which is split into train and test parts in order to evaluate the effectiveness of the system. As evaluation metrics in the first run, we use the F1 score for the Topic and Age dimensions and the Accuracy score for Gender, both from the sklearn library. For the second run we use the two rankings proposed in the task to evaluate the participants: ranking 1, which evaluates the performance of each system with a partial scoring scheme, giving 1/3 of the points for each correctly predicted dimension and 0 points if none is correct; and ranking 2, which gives 1 point only if all classes are correctly predicted and 0 otherwise.

3 Experiments and Results

The test dataset provided by the task organizers was similar to the training corpus (which was unbalanced, especially for the gender class, with a predominance of male users) and was composed of posts from 411 different users with unknown age, gender and topic classes.

To obtain the best possible results with our method, we ran several experiments varying the parameter values in order to determine a good configuration per class. At the end of the experimentation process, we chose two runs to submit. The first one (Team2_1_1, see Table 1) has a different configuration per class, according to the best result obtained in the individual classification. The Age class is represented with character 2-grams, feature reduction to 1000 features and TF-IDF as the feature weight. The Gender class is represented with character 4-grams, feature reduction to 1000 features and TF as the feature weight. The Topic class is represented with character 4-grams, feature reduction to 1000 features and TF-IDF as the feature weight.

With stratified k-fold cross-validation, the individual evaluation per class gave 0.3732, 0.8854 and 0.7051 for age, gender and topic respectively.

In the second run (Team2_1_2, see Table 1), we set the parameters to be the same for the three classes and used a single configuration for all of them: character 4-grams, feature reduction to 1000 features and TF as the feature weight. Evaluating the second run in our validation process with the two metrics given on the TAG-it page, we obtained 0.6801 and 0.2914 for Metric 1 and Metric 2 respectively.

Run          Metric 1   Metric 2
Team1_1_3    0.6991     0.2506
Team1_2_3    0.6739     0.2433
Team1_3_3    0.6991     0.2506
Team2_1_1    0.4160     0.0924
Team2_1_2    0.4436     0.0924
Team3_1_1    0.6626     0.2530
Team3_1_2    0.7177     0.3090
Team3_1_3    0.7347     0.3309

Table 1. Competition results for subtask 1.

The results obtained were not as good as expected compared with those obtained in our validation process, given that the character n-gram representation obtained low scores for topic and age classification.

4 Conclusion

In this paper we described the proposal presented in the TAG-it author profiling task at EVALITA 2020. Our proposal is based on an ensemble of machine learning algorithms with three well-known classifiers and a Bag of Words of character n-grams, using feature reduction to a predefined number of features and TF or TF-IDF as the feature weight. To solve subtask 1 we proposed two different strategies: in the first, we tune the parameters (n for n-grams, k for feature reduction and TF or TF-IDF for the feature weight) for the classification of each profile dimension independently, using a different configuration for each one; in the second, we tune a general set of parameters and use the same configuration for the classification of all three dimensions at once.

Although we obtained better scores in our own evaluation process, the results in the task were not as good as expected, since low results were obtained for the topic and gender dimensions.

References

Francisco M. Rangel Pardo, Paolo Rosso (2019). "Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter". In CLEF 2019 (Working Notes).

Mario Ezra Aragón, Miguel Ángel Álvarez Carmona, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor Pineda, Daniela Moctezuma (2019). "Overview of MEX-A3T at IberLEF 2019: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets". In IberLEF@SEPLN 2019, pp. 478-494.

Miguel Á. Álvarez-Carmona, Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor-Pineda, Verónica Reyes-Meza, Antonio Rico-Sulayes (2018). "Overview of MEX-A3T at IberLEF 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets". In IberLEF@SEPLN 2018.

J. E. Valdez-Rodríguez, H. Calvo, E. M. Felipe-Riverón (2019). "Author profiling from images using 3D convolutional neural networks". In Proceedings of the First Workshop for Iberian Languages Evaluation Forum (IberLEF 2019), CEUR WS Proceedings.

Daniel Castro Castro, Maria Fernanda Artigas Herold, Reynier Ortega Bueno, Rafael Muñoz (2019). "Cerpamid-UA at MexA3T 2019: Transition Point Proposal".

A. Cimino, F. Dell'Orletta, M. Nissim (2020). "TAG-it@EVALITA2020: Overview of the Topic, Age, and Gender Prediction Task for Italian". In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian.
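As an illustration of the character n-gram representation and the TF and TF-IDF weights defined in Section 2.1, the following is a minimal pure-Python sketch and not the code used for the submitted runs. The function names and toy documents are hypothetical, and selecting the k most frequent n-grams over the whole corpus is an assumption about how the feature reduction could be done.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_k_ngrams(docs, n, k):
    """Feature reduction (assumed corpus-wide): keep the k most frequent n-grams."""
    counts = Counter(g for doc in docs for g in char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(k)]


def tf(text, n):
    """Relative frequency tf_ij = n_ij / sum_k n_kj for one document."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def tf_idf(docs, n):
    """Weights w_ij = tf_ij * log(N / df_i) for every document in `docs`."""
    tfs = [tf(doc, n) for doc in docs]
    # df_i: number of documents that contain n-gram i
    df = Counter(gram for doc_tf in tfs for gram in doc_tf)
    total_docs = len(docs)
    return [
        {g: value * math.log(total_docs / df[g]) for g, value in doc_tf.items()}
        for doc_tf in tfs
    ]


# Toy documents (hypothetical, not taken from the TAG-it corpus).
docs = ["ciao ciao", "ciao bella"]
weights = tf_idf(docs, n=2)
```

An n-gram that occurs in every document, such as "ci" here, receives a TF-IDF weight of zero, while an n-gram found in only one document keeps a positive weight.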
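The majority voting step described in Section 2.1 can be sketched as follows. This is an illustrative example on hypothetical predictions; the paper does not state how ties are resolved, so breaking ties in favour of the first classifier in the list is an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers.

    Ties go to the earliest classifier in the list. This tie-breaking rule
    is an assumption; the paper does not specify one.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:  # first label that reaches the top count wins
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_predictions):
    """Combine one prediction list per classifier into one label per user."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical gender predictions of three classifiers for four users.
random_forest = ["M", "F", "M", "M"]
svm = ["M", "F", "F", "F"]
nearest_centroid = ["F", "F", "M", "F"]
combined = ensemble_predict([random_forest, svm, nearest_centroid])
```

With three classifiers and a binary class such as gender a tie cannot occur, but for age and topic all three classifiers may disagree, which is where the tie-breaking rule matters.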
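The two rankings used to evaluate the second run (Section 2.1) can be sketched in the same way. The scoring per user follows the description in the text; averaging the points over users, the dictionary layout and the example labels are assumptions made for illustration.

```python
DIMENSIONS = ("gender", "age", "topic")


def metric_1(gold, predicted):
    """Partial score: 1/3 point per correctly predicted dimension, averaged over users."""
    points = [
        sum(g[d] == p[d] for d in DIMENSIONS) / len(DIMENSIONS)
        for g, p in zip(gold, predicted)
    ]
    return sum(points) / len(points)


def metric_2(gold, predicted):
    """Strict score: 1 point only if all three dimensions are correct, averaged over users."""
    points = [
        1.0 if all(g[d] == p[d] for d in DIMENSIONS) else 0.0
        for g, p in zip(gold, predicted)
    ]
    return sum(points) / len(points)


# Hypothetical gold labels and predictions for two users.
gold = [
    {"gender": "M", "age": "30-39", "topic": "SPORTS"},
    {"gender": "F", "age": "20-29", "topic": "ANIME"},
]
predicted = [
    {"gender": "M", "age": "30-39", "topic": "SPORTS"},  # all three correct
    {"gender": "F", "age": "30-39", "topic": "SPORTS"},  # only gender correct
]
score_1 = metric_1(gold, predicted)
score_2 = metric_2(gold, predicted)
```

Under these assumptions Metric 2 can never exceed Metric 1, which is consistent with every row of Table 1.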