UO_4to @ TAG-it 2020: Ensemble of Machine Learning Methods


Maria Fernanda Artigas Herold
Computer Science Department, Universidad de Oriente, Santiago de Cuba, Cuba
nanda.ah@nauta.cu

Daniel Castro Castro
Computer Science Department, Universidad de Oriente, Santiago de Cuba, Cuba
danielcc@uo.edu.cu


                    Abstract

    This paper describes the proposal presented in the TAG-it author profiling task at EVALITA 2020 for sub-task 1. The main objective is to predict the gender and age of blog users from their posts, as well as the topic they wrote about. Our proposal uses an ensemble of machine learning algorithms with three of the most widely used classifiers and a language model of character n-grams represented as a Bag of Words. To face this task we presented two different strategies aimed at finding the best possible results.

1    Introduction

With the growing development of technology and the frequent use of new forms of interaction and communication, Internet users spend more time sharing their ideas, thoughts, feelings and interests through social networks for diverse purposes: personal business, self-expression, socialization, science, commerce, etc. In social media people often share their personal data, contact information, jobs, opinions and, in general, very useful information that can be used for research on people's behavior, for the development of marketing strategies and political campaigns, to serve various forensic applications, and to determine certain demographic attributes of a person such as age, sex, personality traits, geographic origin and even occupation.

  Precisely, one of the purposes of Natural Language Processing (NLP) research is to analyze the information obtained from users to create systems capable of extracting significant characteristics and improving the automatic understanding of written text.

  Author Profiling (AP) is the branch of NLP that analyzes information to determine several demographic aspects of an author, such as age and gender, given a set of documents presumably written by him or her; recently, aspects such as personality and occupation have also been included. The increased integration of social media in people's daily lives has made them a rich source of textual data for author profiling, since data can be mined from the web, including emails and blogs. There are still limitations in using social media as a data source, because the data obtained may not always be reliable or accurate: users often provide false information about themselves, which hinders the task.

  Document classification, also known as text tagging, is currently one of the most important subtasks of Text Mining and NLP. The general idea is to automatically assign to a document one or more classes or categories from a set of predefined tags, using machine learning algorithms based on its content. Documents may be classified according to subject, author or any other class of interest in the research, such as age and gender.

  There is an evaluation framework recognized by the community, known as PAN¹, which encompasses authorship detection, author profiling and sentiment analysis, among others. On this platform, people can present and share their work, find out about the topics covered in previous works and participate in the tasks that are proposed each year for the community.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

¹ https://pan.webis.de/

   In 2019, the PAN@CLEF evaluation forum (Rangel and Rosso, 2019) presented the Bots and Gender author profiling task, whose objective was to determine whether the author of a Twitter feed, in Spanish or English, was a bot or a human and, in the case of a human, to determine the gender as well. For this task, the organizers proposed a set of baselines with character and word n-gram representations and a vocabulary reduction, varying the parameters over a few configurations.

   Another forum where author profiling has been addressed is MexA3T², a venue separate from PAN that focuses on Spanish variants and generally works with the analysis of Mexican tweets. In 2019, the MexA3T task for Author Profiling and Aggressiveness analysis focused on Mexican tweets (Aragón, 2019) was proposed as a follow-up to the task proposed in 2018 (Álvarez, 2018). The AP task comprises the detection of the place of residence, occupation and gender of a user profile based on the set of tweets written by that user. User profiles were distributed not only with the text of the tweets; images were also included in the profiles.

   Several authors base their approaches on feature engineering and traditional machine learning classifiers. Previous works have proposed methods that rely on content-based features (bag of words, word n-grams, term vectors, dictionary words), feature reduction (Castro, 2019), where the most used technique has been the selection of a subset of the most frequent features, stylistic features (frequencies, punctuation, POS, Twitter-specific elements, slang words) and approaches based on neural networks (CNN, LSTM) (Valdez, 2019).

2      TAG-it 2020

Although Text Mining and NLP tasks focus heavily on the most used languages, such as English and Spanish, other languages are also widely covered in several important forums. EVALITA³ is a platform that promotes NLP tasks specifically for the Italian language, providing a shared framework where different systems and approaches can be evaluated in a consistent manner; it has been running since 2007.

   This year, TAG-it: Topic, Age and Gender Prediction for Italian at EVALITA (Cimino, 2020) proposes three different AP sub-tasks. The first one (subtask1) aims to predict gender, age (as an age range, e.g. 30-39) and the topic treated by the author, all three at once, given a collection of documents written by him or her in a blog. The second one (subtask2a) predicts gender only, and the third one (subtask2b) predicts age only.

   For this task, a training corpus composed of texts written by blog users was provided, where each user has multiple posts. The information per user varies in length and quantity, and the data is unbalanced for each class, which makes training classification models harder.

2.1   Our method

   Given the corpus provided, our proposal focuses on classifying documents using a Bag of Words representation of character n-grams, a feature reduction to a predefined number of features and an ensemble of machine learning algorithms: Random Forest, Support Vector Machine (SVM) and Nearest Centroid classifiers (see Figure 1). We also consider TF or TF-IDF as the feature weight.

   We participate in subtask1, where we present two different strategies. In the first, we adjust the parameters n (n-gram size), k (number of features kept by the feature reduction) and the use of TF or TF-IDF for the classification of each profile (age, gender and topic) independently, using for each one the configuration that obtained the best result in the individual classification. In the second, we adjust one general set of parameters and use the same configuration for the classification of all three profiles.
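
As a rough illustration of the representation described in this section, the following sketch builds character n-grams, keeps the k most frequent ones and weights them with TF or TF-IDF, using only the Python standard library. The function names, the toy posts, the natural logarithm and the choice to rank n-grams by corpus-wide frequency are assumptions made for the example; it is not the implementation used in our runs.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(documents, n, k):
    """Feature reduction: keep the k most frequent n-grams over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(k)]


def tf_vector(document, vocabulary, n):
    """TF: occurrences of each kept n-gram divided by all n-grams in the document."""
    counts = Counter(char_ngrams(document, n))
    total = sum(counts.values())
    return [counts[g] / total if total else 0.0 for g in vocabulary]


def tfidf_vectors(documents, vocabulary, n):
    """TF-IDF: each TF value multiplied by log(N / df)."""
    n_docs = len(documents)
    gram_sets = [set(char_ngrams(d, n)) for d in documents]
    # df[g] >= 1 because the vocabulary is built from these same documents.
    df = {g: sum(g in s for s in gram_sets) for g in vocabulary}
    return [
        [tf * math.log(n_docs / df[g])
         for tf, g in zip(tf_vector(d, vocabulary, n), vocabulary)]
        for d in documents
    ]


# Toy documents (hypothetical, not taken from the TAG-it corpus).
posts = ["ciao a tutti", "ciao mondo"]
vocab = build_vocabulary(posts, n=2, k=5)
tf = [tf_vector(p, vocab, n=2) for p in posts]
tfidf = tfidf_vectors(posts, vocab, n=2)
```

Changing n and k in the calls above is all that is needed to try another configuration. An n-gram that occurs in every document receives a TF-IDF weight of zero, which is one reason the choice between TF and TF-IDF can matter for short texts.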


² https://sites.google.com/view/mex-a3t/
³ http://www.evalita.it/

Figure 1. Ensemble architecture representation.

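
To make the ensemble idea concrete, the sketch below shows one way the labels produced by three classifiers could be merged into a single label per user by majority vote. The function names, the example predictions and the tie-breaking rule (the first label seen wins) are assumptions for illustration only, not a description of the submitted system.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers.

    Assumption: on a tie, the label that appears first in `predictions` wins.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine one list of labels per classifier into one label per user."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical gender predictions from three classifiers for four users.
random_forest = ["M", "F", "M", "M"]
svm = ["M", "F", "F", "M"]
nearest_centroid = ["F", "F", "M", "M"]

combined = ensemble_predict([random_forest, svm, nearest_centroid])
```

With three classifiers and two labels a tie cannot occur, but with more than two labels (as for age ranges or topics) all three classifiers may disagree, so some tie-breaking rule has to be chosen.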
   To represent the documents in a Bag of Words (BoW) model, we segment and preprocess the corpus and, for each document, construct a vector of character n-grams ordered from highest to lowest by their frequency in the text. The parameters that we established for each configuration were the character n-gram size n, from 1 to 5 characters, and the number of features kept by the feature reduction: 100, 500 or 1000. For the weighting of the elements we considered TF or TF-IDF, depending on the case. TF is defined as follows:

    $$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}}$$

   The TF-IDF value is defined as:

    $$w_{ij} = tf_{ij} \times \log\left(\frac{N}{df_i}\right)$$

   Here $n_{ij}$ is the number of occurrences of token $i$ in document $j$, $tf_{ij}$ is the frequency of token $i$ in document $j$, $df_i$ is the number of documents that contain token $i$ and $N$ is the total number of documents per user.

   For the machine learning algorithms we used the implementations available in the Python sklearn library: RandomForestClassifier, NearestCentroid and OneVsOneClassifier are the three classifiers used in the ensemble.

   To determine the definitive class of a set of documents with the ensemble of classifiers, we use majority voting: the class assigned is the one predicted by the largest number of classifiers.

   For the validation process we use StratifiedKFold from the sklearn.model_selection module to perform a stratified 5-fold cross-validation on the training corpus, which is divided into train and test parts so that the effectiveness of the system can be evaluated. As evaluation metrics in the first run we use the F1 score for the Topic and Age dimensions and the Accuracy score for Gender, both from the sklearn library. For the second run we use the two rankings proposed in the task to evaluate the participants: ranking 1, which evaluates the performance of each system using a partial scoring scheme, giving 1/3 of the points for each correctly predicted profile and 0 points if none is correct; and ranking 2, which gives 1 point only if all classes are correctly predicted and 0 otherwise.

3     Experiments and Results

The test dataset provided by the task organizers was similar to the training corpus (which was unbalanced, especially for the gender class, with a predominance of male users). It was composed of posts from 411 different users with unknown age, gender and topic classes.

   To obtain the best possible results with our method, we carried out several experiments varying the values of the parameters in order to determine a good configuration per class. At the end of the experimentation process, we chose two different runs to submit. The first one (Team2_1_1, see Table 1) has a different configuration per class, according to the best result obtained in the individual classification. The Age class is represented with character 2-grams, a feature reduction to 1000 features and TF-IDF as the feature weight. The Gender class is represented with character 4-grams, a feature reduction to 1000 features and TF as the feature weight. The Topic class is represented with character 4-grams, a feature reduction to 1000 features and TF-IDF as the feature weight.

   Using stratified k-fold cross-validation, the individual evaluation per class yields 0.3732, 0.8854 and 0.7051 for age, gender and topic respectively.

   In the second run (Team2_1_2, see Table 1), we adjusted the parameters to be the same for the three classes and used a single configuration for all of them: character 4-grams, a feature reduction to 1000 features and TF as the feature weight.

   Using the two metrics given on the TAG-it page, we evaluated the second run and obtained 0.6801 and 0.2914 for Metric 1 and Metric 2 respectively.

Run             Metric 1        Metric 2
Team1_1_3       0.6991          0.2506
Team1_2_3       0.6739          0.2433
Team1_3_3       0.6991          0.2506
Team2_1_1       0.4160          0.0924
Team2_1_2       0.4436          0.0924
Team3_1_1       0.6626          0.2530
Team3_1_2       0.7177          0.3090
Team3_1_3       0.7347          0.3309
  Table 1. Competition results for subtask 1.

   The results obtained were not as good as expected compared with those obtained in the validation process that we carried out, considering that
the character n-gram representation obtained low scores for topic and age classification.

4    Conclusion

In this paper we described the proposal presented to participate in the TAG-it author profiling task at EVALITA 2020. Our proposal is based on an ensemble of machine learning algorithms with three well-known classifiers and a Bag of Words of character n-grams, using a feature reduction to a predefined number of features and TF or TF-IDF as the feature weight.

  To solve subtask 1 we proposed two different strategies. In the first, we adjust the values of the parameters n for n-grams, k for feature reduction and TF or TF-IDF for the feature weight for the classification of each profile independently, using a different configuration for each one. In the second, we adjust one general set of parameters and use the same configuration for the classification of the three profiles at once.

  Although we obtained better scores in the evaluation process we carried out, the results of the task were not as good as expected, since low results were obtained for the topic and gender dimensions.

                   References

Francisco M. Rangel Pardo, Paolo Rosso: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter. CLEF (Working Notes) 2019

Mario Ezra Aragón, Miguel Ángel Álvarez Carmona, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor Pineda, Daniela Moctezuma: Overview of MEX-A3T at IberLEF 2019: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets. IberLEF@SEPLN 2019: 478-494

Miguel Á. Álvarez-Carmona, Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor-Pineda, Verónica Reyes-Meza, Antonio Rico-Sulayes: Overview of MEX-A3T at IberLEF 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets. IberLEF@SEPLN 2018

Valdez-Rodríguez, J.E., Calvo, H., Felipe-Riverón, E.M.: Author profiling from images using 3d convolutional neural networks. In: Proceedings of the First Workshop for Iberian Languages Evaluation Forum (IberLEF 2019), CEUR WS Proceedings (2019)

Daniel Castro Castro, Maria Fernanda Artigas Herold, Reynier Ortega Bueno, Rafael Muñoz: Cerpamid-UA at MexA3T 2019: Transition Point Proposal.

Cimino A., Dell'Orletta F., Nissim M. (2020). "TAG-it@EVALITA2020: Overview of the Topic, Age, and Gender prediction task for Italian". In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian.