UOBIT @ TAG-it: Exploring a Multi-faceted Representation for Profiling Age, Topic and Gender in Italian Texts

Roberto Labadie Tamayo, Daniel Castro Castro and Reynier Ortega Bueno
Computer Science Department, University of Oriente, Santiago de Cuba, Cuba
roberto.labadie@estudiantes.uo.edu.cu, {danielcc, reynier}@uo.edu.cu

Abstract

English. This paper describes our system for participating in the TAG-it Author Profiling task at EVALITA 2020. The task aims to predict the age and gender of blog users from their posts, as well as the topic they wrote about. Our proposal combines representations learned by RNNs at word and sentence level, Transformer neural networks, and hand-crafted stylistic features. All these representations are mixed and fed into a fully connected layer of a feed-forward neural network in order to make predictions for the addressed subtasks. Experimental results show that our model achieves encouraging performance.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The growing integration of social media into people's daily lives has made this medium a common environment for deploying technologies that retrieve information useful for business activities, social outreach processes, forensic tasks, etc. People frequently upload and share content in these media for various purposes, such as sharing points of view about some topic or promoting a personal business. The analysis of the textual information contained in such data is one of the main reasons why this line of research has become a trend in the Natural Language Processing (NLP) field.

However, this information varies greatly in format, even when it comes from the same person, and textual sequences are unstructured, which makes analyzing them automatically challenging. The Author Profiling (AP) task aims at discovering marks or patterns (linguistic or not) in texts that allow a user to be characterized in terms of age, gender, personality or any other demographic attribute.

Because of the applicability of AP, many forums organize shared tasks aimed at mining features that predict this valuable information. These tasks commonly focus on widely spoken languages such as English and Spanish. Nevertheless, other languages are explored in important forums too; that is the case of EVALITA (http://www.evalita.it/), which promotes NLP tasks for the Italian language. Among the challenges of its previous campaign, EVALITA 2018, was the gender-oriented AP task GxG (Dell'Orletta and Nissim, 2018), which explored the gender-prediction problem.

The analysis of age, gender and the topic a text is related to are well-explored tasks, and most approaches employ data representations based on stylistic features, n-grams and/or word embeddings combined with Machine Learning (ML) methods such as Support Vector Machines (SVM) and Random Forest (Pizarro, 2019). Also, some authors using Deep Learning (DL) models such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks combined with stylistic features (Aragón and López-Monroy, 2018; Bayot and Gonçalves, 2018) have achieved encouraging performance.

In this work we address the automatic detection of the gender and age of authors, as well as the identification of the prevailing topic, in textual information from blogs. We describe the model we developed for participating in the TAG-it: Topic, Age and Gender prediction for Italian task (https://sites.google.com/view/tag-it-2020) (Cimino A., 2020) at EVALITA 2020 (Basile et al., 2020).

Taking into account the proven ability of DL models to learn abstract representations that are missed by hand-crafted feature engineering, our approach is mainly based on them, particularly on Bi-LSTM and Transformer networks (Vaswani et al., 2017). We combine the feature representations learned by DL models with hand-crafted ones based on Term Frequency-Inverse Document Frequency (tf-idf) and stylistic features. We hypothesize that building an ensemble of these deep representations and fusing it with the hand-crafted ones, as shown in Figure 1, could yield encouraging results on the proposed tasks.

This paper is organized as follows: the next section gives a brief description of the TAG-it subtasks. Then we present our proposal; specifically, we describe the data preprocessing as well as the DL methods and features used to represent the data. Finally, we present the experimental setting, the experiments conducted and the results achieved.
1 TAG-it Tasks

Three subtasks have been proposed in the TAG-it task:

• subtask 1: predicting the gender, the age (as an age range, e.g. 20-29) and the topic addressed by an author, given a collection of blog texts written by him/her, all three dimensions at once.

• subtask 2a: predicting gender only.

• subtask 2b: predicting age only.

For these tasks a training corpus of texts written by blog users, with possibly multiple posts per user, was provided. The information available for each user (i.e. the posts per user) varies in length and quantity, and the data for each subtask is unbalanced, mainly for the gender and topic prediction tasks, which adds some degree of complexity to the training stage of the models for these classification tasks.

2 Our Proposal

Deep Learning methods are able to learn and project relationships between elements of textual information that are beyond human abstract comprehension; therefore, using only hand-crafted representations may miss important patterns in textual information analysis. However, stylistic and linguistic features have proved to be good markers for determining some author characteristics. Among the DL models used in the AP field are LSTMs (Labadie-Tamayo et al., 2020) and Transformer networks, which rely on two different paradigms: the first analyzes the information sequentially, token by token, whereas the second analyzes all tokens at once, relating every one to each other. The opposite behavior of these two architectures implies learning different patterns, and each has individually proved to be an accurate way to synthesize the information.

[Figure 1: Representations Ensemble]

Our system combines four representations. The first (Transformer Block) is based on the Bidirectional Encoder Representations from Transformers (BERT) architecture (Devlin et al., 2018). The second is based on LSTM (Hochreiter and Schmidhuber, 1997) networks with a self-attention mechanism (Att-LSTM) over word embeddings (Recurrent Word-Level Block). The third is a condensed representation based on the combination of stylistic features and a vector with the tf-idf values of some key tokens from the text (Stylistic Block). Finally, the Recurrent Sentence-Level Block is another Att-LSTM-based representation, but this time analyzing the sequence information at sentence level.

All these representations are concatenated and fed into a dense layer with a Leaky Rectified Linear Unit (Leaky ReLU) activation function, which synthesizes the information extracted by each block; its output vector goes to a softmax dense layer with as many neurons as there are classes in the analyzed task, in order to make the predictions. For the three classification tasks we used the same architecture, but trained it separately for each of them, with different targets according to the task.
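To make the fusion step concrete, the sketch below builds the classification head described above in Keras. It is an illustration under assumptions rather than the authors' implementation: the per-block output dimensionalities and the size of the fusion layer are not fixed by the paper and are chosen here for the example; only the concatenation, the Leaky ReLU synthesis layer and the softmax output follow the description.

```python
# Minimal sketch (not the authors' code) of the fusion head combining the four
# block encodings. Per-block dimensionalities and the fusion-layer size are
# assumptions; the paper only fixes Leaky ReLU for the synthesis layer and a
# softmax output with one neuron per class.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fusion_head(num_classes,
                      dim_transformer=1536,   # assumed: first + last BERT vectors
                      dim_word_rnn=64,        # last hidden state, word-level LSTM
                      dim_sent_rnn=64,        # last hidden state, sentence-level LSTM
                      dim_stylistic=64,       # output of the stylistic dense layer
                      fusion_units=128):      # assumed size of the fusion layer
    # One input per representation block.
    t_in = layers.Input(shape=(dim_transformer,), name="transformer_block")
    w_in = layers.Input(shape=(dim_word_rnn,), name="recurrent_word_block")
    s_in = layers.Input(shape=(dim_sent_rnn,), name="recurrent_sentence_block")
    y_in = layers.Input(shape=(dim_stylistic,), name="stylistic_block")

    # Concatenate the four encodings and synthesize them with Leaky ReLU.
    fused = layers.Concatenate()([t_in, w_in, s_in, y_in])
    fused = layers.Dense(fusion_units)(fused)
    fused = layers.LeakyReLU()(fused)

    # Softmax output with as many neurons as classes in the addressed subtask.
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return Model(inputs=[t_in, w_in, s_in, y_in], outputs=out)

# e.g. gender prediction (2 classes); the same head is re-built and trained
# separately for each subtask, changing only num_classes.
model = build_fusion_head(num_classes=2)
```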
2.1 Preprocessing

In the preprocessing stage we concatenate the posts belonging to the same user, in order to treat them as a single super-document; between consecutive posts we place a tag, i.e. <post>, marking the end of one post and the beginning of the next. Afterwards, numbers and dates are recognized and replaced by wildcards that encode the meaning of these special tokens. Then the text is tokenized and morphologically analyzed by means of FreeLing (Padró and Stanilovsky, 2012). For computing the stylistic and tf-idf vectors, as well as for feeding the deep models in the prevailing-topic detection task, we removed the stop words from the document and lemmatized the tokens to their canonical form.
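The following sketch illustrates the post-concatenation and wildcard-replacement steps. The separator string, wildcard names and regular expressions are assumptions made for the example; tokenization and lemmatization with FreeLing are not reproduced here.

```python
# Minimal sketch (under assumptions) of the preprocessing described above:
# a user's posts are joined into one super-document with a <post> separator,
# and dates/numbers are replaced by wildcards. Separator and wildcard names
# are illustrative choices, not taken from the paper.
import re

POST_TAG = "<post>"          # assumed surface form of the separator tag
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
NUM_RE = re.compile(r"\b\d+([.,]\d+)?\b")

def build_super_document(posts):
    """Concatenate a user's posts into one super-document."""
    return f" {POST_TAG} ".join(p.strip() for p in posts)

def replace_special_tokens(text):
    """Replace dates and numbers with wildcards encoding their meaning."""
    text = DATE_RE.sub("<date>", text)    # assumed wildcard name
    text = NUM_RE.sub("<number>", text)   # assumed wildcard name
    return text

posts = ["Oggi è il 12/05/2020.", "Ho corso 10 km stamattina."]
doc = replace_special_tokens(build_super_document(posts))
print(doc)  # Oggi è il <date>. <post> Ho corso <number> km stamattina.
```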
2.2 Transformer Block: BERT

BERT (Bidirectional Encoder Representations from Transformers) is an architecture that results from applying bidirectional training to the attention-based Transformer model, designed for language modeling. The Transformer model has two components: the first one, known as the encoder, is fed with the text and produces an encoded representation of the sequence; the second one, the decoder, produces the predicted tokens for language modeling one at a time, taking into account the encoder's output and the tokens predicted at previous time steps.

The main advantage of Transformer models w.r.t. traditional sequential architectures like the Gated Recurrent Unit (GRU) (Cho et al., 2014) is that, instead of analyzing the textual information in a single direction (e.g. right to left or left to right), they take the entire information into account at once by means of an attention mechanism, which relates each word in the text to its surrounding context.

Since the goal of BERT is to generate a language representation, only the encoder is necessary. It is structured as transformer blocks connected sequentially, each composed of attention heads working in parallel. Each transformer block gives its subsequent layer one representation for every element of the input text, and these representations are correlated with the entire input context. The original BERT model is trained on two subtasks: one consists of predicting masked words in a sentence, and the other consists of predicting whether two sentences are consecutive in the training corpus.

For the TAG-it tasks we employed a BERT model pre-trained on a multilingual corpus (multilingual L-12 H-768 A-12, https://github.com/google-research/bert) (Turc et al., 2019), which is fed with the super-document sequence. From this model we used only the first two transformer blocks, and as its output we keep the first and last vectors of the input sequence encoding, which are concatenated. We also fine-tuned BERT, adding an intermediate dense layer of 64 units with Leaky ReLU activation and training it with a multi-task objective that tries to predict age, topic and gender at once.
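As a hedged illustration of how such an encoding can be obtained with the Hugging Face transformers library (not necessarily the toolkit used by the authors), the sketch below takes the hidden states after the second transformer layer of a multilingual BERT checkpoint and concatenates the first and last token vectors; the checkpoint name and the maximum sequence length are assumptions.

```python
# Minimal sketch (not the authors' code) of the Transformer Block encoding:
# keep only the output of the second transformer layer and concatenate the
# first and last token vectors of the sequence encoding.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = TFBertModel.from_pretrained("bert-base-multilingual-cased",
                                   output_hidden_states=True)

def transformer_block_encoding(text, max_len=256):
    enc = tokenizer(text, truncation=True, max_length=max_len,
                    return_tensors="tf")
    outputs = bert(**enc)
    # hidden_states[0] is the embedding layer; index 2 is the output after the
    # second transformer block, approximating the paper's use of only the
    # first two blocks.
    layer2 = outputs.hidden_states[2][0]        # (seq_len, 768)
    first, last = layer2[0], layer2[-1]         # first and last token vectors
    return tf.concat([first, last], axis=0)     # (1536,)

vec = transformer_block_encoding("Questo è un esempio di super-documento.")
print(vec.shape)  # (1536,)
```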
2.3 Recurrent Word-Level Block

The second representation block of our system is based on LSTM networks. It takes as input a sequence from the preprocessed text, which is fed into an embedding layer whose weights are fixed to FastText (Grave et al., 2018) pretrained word embeddings (https://fasttext.cc/docs/en/crawl-vectors.html), obtaining a vector representation for each word of the sequence. Not all of the textual sequence carries information that is relevant to the task under analysis. In order to highlight the elements most important for encoding the message, instead of making the network pay equal attention to all elements, the tokens output by the embedding layer are scored by their relative importance over the other elements in their context with the Scaled Dot-Product Attention mechanism (Vaswani et al., 2017). Then the rescored sequence is fed into a Bidirectional LSTM (Bi-LSTM) (Schuster and Paliwal, 1997) layer with 64 neurons, which performs two analyses over the sequence, in forward and backward directions, so as to detect not only relations of an element with the previous ones, but also with the elements that appear after it. Afterwards, the hidden states of the Bi-LSTM layer are treated as a new sequence, which is fed into another LSTM, also with 64 neurons; from its output we take just the last hidden state, which constitutes the Recurrent Word-Level Block encoding. When training this block we applied dropout (Srivastava et al., 2014) to the neurons of the attention and LSTM layers in order to improve the generalization capability of the model.

2.3.1 Scaled Dot-Product Attention

This attention function first maps each sequence token to three representations (a query and a key-value pair) used to compute a compatibility index between every pair of elements. Then, for each token t_i, its compatibility w.r.t. every other sequence token t_j is evaluated by relating its query vector q_i to all the keys k_j; these compatibilities c_ij are normalized with a softmax function and used to weight the value vectors v_j for that specific query. Finally, the attention-based representation of t_i is computed as the weighted sum of these value vectors. This computation is defined as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V    (1)

where Q, K ∈ R^{n×d_k} and V ∈ R^{n×d_v} are matrices whose rows contain, respectively, the query, key and value mappings of the sequence tokens, n is the length of the sequence, and d_k, d_v are the dimensions of the key and value mapping vectors.
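A minimal sketch of Equation (1) and of the word-level block around it is given below, assuming the embeddings themselves play the roles of queries, keys and values; vocabulary size, embedding dimension, sequence length and dropout rates are illustrative choices.

```python
# Minimal sketch (a simplification, not the authors' implementation) of
# scaled dot-product self-attention over fixed embeddings, followed by a
# 64-unit Bi-LSTM and a second 64-unit LSTM whose last hidden state is the
# Recurrent Word-Level Block encoding.
import tensorflow as tf
from tensorflow.keras import layers, Model

def scaled_dot_product_attention(x):
    # Here the embedded tokens act as Q, K and V at once.
    d_k = tf.cast(tf.shape(x)[-1], tf.float32)
    scores = tf.matmul(x, x, transpose_b=True) / tf.sqrt(d_k)  # (n, n) compatibilities
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, x)                               # Equation (1)

def build_word_level_block(vocab_size=50000, emb_dim=300, max_len=512):
    tokens = layers.Input(shape=(max_len,), dtype="int32")
    # In the real system the embedding weights are fixed to FastText vectors.
    emb = layers.Embedding(vocab_size, emb_dim, trainable=False)(tokens)
    att = layers.Lambda(scaled_dot_product_attention)(emb)
    att = layers.Dropout(0.3)(att)                             # assumed rate
    hidden = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(att)
    encoding = layers.LSTM(64, dropout=0.3)(hidden)            # last hidden state
    return Model(tokens, encoding, name="recurrent_word_level_block")

block = build_word_level_block()
block.summary()
```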
2.3.2 LSTM

LSTM networks are a special kind of RNN specialized in analyzing sequential data. They have a main cell (the recurrent unit) that explores the data sequence one element at each time step, in left-to-right order. The network shares the information captured at previous steps when computing the new hidden state at the current time step. The main cell contains a gate structure that tells the network which information to preserve or forget from the hidden states of previous time steps for the current computation.

2.4 Stylistic Block: Stylistic Features

The representation based on stylistic features is twofold. On the one hand, to characterize a user for a given classification task we consider a vector containing the tf-idf of a set of key tokens from the text; on the other hand, we construct a statistical style-feature vector that captures information from distinct lexical and syntactic linguistic layers.

To construct the first vector we used a feature selection approach that scores every term employed by the users belonging to a given category of a classification task and then selects the most relevant ones. For scoring the tokens we use Information Gain (IG) (Sebastiani, 2002), which takes into account both the presence of a term in a category and its absence. The information gain of a term t for a class C is defined as:

IG(t, C) = \sum_{c \in \{C, \bar{C}\}} \sum_{x \in \{t, \bar{t}\}} P(x, c) \log_2 \frac{P(x, c)}{P(x)\,P(c)}    (2)

In this formula, probabilities are interpreted over an event space of documents (e.g. P(t̄, C) indicates the probability that, for a random document d, term t does not occur in d and d belongs to category C). Once the IG of every term occurring in documents of class c_i has been computed, the 500/l_c tokens with the highest IG are chosen to characterize this class, where l_c is the number of classes of the task. Finally, a 500-dimensional vector is constructed whose components are the tf-idf values of the representative terms of every class.

The second representation is computed independently of the addressed task as a 12-dimensional vector whose components are real numbers corresponding to statistical values from lexical and syntactic linguistic layers (e.g. sentence, paragraph and syntactic layers), such as:

• Paragraph layer: standard deviation of the length of the sentences written by the user.

• Text layer: number of stop words used.

• Sentence layer: average word length.

• Syntactic layer: proportion of nouns over adjectives.

These two representations are combined and fed into a 64-neuron dense layer that synthesizes the information before it is fused with the representations of the other blocks.
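The sketch below illustrates the term-selection step under stated assumptions: Equation (2) scores the terms of each class, the 500/l_c best terms per class are kept, and a tf-idf vector over the selected terms is built with scikit-learn's TfidfVectorizer. The toy corpus, the crude tokenization and the deduplication of terms shared across classes are choices made for the example, not details from the paper.

```python
# Minimal sketch (an illustration, not the authors' code) of the key-token
# selection of the Stylistic Block: score terms with Information Gain,
# keep the top 500 / l_c per class, and build tf-idf vectors over them.
import math
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def information_gain(term, doc_sets, labels, target_class):
    """IG(t, C) over an event space of documents, as in Equation (2)."""
    n = len(doc_sets)
    ig = 0.0
    for in_class in (True, False):                      # c in {C, not C}
        for has_term in (True, False):                  # x in {t, not t}
            joint = sum(1 for d, y in zip(doc_sets, labels)
                        if (term in d) == has_term
                        and (y == target_class) == in_class) / n
            p_x = sum(1 for d in doc_sets if (term in d) == has_term) / n
            p_c = sum(1 for y in labels if (y == target_class) == in_class) / n
            if joint > 0:
                ig += joint * math.log2(joint / (p_x * p_c))
    return ig

def select_key_terms(docs, labels, total_terms=500):
    doc_sets = [set(d.lower().split()) for d in docs]   # crude tokenization (assumption)
    classes = sorted(set(labels))
    per_class = total_terms // len(classes)             # 500 / l_c terms per class
    selected = []
    for c in classes:
        vocab = Counter(t for d, y in zip(doc_sets, labels) if y == c for t in d)
        ranked = sorted(vocab, reverse=True,
                        key=lambda t: information_gain(t, doc_sets, labels, c))
        selected.extend(ranked[:per_class])
    return list(dict.fromkeys(selected))                # drop terms shared across classes

docs = ["oggi parlo di calcio e di sport",
        "una ricetta di pasta al pomodoro",
        "la partita di calcio di ieri sera",
        "dolci e torte per la cena"]
labels = ["sport", "cucina", "sport", "cucina"]
key_terms = select_key_terms(docs, labels, total_terms=4)
vectorizer = TfidfVectorizer(vocabulary=key_terms)
tfidf_vectors = vectorizer.fit_transform(docs)  # one tf-idf vector per document
print(key_terms, tfidf_vectors.shape)
```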
2.5 Recurrent Sentence-Level Block

This block shares the structure of the Recurrent Word-Level Block but, instead of being fed with a sequence of word representations provided by an embedding layer, it is fed with a sequence obtained by encoding each sentence of the super-document by means of an encoder with a structure similar to that of the Transformer Block described above. For this block we trained the sentence encoder with the same multi-task objective as in the Transformer Block, but aiming to predict, for each sentence of a document, the annotated characteristics (i.e. age and gender) of the user it belongs to and the topic of its surrounding text. We then encode all the sentences of the super-document composed of the user's posts and treat them as the tokens of a sentence-level sequence. Afterwards, that sequence is fed into an Att-Bi-LSTM model with the same structure as the Recurrent Word-Level Block, taking as the user's profile encoding the last hidden state of the second LSTM layer, as in the word-level block.

3 Experiments and Results

The dataset used in this work was the one provided by the task organizers. It is unbalanced, mainly for the gender classification task, where the male class represents 82.6% of the examples. In order to prevent a biased training of the model we applied a class-weighting method, scaling the loss computed for every example according to the class it belongs to (i.e. for examples from the male class the loss is weighted by 0.3, whereas for female examples it is weighted by 0.7). In this way, when the parameters are updated by means of the gradients, the model pays more attention to the most heavily weighted class, specifically the under-represented one.
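A minimal sketch of this class weighting is shown below, assuming class index 0 for male and 1 for female; with Keras the same effect can be obtained by passing the weights directly to fit().

```python
# Minimal sketch (illustrative, with assumed class indices) of the class
# weighting used for the gender subtask: male examples weighted by 0.3,
# female examples by 0.7, so gradient updates emphasize the minority class.
import tensorflow as tf

class_weight = {0: 0.3, 1: 0.7}   # assumed: 0 = male, 1 = female

# With Keras, the weights can be passed directly to training:
#   model.fit(x_train, y_train, class_weight=class_weight, ...)

# Equivalently, the weighting applied to a per-example cross-entropy loss:
def weighted_categorical_crossentropy(y_true, y_pred, weights=(0.3, 0.7)):
    per_example = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    w = tf.reduce_sum(y_true * tf.constant(weights, dtype=y_pred.dtype), axis=-1)
    return per_example * w
```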
We pre-train the Transformer models of the Transformer Block and the sentence encoder of the Recurrent Sentence-Level Block independently of the full model, and then we fix the learned weights. For fine-tuning these BERT models we employ the Adam optimizer with a categorical cross-entropy loss function for every output layer, since we applied multi-task learning, over two epochs. The learning rate for this training was set to a low value (lr = 1e-5), since we wanted to preserve as much as possible the parameters learned during the original training on a huge amount of data while making the model focus on our addressed tasks; we also set a decay of 2e-3 in the learning rate scheduler.

We evaluated and selected the hyper-parameters, as well as the representations and features used by our model, with a cross-validation method, in order to obtain a more realistic and unbiased performance estimate, using 5 validation splits. At each cross-validation step the dataset was split into 20% for validation and 80% for training, keeping the distribution of examples relative to the split size.

The performance of the model at training stage was evaluated independently for each subtask using different combinations of the representations from the Recurrent Word-Level Block (RNN-W), the Recurrent Sentence-Level Block (RNN-S), the Transformer Block (T) and the Stylistic Block (STY). For age and gender prediction we used the Micro-F1 metric, whereas for topic prediction we used accuracy. In Table 1 we summarize the results obtained in terms of the average of these metrics over the cross-validation training.

Table 1: Model performance on training data (cross-validation averages).

Model            Age (AVG-F1)   Gender (AVG-F1)   Topic (Acc)
RNN(S+W)-STY-T   0.378          0.941             0.935
RNN(S+W)-T       0.203          0.946             0.885
RNNS-STY-T       0.348          0.940             0.931
RNNW-STY-T       0.339          0.919             0.903

As we can see, assembling the three deep representations with the stylistic one yields a good performance in all cases throughout the cross-validation process. However, the stylistic representation had a slight negative influence on the gender prediction task.

Regarding the official results, we submitted 3 runs as the UOBIT team; in each of them we employed the representations learned by the Transformer and Stylistic Blocks while varying the use of the Recurrent Blocks' encodings, as shown in Table 2.

Table 2: Model performance on test data.

run         Model                 Subtask 1              Subtask 2a   Subtask 2b
                                  Metric 1   Metric 2    Micro-F1     Micro-F1
run-1       RNN-W T STY           0.686      0.251       0.852        0.278
run-2       RNN-S T STY           0.674      0.243       0.883        0.370
run-3       RNN-W RNN-S T STY     0.699      0.251       0.893        0.308
Unofficial  RNN-W RNN-S T         0.680      0.248       0.898        0.468
Unofficial  RNN-W RNN-S           0.667      0.243       0.893        0.369
Unofficial  T                     0.436      0.067       0.835        0.283

After the evaluation phase we tried removing the representation based on stylistic features, and we found that this representation, possibly because it introduces some noise, worsens the performance of the model, at least on the tasks related to author attributes (i.e. gender and age), corresponding to subtasks 2a and 2b. We think the noise introduced by these features mainly comes from the fact that they are computed from key tokens of the text; such tokens may suggest to the model that texts on the same topic belong to the same class in the gender or age classification task. Using only the deep representations of the Recurrent and Transformer Blocks, our system reaches 0.4606 under the F1 metric on subtask 2b, which improves on the 0.409 reached by the best team, and this same combination improves on our best official run for subtask 2a. These results are shown in Table 2 in the rows labeled Unofficial.

The results show that considering both the stylistic representation and the deep representations learned by the Recurrent and Transformer models yields the best effectiveness, in terms of accuracy, for the topic classification task; this behavior changes for age and gender classification, due to the relationship between the syntactic structures of the text and the topic the user's posts are related to. We think that excluding the stylistic features, or at least those related to the frequency of tokens in the text, could be a way to increase the effectiveness of the ensemble, mainly on the age detection subtask. Also, analyzing the content of the posts at character level, given the informal origin of the texts, would mitigate the misidentification of some key words within the text. We would like to explore these ideas in future work.

4 Conclusions

In this paper we described our system for participating in the TAG-it Author Profiling task at EVALITA 2020. Our proposal is based on an ensemble of RNNs, Transformer neural networks and hand-crafted stylistic features. The system receives as input the textual information of a user's profile as a single super-document (sequence); this information is encoded in four different ways: first, by a Transformer Block, specifically a fine-tuned and reduced BERT model; second, by a Recurrent Block based on an Attention-Bi-LSTM model analyzing the information at word level; third, by a feature representation based on the combination of tf-idf information and stylistic features extracted from the text; and fourth, by the same recurrent structure as in the Recurrent Word-Level Block, but analyzing the information at sentence level. These four representations are mixed and fed into a dense layer that synthesizes them, and its output is received by another dense layer that classifies the profile according to the classes of the addressed subtask.

References

Mario Ezra Aragón and A-Pastor López-Monroy. 2018. A straightforward multimodal approach for author profiling. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018).

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Roy Khristopher Bayot and Teresa Gonçalves. 2018. Multilingual author profiling using LSTMs: Notebook for PAN at CLEF 2018. In CLEF (Working Notes).

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Cimino A., Dell'Orletta F., and Nissim M. 2020. TAG-it@EVALITA2020: Overview of the topic, age, and gender prediction task for Italian.

Felice Dell'Orletta and Malvina Nissim. 2018. Overview of the EVALITA 2018 cross-genre gender prediction (GxG) task. EVALITA Evaluation of NLP and Speech Tools for Italian, 12:35.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Roberto Labadie-Tamayo, Daniel Castro-Castro, and Reynier Ortega-Bueno. 2020. Fusing stylistic features with deep-learning methods for profiling fake news spreader - Notebook for PAN at CLEF 2020. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org, September.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Juan Pizarro. 2019. Using n-grams to detect bots on Twitter. In CLEF (Working Notes).

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962v2.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.