
CerpamidUA at MexA3T 2019: Transition Point Proposal

Daniel Castro Castro

María Fernanda Artigas Herold

maria.artigas@estudiantes.uo.edu.cu 2

Reynier Ortega Bueno

reynier.ortegag@cerpamid.co.cu 0

Rafael Muñoz

rafael@dlsi.ua.es 1

0 Center for Pattern Recognition and Data Mining, Cuba
1 Department of Software and Computing Systems, Alicante University, Spain
2 Oriente University, Cuba

2019

502 507

Author Profiling is an important field concerned with detecting demographic characteristics of users from the texts they write. Our main contribution is the selection of a reduced subset of features representing frequent lexical words for each profile of Mexican Twitter users. The new subset of features was obtained from the frequency of words in a profile (e.g., students), employing the theory of Transition Points. All objects are represented in the new feature space formed by the reduced subsets computed for each class or profile. The classification phase was carried out using Support Vector Machines provided by the Weka platform. The results obtained were good for Gender, but more effort is needed for Location and Occupation: the main factor affecting the results is the unbalanced class distribution, which impacts the construction of the reduced vocabulary.

Author Profiling · Transition Point · Mexican Twitter Profiling

Modern society is characterized by an impressive use of digital technology, in particular for socializing on Social Network platforms where emotions, ideas, new information, etc., are expressed. Users share information using images, text, videos and other resources. All the publicly available information of a user, in particular text and images, could be used to determine demographic attributes such as gender, age, personality, level of education and others; this is the key question studied in the field of Author Profiling (AP).

In 2018, the MexA3T task was proposed for Author Profiling and Aggressiveness analysis focused on Mexican tweets [3]. The AP task comprises the detection of the Place of Residence and Occupation of a user profile based on the set of tweets written by that user. As stated in the overview [3], it was a challenging task, and for that reason the organizers relaunched a similar task, now including the analysis of Gender.

An important difference of this year's task [1] with respect to the previous one is that a user profile is distributed not only with the text of the tweets but also with images. This allows the use of text and images for profile classification, although it is not necessary to use both sources of information. For several years the principal evaluation forum for authorship analysis has been the PAN Lab at CLEF, which has evaluated the AP task [5] considering the identification of Gender, Personality, Age, etc.

Four teams participated in the MexA3T 2018 AP task [9] [2] [6] [8]; the majority of them used an approach based on SVM classification and a text representation employing character n-grams and lexical tokens as features. The MXAA team [9] was on average the top ranked, and it used feature selection and term weighting strategies that allowed it to achieve very good results.

Proposal for MexA3T 2019

Our main contribution is the selection of a reduced subset of features representing frequent lexical words for each profile of Mexican tweet writers. The new subset of features was obtained from the frequency of words in a profile (e.g., students), using the theory of the Transition Point [7]. All objects are represented in the new feature space formed by the reduced subsets of each class or profile. The classification phase was performed using Support Vector Machines provided by the Weka [4] platform with the default configuration.

Transition Point

The architecture for the dimensionality reduction of the vocabulary based on the Transition Point method is illustrated in Figure 1. The Transition Point (TP) refers to a frequency value in the vocabulary that delimits a frontier around which the terms of the vocabulary are relevant to the class and have a high presence in objects of that class. It is based on the fundamentals studied and proposed by Zipf [11], who formulated the law of word frequencies in a text, Zipf's Law. We first build a vocabulary for each profile (e.g., a vocabulary for the male profile and a vocabulary for the female profile), and each term of the vocabulary is associated with its frequency of occurrence in the tweets of the corresponding profile. The TP is calculated for each profile vocabulary (Vp), and a percentage of the tokens with frequency close to the TP value is then selected. The new vocabulary for a profile class (e.g., Gender) is formed by the union of the tokens present in the reduced vocabularies obtained for each profile.
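The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not give the TP formula, so the sketch assumes the expression commonly used in the Transition Point literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once. The function names, the rounding of the percentage, and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed formula: TP = (sqrt(8 * I1 + 1) - 1) / 2, I1 = number of terms seen once."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def reduced_vocabulary(tokens, percent):
    """Keep the given percentage of terms whose frequency is closest to the TP."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    keep = max(1, round(len(freqs) * percent / 100))
    # Rank terms by distance to the TP; ties are broken alphabetically
    # so that the selection is deterministic.
    ranked = sorted(freqs, key=lambda t: (abs(freqs[t] - tp), t))
    return set(ranked[:keep])


# Toy token streams, one per profile (hypothetical data).
female = "a a a a b b b c c d e f".split()
male = "x x x x x y y y z z w v u".split()

# The class-level vocabulary is the union of the per-profile reduced vocabularies.
vocabulary = reduced_vocabulary(female, 50) | reduced_vocabulary(male, 50)
```

With these toy inputs each profile has three terms of frequency one, so the assumed formula gives a TP of 2 and the terms with frequency nearest to 2 are retained.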

Tweet representation

A profile is made up of several tweets written by a user. We consider a tweet as a document and represent it by the tokens extracted using a Natural Language Processing tool (NLPt). We used the FreeLing [10] NLPt and built a first representation based on the tokens extracted by its tokenizer. A second representation was built considering the lemmas of the tokens. In each of these representations, the features are weighted by a normalized frequency of occurrence.
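As a rough illustration of this representation, the following Python sketch maps a tokenized tweet onto a fixed vocabulary. The exact normalization is not specified in the text; the sketch assumes that each count is divided by the number of in-vocabulary tokens of the tweet. The function name and the example vocabulary are hypothetical, and the token list stands in for the output of any tokenizer or lemmatizer.

```python
from collections import Counter


def tweet_vector(tokens, vocabulary):
    """Map a tokenized tweet to normalized term frequencies over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    # Assumption: normalize by the number of in-vocabulary tokens; a tweet that
    # shares no token with the vocabulary yields an all-zero vector.
    return [counts[t] / total if total else 0.0 for t in vocabulary]


vocabulary = ["clase", "examen", "tarea"]            # hypothetical reduced vocabulary
tweet = ["mañana", "examen", "y", "tarea", "tarea"]  # tokens (or lemmas) of one tweet
vector = tweet_vector(tweet, vocabulary)
```

The same function applies to both representations: only the input changes, from surface tokens to lemmas.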

Machine Learning Method

The supervised classification phase is done using the SVM implemented in the Weka platform with the default parameters. A user profile is made up of all the tweets written by that user; after each tweet is represented in the new reduced vocabulary, a prototype is built as the centroid of all the tweets.
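The prototype construction can be sketched as a component-wise mean of the tweet vectors of one user. This is a minimal Python illustration with hypothetical names and data; it covers only the centroid step, and the resulting prototype would then be passed to the classifier (an SVM in Weka, which is not reproduced here).

```python
def centroid(vectors):
    """Component-wise mean of equally sized tweet vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Three tweet vectors of one hypothetical user over a 3-term vocabulary.
user_tweets = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
prototype = centroid(user_tweets)
```

Each user is thus reduced to a single vector of the same dimension as the vocabulary, regardless of how many tweets the profile contains.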

Evaluation, Results and Discussion

The distributed dataset contains profiles labeled for three classes: Gender, Location and Occupation [1]; the difference with respect to the MexA3T 2018 task is the Gender class. In particular, the Gender dataset is balanced between its two classes, female and male, but the Location and Occupation datasets are unbalanced. The evaluation was made using the F-measure by class, the accuracy and the average F-measure in a profile.
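For reference, these measures can be computed as in the following Python sketch. It assumes that the average F-measure is the unweighted (macro) mean of the per-class F-measures, which the text does not state explicitly; the function name and the toy labels are hypothetical and are not results from the task.

```python
def evaluate(gold, predicted):
    """Return accuracy, per-class F-measure, and their unweighted (macro) average."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    f_by_class = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_by_class[label] = 2 * precision * recall / denom if denom else 0.0
    macro_f = sum(f_by_class.values()) / len(labels)
    return accuracy, f_by_class, macro_f


# Toy labels for illustration only.
gold = ["female", "female", "male", "male"]
predicted = ["female", "male", "male", "male"]
accuracy, f_by_class, macro_f = evaluate(gold, predicted)
```

Because the macro average weights every class equally, a minority class with a low F-measure lowers the average even when accuracy stays high, which is relevant for unbalanced datasets.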

The run CerpamidUA-Gender-Text-run1 used as vocabulary 1 percent of the tokens extracted from the vocabulary of each class, with the representation based on words extracted by a tokenizer. The run CerpamidUA-Gender-Text-run2 considered 10 percent of the tokens and the representation based on lemmas. Table 1 shows the results obtained for gender classification.

Table 1. Results obtained for gender classification.

Team                              F(P,R)  Acc   P     R
CerpamidUA-Gender-Text-run2       0.83    0.83  0.84  0.83
CerpamidUA-Gender-Text-run1       0.83    0.83  0.83  0.83
CIC-VCR-Secondary-Gender-Image    0.52    0.52  0.52  0.52
CIC-VCR-Gender-Image              0.47    0.48  0.48  0.48

The results obtained by run2 are similar to those of run1. In general the results are good, which we attribute to the balance between the two classes, male and female. It is also important to notice that the representation based on lemmas has a lower dimension than the representation based on tokens, and that building a new vocabulary with the TP reduced the dimension dramatically while still obtaining good results.

Table 2 shows the results obtained for Location classification.

The results for Location classification are modest; we suppose that this drop is caused by the unbalance of the dataset. The majority classes get the best results, while the classes with few profiles achieve worse values. The accuracy values reflect that the objects of the majority class are classified very well. The main problem is related to the constructed vocabulary, because a class with few objects contributes fewer tokens of its own.

Table 3 shows the results obtained for Occupation classification; the analysis of the results leads to conclusions similar to those explained for Location classification. In classes with few documents the results were low, determined by the scarce variety of the words of these classes in the vocabulary generated using the TP.

Very good results were obtained in the identification of gender, conditioned by the balance between classes. The weight of the features should be evaluated considering the difference between the dictionaries per class and the importance of each word in the new reduced vocabulary.

1. Aragón, M.E., Álvarez-Carmona, M.Á., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L., Moctezuma, D.: Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets. In: Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Bilbao, Spain, September (2019)

2. Aragón, M.E., López-Monroy, A.P.: Author profiling and aggressiveness detection in Spanish tweets: MEX-A3T 2018. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 134-139 (2018), http://ceur-ws.org/Vol-2150/MEX-A3T_paper7.pdf

3. Álvarez-Carmona, M.Á., Guzmán-Falcón, E., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L., Reyes-Meza, V., Rico-Sulayes, A.: Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 74-96 (2018), http://ceur-ws.org/Vol-2150/overviewmex-a3t.pdf

4. Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques" (2016)

5. Rangel Pardo, F.M., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th author profiling task at PAN 2018: Cross-domain authorship attribution and style change detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Evaluation Labs and Workshop, Working Notes Papers, 10-14 September, Avignon, France. CEUR-WS.org (Sep 2018), http://ceur-ws.org/Vol-2125/

6. Graff, M., Miranda-Jiménez, S., Tellez, E.S., Moctezuma, D., Salgado, V., Ortiz-Bejar, J., Sánchez, C.N.: INGEOTEC at MEX-A3T: Author profiling and aggressiveness analysis in Twitter using µTC and EvoMSA. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 128-133 (2018), http://ceur-ws.org/Vol-2150/MEX-A3T_paper6.pdf

7. Jiménez-Salazar, H., Pinto, D., Rosso, P.: Uso del punto de transición en la selección de términos índice para agrupamiento de textos cortos. Procesamiento del Lenguaje Natural 35 (2005), http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/2991/1485

8. Markov, I., Gómez-Adorno, H., Rosales, M.J., Sidorov, G.: CIC-GIL approach to author profiling in Spanish tweets: Location and occupation. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 97-101 (2018), http://ceur-ws.org/Vol-2150/MEX-A3T_paper1.pdf

9. Ortega-Mendoza, R.M., López-Monroy, A.P.: The winning approach for author profiling of Mexican users in Twitter at MEX-A3T@IberEval-2018. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 140-148 (2018), http://ceur-ws.org/Vol-2150/MEX-A3T_paper8.pdf

10. Padró, L., Stanilovsky, E.: FreeLing 3.0: Towards wider multilinguality. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012. pp. 2473-2479 (2012), http://www.lrec-conf.org/proceedings/lrec2012/summaries/430.html

11. Zipf, G.K.: Human behaviour and the principle of least effort. Addison-Wesley (1949)