=Paper=
{{Paper
|id=Vol-2421/MEX-A3T_paper_2
|storemode=property
|title=CerpamidUA at MexA3T 2019: Transition Point Proposal
|pdfUrl=https://ceur-ws.org/Vol-2421/MEX-A3T_paper_2.pdf
|volume=Vol-2421
|authors=Daniel Castro Castro,María Fernanda Artigas Herold,Reynier Ortega Bueno,Rafael Muñoz
|dblpUrl=https://dblp.org/rec/conf/sepln/Castro-CastroHB19
}}
==CerpamidUA at MexA3T 2019: Transition Point Proposal==
<pdf width="1500px">https://ceur-ws.org/Vol-2421/MEX-A3T_paper_2.pdf</pdf>
<pre>
CerpamidUA at MexA3T 2019: Transition Point
                Proposal

 Daniel Castro Castro1[0000−0001−9102−7601] , Marı́a Fernanda Artigas Herold2 ,
                Reynier Ortega Bueno1 , and Rafael Muñoz3
               1
               Center for Pattern Recognition and Data Mining, Cuba
               {daniel.castro, reynier.ortega}@cerpamid.co.cu
                            http://www.cerpamid.co.cu
                             2
                               Oriente University, Cuba
                      maria.artigas@estudiantes.uo.edu.cu
     3
       Department of Software and Computing systems, Alicante University, Spain
                                rafael@dlsi.ua.es
                               http://www.dlsi.ua.es


        Abstract. Author Profiling is an important field for detection of demo-
        graphic characteristics of users based on texts written by him. Our main
        contribution is focused in determining a reduced subset of features that
        represent frequent lexical words for each profile of Mexican twitters. The
        new subset of features was obtained considering the frequency of words
        in a profile (e.g.: students), employing the theory of Transition Points.
        All the objects are represented in this new feature space conformed by
        all the reduced subset computed for each class or profile. The classifi-
        cation phase was carried out using Support Vector Machines provided
        by the Weka platform. The results obtained were good for Gender, but
        needs more efforts for Location and Occupation, because, the main fac-
        tor that affects the results correspond to scenarios with unbalanced class
        distribution that impact the construction of the reduced vocabulary.

        Keywords: Author Profiling · Transition Point · Mexican Twitter Pro-
        filing.


1     Introduction
The modern society is characterized by an impressive use of digital technology
and in particular to socialize using Social Network platforms in which emotions,
ideas, new information, etc, are expressed. Users share their information using
image, text, videos and other resources. All the available public information of
an user, and in particular text and image, could be used to determine demo-
graphic attributes of him, such as, gender, age, personality, level of scholarship
and others, and this is the key question in study in the field of Author Profiling
    Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 Septem-
    ber 2019, Bilbao, Spain.
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)


(AP) analysis.
In 2018, it was proposed the MexA3T task for Author Profiling and Aggres-
siveness analysis focused on Mexican tweets [3]. The AP task comprises the
detection of Place of Residence and Occupation of an user profile based on the
set of tweets written by him. As it was exposed in the overview [3], it was a
challenging task and for that reason they relaunch a similar task; including the
analysis of Gender characteristics.
An important difference of this year [1] with respect to the previous task is that
an user profile is distributed not only using the text of the tweets, but also im-
ages were incorporated on the profiles. This will allow the use of Text and Image
for profiling classification and it is not necessary to use both information.
The principal evaluation Forum for Authorship Analysis over several years has
been the PAN Lab at CLEF and in particular it has evaluated the AP [5] task
considering the identification of Gender, Personality, Age, etc.
In MexA3T 2018 AP task, participated 4 teams [9] [2] [6] [8], the majority of
them used an approach based on SVM classification and representation of text
employing as features n-grams of character and lexical tokens. The MXAA [9]
team was in average the top ranked and it used a feature selection and term
weighting strategies that allowed them to achieve very good results.


2     Proposal for MexA3T 2019


Our main contribution is focused in determining a reduced subset of features
that represent frequent lexical words for each profile of Mexican tweets writers.
The new subset of features was obtained considering the frequency of words
in a profile (e.g.: students), by using of, the theory of Transition Point [7]. All
the objects are represented in this new feature space conformed by all the re-
duced subset in each class or profile. The classification phase was performed
using Support Vector Machines provided by the Weka [4] platform with default
configuration.


2.1   Transition Point


The architecture for the dimensionality reduction of the vocabulary based on
Transition Point Method is illustrated in the Figure 1.


                                          503
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)


      Fig. 1. Architecture for dimensionality reduction using Transition Point


    Transition Point (TP), refers to a frequency value in the vocabulary that
delimit a frontier in which the terms of the vocabulary are relevant to the class
and with high presence in objects of that class. It is based on the fundamentals
studied and proposed by [11], who formulated the Law of word frequencies in
a text, Zipf’s Law. We first build a vocabulary for each profile (e.g., a vocabu-
lary for male profile and a vocabulary for female profile) and each term of the
vocabulary is associated with the frequency of occurrence in the tweets of its
correspondence profile. The TP is calculated for each vocabulary profile (Vp )
and using this, it is selected a percentage of tokens with frequency close to the
value of TP. The new vocabulary for a profile class (Gender Profile) is formed
by the union of the tokens present in the reduced vocabulary obtained for each
profile.

2.2   Tweet representation
The profiles are conformed by several tweets written by users. We consider a
tweet as a document and represent the tweet by the tokens extracted using a
Natural Language Processing Tools (NLPt). We used the FreeLing [10] NLPt and
executed a first representation based on the tokens extracted by the tokenizer.
A second representation was built considering the lemmas of the tokens. In each
of these representations, the features are weighted by a normalized frequency of
occurrence.


                                          504
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)


2.3   Machine Learning Method
The supervised classification phase is done using SVM implemented in Weka
platform with the default parameters. An user profile is conformed by all the
tweets written by him, and afterwards each tweet is represented in the new
reduced vocabulary, it is conformed a prototype formed by a centroid of all the
tweets.

3     Evaluation, Results and Discussion
The dataset distributed contains profiles for three classes: Gender, Location and
Occupation [1] and the difference with respect to MexA3T 2018 task is the
Gender class. Particularly, the Gender dataset is balanced for each class, female
and male, but the Location and Occupation dataset is unbalanced.
The evaluation was made using F-measure by class, accuracy and F-average in
a profile.
The row CerpamidUA-Gender-Text-run1 used as vocabulary the extraction of
1 percent of tokens from the vocabulary of each class and the representation
based on words extracted by a tokenizer. The row CerpamidUA-Gender-Text-
run2 considered 10 percent of tokens and the representation based on lemmas.
In Table 1, is illustrated the results obtained for gender classification.


                              Table 1. Gender results.

             Team                           F(P,R) Acc P R
             CerpamidUA-Gender-Text-run2    0.83 0.83 0.84 0.83
             CerpamidUA-Gender-Text-run1    0.83   0.83 0.83 0.83
             CIC-VCR-Secondary-Gender-Image 0.52   0.52 0.52 0.52
             CIC-VCR-Gender-Image           0.47   0.48 0.48 0.48


    The results obtained by run2 are similar than those of run1. In general the
results are good, due to the balanced scenarios in both classes male and female.
It is also important to notice that the representation based on lemma has less
dimension than the representation based on tokens and the proposal to obtain a
new vocabulary considering the TP, reduced the dimension dramaticaly obtain-
ing good results.
In Table 2, is illustrated the result obtained for Location classification.


                              Table 2. Location results.

Team                           F(P,R) Acc center southeast northwest north northeast west
CerpamidUA-Location-run2       0.50   0.63 0.68 0.38       0.66      0.16 0.70       0.28
CerpamidUA-Location-run1       0.48   0.61 0.70 0.39       0.66      0.25 0.72       0.26
CIC-VCR-Secondary-Gender-Image 0.14   0.23 0.41 0.04       0.09      0.02 0.20       0.10
CIC-VCR-Gender-Image           0.10   0.16 0.36 0.00       0.08      0.04 0.13       0.02


                                          505
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)


    The results for Location classification are not high. The results are modest ,
we suppose that this drop, can be caused by the unbalance of the datasets. The
majority classes get the best results, but the classes with few profiles achieved
worse values. The accuracy values reflect that the majority class classifies very
good its objects. The main problem is related to the vocabulary constructed,
because the class with few objects contributes less with new tokens corresponding
to it.
In Table 3, is illustrated the results obtained for Occupation classification, and
the analysis of the results reflects similar conclusions than those explained for
Location classification.

                            Table 3. Occupation results.

Team                           F(P,R) Acc others arts student social sciences sports admin health
CerpamidUA-Occupation-run2     0.39   0.65 0.10 0.25 0.85     0.56 0.19       0.30 0.51 0.25
CerpamidUA-Occupation-run1     0.38   0.66 0.13 0.33 0.86     0.55 0.20       0.35 0.47 0.24
CIC-VCR-Secondary-Gender-Image 0.11   0.26 0.00 0.09 0.44     0.13 0.06       0.00 0.21 0.00
CIC-VCR-Gender-Image           0.09   0.23 0.00 0.11 0.43     0.07 0.04       0.00 0.09 0.00


4   Conclusion and Future Work
In class with few document the results were low, determined by the scarce va-
riety of the words of these classes in the vocabulary generated using TP. It was
obtained very good results in the identification of gender, conditioned by the
balance between classes. The weight of the features should be evaluated consid-
ering the difference between dictionaries per class and the importance of each
word in the new reduced vocabulary.


References
 1. Aragón, M.E., Álvarez-Carmona, M.Á., Montes-y Gómez, M., Escalante, H.J., Vil-
    laseñor-Pineda, L., Moctezuma, D.: Overview of mex-a3t at iberlef 2019: Author-
    ship and aggressiveness analysis in mexican spanish tweets. In: Notebook Papers of
    1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Bilbao,
    Spain, September (2019)
 2. Aragón, M.E., López-Monroy, A.P.: Author profiling and aggressiveness detection
    in spanish tweets: Mex-a3t 2018. In: Proceedings of the Third Workshop on Eval-
    uation of Human Language Technologies for Iberian Languages (IberEval 2018)
    co-located with 34th Conference of the Spanish Society for Natural Language Pro-
    cessing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 134–139 (2018),
    http://ceur-ws.org/Vol-2150/MEX-A3T paper7.pdf
 3. Ángel Álvarez Carmona, M., Guzmán-Falcón, E., y Gómez, M.M., Escalante, H.J.,
    nor Pineda, L.V., Reyes-Meza, V., Sulayes, A.R.: Overview of mex-a3t at ibereval
    2018: Authorship and aggressiveness analysis in mexican spanish tweets. In: Pro-
    ceedings of the Third Workshop on Evaluation of Human Language Technolo-
    gies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the


                                          506
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)


    Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain,
    September 18th, 2018. pp. 74–96 (2018), http://ceur-ws.org/Vol-2150/overview-
    mex-a3t.pdf
 4. Eibe Frank, M.A.H., Witten, I.H.: The weka workbench. online appendix for ”data
    mining: Practical machine learning tools and techniques” (2016)
 5. Francisco Manuel, R.P., y Gómez, M.M., Potthast, M., Stein, B.: Overview of
    the 6th author profiling task at pan 2018: Cross-domain authorship attribution
    and style change detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L.
    (eds.) CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers, 10-14
    September, Avignon, France. CEUR-WS.org (sep 2018), http://ceur-ws.org/Vol-
    2125/
 6. Graff, M., Miranda-Jiménez, S., Tellez, E.S., Moctezuma, D., Salgado, V., Ortiz-
    Bejar, J., Sánchez, C.N.: Ingeotec at mex-a3t: Author profiling and aggressiveness
    analysis in twitter using µtc and evomsa. In: Proceedings of the Third Workshop
    on Evaluation of Human Language Technologies for Iberian Languages (IberEval
    2018) co-located with 34th Conference of the Spanish Society for Natural Lan-
    guage Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 128–133
    (2018), http://ceur-ws.org/Vol-2150/MEX-A3T paper6.pdf
 7. Jiménez-Salazar, H., Pinto, D., Rosso, P.: Uso del punto de tran-
    sición en la selección de términos ı́ndice para agrupamiento de
    textos     cortos.   Procesamiento      del  Lenguaje     Natural     35    (2005),
    http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/2991/1485
 8. Markov, I., Gómez-Adorno, H., Rosales, M.J., Sidorov, G.: Cic-gil approach to
    author profiling in spanish tweets: Location and occupation. In: Proceedings of
    the Third Workshop on Evaluation of Human Language Technologies for Iberian
    Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society
    for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th,
    2018. pp. 97–101 (2018), http://ceur-ws.org/Vol-2150/MEX-A3T paper1.pdf
 9. Ortega-Mendoza, R.M., López-Monroy, A.P.: The winning approach for author
    profiling of mexican users in twitter at mex.a3t@ibereval-2018. In: Proceedings of
    the Third Workshop on Evaluation of Human Language Technologies for Iberian
    Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society
    for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th,
    2018. pp. 140–148 (2018), http://ceur-ws.org/Vol-2150/MEX-A3T paper8.pdf
10. Padró, L., Stanilovsky, E.: Freeling 3.0: Towards wider multilinguality. In: Pro-
    ceedings of the Eighth International Conference on Language Resources and Eval-
    uation, LREC 2012, Istanbul, Turkey, May 23-25, 2012. pp. 2473–2479 (2012),
    http://www.lrec-conf.org/proceedings/lrec2012/summaries/430.html
11. Zipf, G.K.: Human behaviour and the principle of least effort. Addison-Wesley
    (1949)


                                          507

</pre>