<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CerpamidUA at MexA3T 2019: Transition Point Proposal</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Castro Castro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María Fernanda Artigas Herold</string-name>
          <email>maria.artigas@estudiantes.uo.edu.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reynier Ortega Bueno</string-name>
          <email>reynier.ortegag@cerpamid.co.cu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Muñoz</string-name>
          <email>rafael@dlsi.ua.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Pattern Recognition and Data Mining</institution>
          ,
          <country country="CU">Cuba</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Software and Computing Systems, Alicante University</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Oriente University</institution>
          ,
          <country country="CU">Cuba</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>502</fpage>
      <lpage>507</lpage>
      <abstract>
        <p>Author Profiling is an important field concerned with detecting demographic characteristics of users from the texts they write. Our main contribution is a method for determining a reduced subset of features that represent frequent lexical words for each profile of Mexican Twitter users. The subset is obtained from the frequency of words in a profile (e.g., students) by applying the theory of Transition Points. All objects are represented in the new feature space formed by the union of the reduced subsets computed for each class or profile. The classification phase was carried out using Support Vector Machines provided by the Weka platform. The results obtained were good for Gender, but more effort is needed for Location and Occupation: the main factor affecting those results is the unbalanced class distribution, which impacts the construction of the reduced vocabulary.</p>
      </abstract>
      <kwd-group>
        <kwd>Author Profiling</kwd>
        <kwd>Transition Point</kwd>
        <kwd>Mexican Twitter Profiling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Modern society is characterized by intensive use of digital technology,
in particular social network platforms, on which people express emotions,
ideas, news and more. Users share information through images, text, videos
and other resources. All the publicly available information of a user, in
particular text and images, can be used to determine demographic attributes
such as gender, age, personality and level of education, among others. This is
the key question studied in the field of Author Profiling (AP) analysis.</p>
      <p>
        The MexA3T task for Author Profiling and Aggressiveness analysis,
focused on Mexican tweets, was proposed in 2018 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The AP task comprises detecting the Place of Residence and Occupation
of a user from the set of tweets written by that user. As explained in the
overview [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], it was a challenging task, and for that reason the organizers
relaunched a similar task that also includes the analysis of Gender.
      </p>
      <p>
        An important difference between this year's task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the previous one is that a user profile is distributed not only
with the text of the tweets but also with images. This allows both text and
images to be used for profile classification, although participants are not
required to use both sources of information. The principal evaluation forum
for Authorship Analysis over several years has been the PAN Lab at CLEF, which
has evaluated the AP task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with respect to the identification of Gender, Personality, Age, etc.
      </p>
      <p>
        Four teams participated in the MexA3T 2018 AP task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Most of them used an approach based on SVM classification,
representing the text with character n-grams and lexical tokens as features.
The MXAA team [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was the top ranked on average; it used feature selection and term
weighting strategies that allowed it to achieve very good results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Proposal for MexA3T 2019</title>
      <p>
        Our main contribution is a method for determining a reduced subset of
features that represent frequent lexical words for each profile of Mexican
tweet writers. The subset is obtained from the frequency of words in a profile
(e.g., students) by using the theory of the Transition Point [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. All objects are represented in the new feature space formed by the
reduced subsets of every class or profile. The classification phase was
performed using Support Vector Machines provided by the Weka [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] platform with the default configuration.
      </p>
      <sec id="sec-2-1">
        <title>Transition Point</title>
        <p>
          The architecture for the dimensionality reduction of the vocabulary
based on the Transition Point method is illustrated in Figure 1.
        </p>
        <p>
          The Transition Point (TP) is a frequency value in the vocabulary that
delimits a frontier around which the terms are relevant to the class and have
a high presence in the objects of that class. It is based on the fundamentals
studied and proposed by Zipf [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], who formulated the law of word frequencies in a text, known as
Zipf's Law. We first build a vocabulary for each profile (e.g., one vocabulary
for the male profile and one for the female profile), and each term of a
vocabulary is associated with its frequency of occurrence in the tweets of the
corresponding profile. The TP is calculated for each profile vocabulary (Vp),
and a percentage of the tokens whose frequency is closest to the TP value is
then selected. The new vocabulary for a profile class (e.g., Gender) is formed
by the union of the tokens present in the reduced vocabularies obtained for
each profile.
        </p>
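        <p>The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not state the TP formula, so the sketch assumes the form commonly used in the Transition Point literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once. The function names, the rounding of the percentage and the alphabetical tie-breaking are hypothetical choices made only for this example.</p>
        <preformat>
```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed formula: TP = (sqrt(8 * I1 + 1) - 1) / 2, I1 = terms occurring once."""
    i1 = sum(1 for count in freqs.values() if count == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def reduced_vocabulary(freqs, percent):
    """Keep the given percentage of terms whose frequency is closest to the TP."""
    tp = transition_point(freqs)
    # At least one term is kept; the rounding rule is an illustrative assumption.
    keep = max(1, round(len(freqs) * percent / 100))
    # Rank by distance to the TP; ties are broken alphabetically for determinism.
    ranked = sorted(freqs, key=lambda term: (abs(freqs[term] - tp), term))
    return set(ranked[:keep])


def profile_vocabulary(tweets_by_profile, percent):
    """Union of the reduced vocabularies of every profile (e.g. female, male)."""
    vocabulary = set()
    for tweets in tweets_by_profile.values():
        # Each tweet is expected to be a list of tokens (or lemmas).
        freqs = Counter(token for tweet in tweets for token in tweet)
        vocabulary |= reduced_vocabulary(freqs, percent)
    return vocabulary
```
        </preformat>
        <p>Under these assumptions, a call such as profile_vocabulary(tweets_by_profile, 1) would correspond to keeping 1 percent of each profile vocabulary before taking the union.</p>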
      </sec>
      <sec id="sec-2-2">
        <title>Tweet representation</title>
        <p>
          A profile consists of several tweets written by a user. We consider
each tweet a document and represent it by the tokens extracted with a Natural
Language Processing tool (NLPt). We used the FreeLing [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] NLPt and built a first representation based on the tokens extracted
by its tokenizer. A second representation was built from the lemmas of the
tokens. In both representations, the features are weighted by a normalized
frequency of occurrence.
        </p>
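        <p>As an illustration of this representation, the sketch below maps an already tokenized (or lemmatized) tweet to a vector over the reduced vocabulary. The exact normalization is not specified in the text; dividing each count by the number of in-vocabulary tokens of the tweet is an assumption, and the function name tweet_vector is hypothetical. Tokenization by FreeLing is outside the scope of the example.</p>
        <preformat>
```python
from collections import Counter


def tweet_vector(tokens, vocabulary):
    """Map a tokenized tweet to normalized term frequencies over an ordered vocabulary."""
    counts = Counter(token for token in tokens if token in vocabulary)
    total = sum(counts.values())
    if total == 0:
        # No vocabulary term occurs in the tweet: return the zero vector.
        return [0.0 for _ in vocabulary]
    # Assumed normalization: relative frequency among in-vocabulary tokens.
    return [counts[term] / total for term in vocabulary]
```
        </preformat>
        <p>The vocabulary is passed as an ordered list so that every tweet vector uses the same feature positions.</p>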
      </sec>
      <sec id="sec-2-3">
        <title>Machine Learning Method</title>
        <p>The supervised classification phase is done using the SVM implemented
in the Weka platform with the default parameters. A user profile consists of
all the tweets written by that user. After each tweet is represented in the
new reduced vocabulary, the profile is represented by a prototype computed as
the centroid of all its tweet vectors.</p>
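        <p>The prototype construction can be sketched as a component-wise mean of the tweet vectors of a user. This is only an illustration under the assumption that all vectors share the same reduced vocabulary; the function name centroid is hypothetical, and the SVM training itself (done in Weka by the authors) is not reproduced here.</p>
        <preformat>
```python
def centroid(vectors):
    """Component-wise mean of equally sized tweet vectors."""
    if not vectors:
        raise ValueError("a profile needs at least one tweet vector")
    size = len(vectors)
    # zip(*vectors) iterates over feature columns across all tweets.
    return [sum(column) / size for column in zip(*vectors)]
```
        </preformat>
        <p>Each user is then a single feature vector, which is the object handed to the classifier.</p>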
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation, Results and Discussion</title>
      <p>
        The distributed dataset contains profiles for three traits: Gender,
Location and Occupation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; the difference with respect to the MexA3T 2018 task is the addition
of Gender. The Gender dataset is balanced across its two classes, female and
male, but the Location and Occupation datasets are unbalanced. The evaluation
was made using the F-measure by class, accuracy and the average F-measure for
a profile.
      </p>
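      <p>For reference, the measures named above can be computed as in the following sketch. It assumes that the average F-measure is the unweighted (macro) mean of the per-class F-measures, which the text does not state explicitly; the function name evaluate and the label values are hypothetical.</p>
      <preformat>
```python
def evaluate(gold, predicted):
    """Return (per-class F-measure, accuracy, macro-averaged F-measure)."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    f_by_class = {}
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        # F-measure is the harmonic mean of precision and recall.
        f_by_class[label] = 2 * precision * recall / total if total else 0.0
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    # Assumption: the average F is the unweighted mean over classes.
    macro_f = sum(f_by_class.values()) / len(labels)
    return f_by_class, accuracy, macro_f
```
      </preformat>
      <p>With an unweighted average, a class with few profiles weighs as much as a majority class, so poor results on minority classes lower the average F-measure even when accuracy stays high.</p>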
      <p>The run CerpamidUA-Gender-Text-run1 built its vocabulary by extracting
1 percent of the tokens from the vocabulary of each class and used the
representation based on words extracted by a tokenizer. The run
CerpamidUA-Gender-Text-run2 used 10 percent of the tokens and the
representation based on lemmas. Table 1 shows the results obtained for gender
classification.</p>
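      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Results obtained for gender classification.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Team</th><th>F-measure</th><th>Accuracy</th><th>Precision</th><th>Recall</th></tr>
          </thead>
          <tbody>
            <tr><td>CerpamidUA-Gender-Text-run2</td><td>0.83</td><td>0.83</td><td>0.84</td><td>0.83</td></tr>
            <tr><td>CerpamidUA-Gender-Text-run1</td><td>0.83</td><td>0.83</td><td>0.83</td><td>0.83</td></tr>
            <tr><td>CIC-VCR-Secondary-Gender-Image</td><td>0.52</td><td>0.52</td><td>0.52</td><td>0.52</td></tr>
            <tr><td>CIC-VCR-Gender-Image</td><td>0.47</td><td>0.48</td><td>0.48</td><td>0.48</td></tr>
          </tbody>
        </table>
      </table-wrap>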
      <p>The results obtained by run2 are similar to those of run1. In general
the results are good, owing to the balanced scenario between the two classes,
male and female. It is also important to note that the lemma-based
representation has a lower dimension than the token-based representation, and
that building the new vocabulary with the TP reduced the dimension
dramatically while still obtaining good results.</p>
      <p>Table 2 shows the results obtained for Location classification.</p>
      <p>The results for Location classification are modest. We suppose that
this drop is caused by the imbalance of the dataset: the majority classes
obtain the best results, whereas the classes with few profiles achieve worse
values. The accuracy values reflect that the objects of the majority class are
classified very well. The main problem is related to the constructed
vocabulary, because a class with few objects contributes fewer tokens of its
own to it.</p>
      <p>Table 3 shows the results obtained for Occupation classification; their
analysis leads to conclusions similar to those explained for Location
classification.</p>
      <p>In classes with few documents the results were low, owing to the scarce
variety of the words of these classes in the vocabulary generated using TP.
Very good results were obtained in the identification of gender, favoured by
the balance between classes. The weight of the features should be evaluated
considering the difference between the per-class dictionaries and the
importance of each word in the new reduced vocabulary.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Aragón</surname>, <given-names>M.E.</given-names></string-name>,
          <string-name><surname>Álvarez-Carmona</surname>, <given-names>M.A.</given-names></string-name>,
          <string-name><surname>Montes-y-Gómez</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Escalante</surname>, <given-names>H.J.</given-names></string-name>,
          <string-name><surname>Villaseñor-Pineda</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Moctezuma</surname>, <given-names>D.</given-names></string-name>:
          <article-title>Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>.
          In: <source>Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>,
          Bilbao, Spain, September (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Aragón</surname>, <given-names>M.E.</given-names></string-name>,
          <string-name><surname>López-Monroy</surname>, <given-names>A.P.</given-names></string-name>:
          <article-title>Author profiling and aggressiveness detection in Spanish tweets: MEX-A3T 2018</article-title>.
          In: <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>,
          Sevilla, Spain, September 18th, 2018. pp. <fpage>134</fpage>–<lpage>139</lpage> (<year>2018</year>),
          <uri>http://ceur-ws.org/Vol-2150/MEX-A3T_paper7.pdf</uri>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Álvarez-Carmona</surname>, <given-names>M.Á.</given-names></string-name>,
          <string-name><surname>Guzmán-Falcón</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Montes-y-Gómez</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Escalante</surname>, <given-names>H.J.</given-names></string-name>,
          <string-name><surname>Villaseñor-Pineda</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Reyes-Meza</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Rico-Sulayes</surname>, <given-names>A.</given-names></string-name>:
          <article-title>Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>.
          In: <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>,
          Sevilla, Spain, September 18th, 2018. pp. <fpage>74</fpage>–<lpage>96</lpage> (<year>2018</year>),
          <uri>http://ceur-ws.org/Vol-2150/overviewmex-a3t.pdf</uri>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Frank</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Hall</surname>, <given-names>M.A.</given-names></string-name>,
          <string-name><surname>Witten</surname>, <given-names>I.H.</given-names></string-name>:
          <source>The WEKA workbench. Online appendix for "Data mining: Practical machine learning tools and techniques"</source>
          (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Rangel Pardo</surname>, <given-names>F.M.</given-names></string-name>,
          <string-name><surname>Montes-y-Gómez</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Potthast</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Stein</surname>, <given-names>B.</given-names></string-name>:
          <article-title>Overview of the 6th author profiling task at PAN 2018: Cross-domain authorship attribution and style change detection</article-title>.
          In: <string-name><surname>Cappellato</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Ferro</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Nie</surname>, <given-names>J.Y.</given-names></string-name>,
          <string-name><surname>Soulier</surname>, <given-names>L.</given-names></string-name> (eds.)
          <source>CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers</source>,
          10-14 September, Avignon, France. CEUR-WS.org (Sep <year>2018</year>),
          <uri>http://ceur-ws.org/Vol-2125/</uri>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Graff</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Miranda-Jiménez</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Téllez</surname>, <given-names>E.S.</given-names></string-name>,
          <string-name><surname>Moctezuma</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Salgado</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Ortiz-Bejar</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Sánchez</surname>, <given-names>C.N.</given-names></string-name>:
          <article-title>INGEOTEC at MEX-A3T: Author profiling and aggressiveness analysis in Twitter using µTC and EvoMSA</article-title>.
          In: <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>,
          Sevilla, Spain, September 18th, 2018. pp. <fpage>128</fpage>–<lpage>133</lpage> (<year>2018</year>),
          <uri>http://ceur-ws.org/Vol-2150/MEX-A3T_paper6.pdf</uri>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Jiménez-Salazar</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Pinto</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Rosso</surname>, <given-names>P.</given-names></string-name>:
          <article-title>Uso del punto de transición en la selección de términos índice para agrupamiento de textos cortos</article-title>.
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>35</volume> (<year>2005</year>),
          <uri>http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/2991/1485</uri>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Markov</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Gómez-Adorno</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Rosales</surname>, <given-names>M.J.</given-names></string-name>,
          <string-name><surname>Sidorov</surname>, <given-names>G.</given-names></string-name>:
          <article-title>CIC-GIL approach to author profiling in Spanish tweets: Location and occupation</article-title>.
          In: <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>,
          Sevilla, Spain, September 18th, 2018. pp. <fpage>97</fpage>–<lpage>101</lpage> (<year>2018</year>),
          <uri>http://ceur-ws.org/Vol-2150/MEX-A3T_paper1.pdf</uri>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Ortega-Mendoza</surname>, <given-names>R.M.</given-names></string-name>,
          <string-name><surname>López-Monroy</surname>, <given-names>A.P.</given-names></string-name>:
          <article-title>The winning approach for author profiling of Mexican users in Twitter at MEX.A3T@IberEval-2018</article-title>.
          In: <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>,
          Sevilla, Spain, September 18th, 2018. pp. <fpage>140</fpage>–<lpage>148</lpage> (<year>2018</year>),
          <uri>http://ceur-ws.org/Vol-2150/MEX-A3T_paper8.pdf</uri>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Padró</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Stanilovsky</surname>, <given-names>E.</given-names></string-name>:
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>.
          In: <source>Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012</source>,
          Istanbul, Turkey, May 23-25, 2012. pp. <fpage>2473</fpage>–<lpage>2479</lpage> (<year>2012</year>),
          <uri>http://www.lrec-conf.org/proceedings/lrec2012/summaries/430.html</uri>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Zipf</surname>, <given-names>G.K.</given-names></string-name>:
          <source>Human behaviour and the principle of least effort</source>.
          <publisher-name>Addison-Wesley</publisher-name> (<year>1949</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>