<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Profiling, instance-based Similarity Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>yaritza.adame@datys.cu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>reynier.ortega@cerpamid.co.cu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>daniel.castro@cerpamid.co.cu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante</institution>
          ,
          <country>España</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Desarrollo de Aplicaciones</institution>
          ,
          <addr-line>Tecnología y Sistemas DATYS</addr-line>
          ,
          <country country="CU">Cuba</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yaritza Adame-Arcia</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>English (Australia</institution>
          ,
          <country country="CA">Canada</country>
          ,
          <addr-line>Great Britain</addr-line>
          ,
          <country country="IE">Ireland</country>
          ,
          <addr-line>New Zealand, United States)  Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)  Portuguese (Brazil, Portugal)  Arabic, Egypt, Gulf, Levantine, Maghrebi</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>In the analysis of digital documents for forensic applications, anonymous documents are often presented for which the available tools cannot determine the true author. In such cases, methods that identify the characteristics of the author profile (gender, age, personality, etc.) are of vital importance. We propose a simple classification method based on the similarity between objects, considering different features for document representation (a document corresponds to the set of tweets of a user): the terms used in the tweets, as well as the characteristics of opinion and subjectivity present in them. Our goal is to classify, based on the content of the tweets, the gender and language variety of an author from a previously unseen set of her/his tweets. In the experiments we observed good results in gender classification, but low values in language variety classification. We processed only the English dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Author profiling</kwd>
        <kwd>instance-based classification</kwd>
        <kwd>tweet gender classification</kwd>
        <kwd>tweet language variety classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The PAN Author Profiling task for this edition is described as follows: "Gender and language variety
identification in Twitter. Demographic traits such as gender and language have so far
been investigated separately. In this task we will provide participants with a corpus annotated
with authors' gender and their specific variation of their native language:
English (Australia, Canada, Great Britain, Ireland, New Zealand, United States),
Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela),
Portuguese (Brazil, Portugal), Arabic (Egypt, Gulf, Levantine, Maghrebi).
Although we suggest participating in both subtasks (gender and language
identification) and in all languages, it is possible to participate in only one of them and in some
of the languages."</p>
      <p>Identifying these demographic traits in tweets means that the natural
language processing tools widely used for the analysis of long documents must be adapted to
the features of this textual genre and to the writing characteristics of tweets. The main
difficulty is that this genre follows no linguistic rules or writing standards:
the language is informal, usually direct and full of emotions.</p>
      <p>
        In past demographic trait identification tasks of the PAN evaluation framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the tweet genre was used, and many of the presented works relied on lexical content (words,
informal text, jargon) and on characteristic features of the genre (URLs, hashtags, mentions,
retweets, emoticons, etc.). Most proposals use the classic Bag of Words
representation of documents and, in addition to the features mentioned, employ n-grams
of some of them, for example word n-grams, lemma n-grams and POS (Part of
Speech) tag n-grams. The fundamental difference between this year's task and
previous ones lies in evaluating and classifying by language variety.
      </p>
      <p>For the classification process, a large number of competitors have used decision
tree-based approaches and SVMs, while a few others have used
distance-based approaches to predict the closest class [14][15].</p>
      <p>
        We are interested in implementing a distance-based classification strategy that
builds on our previous results presented in the 2015 Author Identification edition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We
combine features of the lexical content of the tweets, their characteristic features,
and the polarity and emotion features that our group used in previous work on
sentence polarity classification. We experimentally evaluate the differences between
an instance-based proposal and a prototype-based proposal within the same distance-based
strategy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Implemented methods</title>
      <p>We used two classification strategies, each based on a different document
representation. In the instance-based representation, the set of tweets of an
author (for each training author, her/his gender and language variety are available) forms one
document, so that each class (for example, the female class and the male class) has a set of
documents. In the prototype-based representation, a single
document is formed for each class, built from all the tweets of
all the sample authors of that class.</p>
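      <p>The two representations can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term counts, which are simplifications made here for illustration; the function names and the toy authors are hypothetical and are not part of our system.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def build_document(tweets):
    """Bag of Words for one author: term counts over all of the author's tweets."""
    return Counter(token for tweet in tweets for token in tweet.lower().split())


def instance_representation(authors):
    """Instance-based: one document per author, grouped by class label."""
    by_class = defaultdict(list)
    for tweets, label in authors:
        by_class[label].append(build_document(tweets))
    return dict(by_class)


def prototype_representation(authors):
    """Prototype-based: a single merged document per class."""
    prototypes = defaultdict(Counter)
    for tweets, label in authors:
        prototypes[label].update(build_document(tweets))
    return dict(prototypes)


# Toy data (hypothetical): each entry is (tweets of one author, gender label).
authors = [
    (["Good morning all", "coffee time"], "female"),
    (["match tonight", "good game"], "male"),
    (["new shoes today", "good coffee"], "female"),
]
instances = instance_representation(authors)
prototypes = prototype_representation(authors)
print(len(instances["female"]), prototypes["female"]["coffee"])
```
      </preformat>
      <p>In this sketch the female class keeps two separate documents under the instance-based view, while the prototype-based view merges them into one set of counts.</p>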
      <p>Figure 1 shows graphically the architecture of our proposal with the instance-based
strategy.</p>
      <sec id="sec-2-1">
        <title>Features and tweets pre-processing stage</title>
        <p>The first step is to build the documents that will be used as objects for the
similarity calculation in the classification method. For each author, we receive the set
of tweets that she/he wrote, and we concatenate these tweets to form a
document for this author. Recall that for each training author we know the gender
and the language variety. We pre-process the document in two stages.
In the first stage, we segment the tweets with a tokenizer offered in FreeLing [13]
[http://nlp.lsi.upc.edu/freeling/], specialized for the processing of tweets. We then
expand abbreviated terms and contractions, and replace characteristic
traits of tweets such as hashtags, URLs and mentions with
fixed patterns, since we consider that the content of these traits does not help to
differentiate between tweets of different profiles. After these transformations the
tweets are partially normalized, and we next perform a syntactic analysis with
traditional POS-tagging tools for English or Spanish, according to the language of the
tweets.</p>
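        <p>A minimal sketch of the normalization step follows. The placeholder tokens, the regular expressions and the small contraction table are assumptions made for illustration, and whitespace splitting stands in for the FreeLing tweet tokenizer; none of these are the actual patterns or resources of our system.</p>
        <preformat>
```python
import re

# Hypothetical patterns and placeholders; the actual fixed patterns are not listed here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Hypothetical toy table of short forms and contractions.
CONTRACTIONS = {"can't": "can not", "i'm": "i am", "u": "you"}


def normalize_tweet(text):
    """Replace URLs, mentions and hashtags with fixed tokens, then expand short forms."""
    text = URL_RE.sub("URL", text)          # URLs first, so fragments are not read as hashtags
    text = MENTION_RE.sub("MENTION", text)
    text = HASHTAG_RE.sub("HASHTAG", text)
    tokens = [CONTRACTIONS.get(token.lower(), token) for token in text.split()]
    return " ".join(tokens)


example = normalize_tweet("@bob I'm sure u can't miss #pan17 http://t.co/abc")
print(example)
```
        </preformat>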
        <p>
          For the representation we use the classic Bag of Words, in which we integrate:
          <list list-type="bullet">
            <list-item><p>The lexical terms, their lemmas and their grammatical categories.</p></list-item>
            <list-item><p>Characteristic features of the tweets.</p></list-item>
            <list-item><p>Features from subjectivity and opinion mining analysis [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].</p></list-item>
          </list>
        </p>
        <p>With the lexical terms and lemmas, we expect to differentiate the documents of each
class, because some of these features are specific to one class. For example, for language
variety, some terms are used by Colombians but not by other speakers, and similarly for each
variety of Spanish. The frequency of use of grammatical categories should
allow us to differentiate between tweets written by men and those written
by women. For example, [17] reports several differences in the use
of words and parts of speech between the writing styles of women and men.</p>
        <p>The characteristic features we extract from the tweets are hashtags, author
mentions, retweet marks, URLs, intensification (capital
letters, words lengthened by repeated characters, exclamation marks),
laughter expressions, emoticons and informal language. For each
of these traits we also consider the position in which it is used, that is, the number of
times it appears at the beginning of the tweet, at the end, or elsewhere.</p>
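        <p>The positional counting of these traits can be sketched as follows. The trait patterns, the feature names and the whitespace tokenization are illustrative assumptions; only four of the traits are covered to keep the example short.</p>
        <preformat>
```python
import re
from collections import Counter

# Hypothetical trait patterns, matched against whole whitespace-separated tokens.
TRAITS = {
    "hashtag": re.compile(r"#\w+"),
    "mention": re.compile(r"@\w+"),
    "url": re.compile(r"https?://\S+"),
    "retweet": re.compile(r"RT"),
}


def positional_trait_counts(tweets):
    """Count each trait separately at the beginning, at the end and elsewhere in a tweet."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        last = len(tokens) - 1
        for index, token in enumerate(tokens):
            for name, pattern in TRAITS.items():
                if pattern.fullmatch(token):
                    if index == 0:
                        position = "begin"
                    elif index == last:
                        position = "end"
                    else:
                        position = "middle"
                    counts[f"{name}_{position}"] += 1
    return counts


trait_counts = positional_trait_counts([
    "RT @ana great talk #nlp",
    "see http://t.co/x now #pan17",
])
print(dict(trait_counts))
```
        </preformat>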
        <p>
          Additionally, we include the frequency of features carrying subjective
information, for example the number of positive or negative emoticons. The words used
were categorized as Positive (P), High Positive (HP), Negative (N) and High Negative
(HN), and the frequencies of these categories are used as features. We used a word polarity resource for
Spanish and English taken from [12], emotion resources for Spanish [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] [11] and for
English [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and finally appraisal resources for Spanish [9] [10] and English [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
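        <p>As an illustration of the polarity category frequencies, the sketch below uses a tiny hypothetical lexicon in place of the resources cited above. Normalizing by the number of polar words is also an assumption; raw counts could be used instead.</p>
        <preformat>
```python
from collections import Counter

# Hypothetical toy lexicon; the cited resources are far larger.
POLARITY_LEXICON = {"good": "P", "love": "HP", "bad": "N", "hate": "HN"}
CATEGORIES = ("P", "HP", "N", "HN")


def polarity_frequencies(tokens):
    """Relative frequency of each polarity category among the polar words of a document."""
    counts = Counter(POLARITY_LEXICON[t] for t in tokens if t in POLARITY_LEXICON)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


frequencies = polarity_frequencies(
    "i love this good good phone but hate the bad battery".split()
)
print(frequencies)
```
        </preformat>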
      </sec>
      <sec id="sec-2-2">
        <title>Classification stage</title>
        <p>To classify the set of tweets of an author by the demographic traits of
gender and language variety, we tried two strategies. In the first, each
document (the set of tweets of an author) is used as an instance of the class to which it belongs;
in the second, we construct a prototype of each class from the features
extracted from the set of documents belonging to that class. Each strategy was
evaluated on the tweet collections of the training set, and the one that showed the more
stable results across different runs was selected for the final evaluation.</p>
        <p>
          In the instance-based strategy, the similarity of the new document
to each sample document of a class is calculated, and the average similarity
to the class is then computed. This is done for each class of a demographic trait,
and the new document is assigned to the class with which it obtains the greatest average
similarity. In the prototype-based strategy, the similarity of the new document to
each class prototype is calculated, and the document is
assigned to the class whose prototype yields the highest similarity (1-NN
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ][16]).
        </p>
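        <p>Both decision rules can be sketched in a few lines. Cosine similarity over term counts is used here purely as an assumption, since the similarity function is not fixed in this description; the function names and toy documents are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed similarity function)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_instance_based(doc, instances):
    """Assign the class with the greatest average similarity to its sample documents."""
    averages = {
        label: sum(cosine(doc, d) for d in docs) / len(docs)
        for label, docs in instances.items()
    }
    return max(averages, key=averages.get)


def classify_prototype_based(doc, prototypes):
    """Assign the class whose single prototype is most similar to the document."""
    return max(prototypes, key=lambda label: cosine(doc, prototypes[label]))


# Toy data (hypothetical).
instances = {
    "female": [Counter(coffee=2, good=1), Counter(coffee=1, shoes=1)],
    "male": [Counter(match=2, good=1), Counter(match=1, game=1)],
}
prototypes = {label: sum(docs, Counter()) for label, docs in instances.items()}
new_doc = Counter(coffee=1, good=1)
print(classify_instance_based(new_doc, instances))
print(classify_prototype_based(new_doc, prototypes))
```
        </preformat>
        <p>The instance-based rule averages over all sample documents of a class, so a few atypical authors affect it less than they would a plain nearest-neighbour rule, while the prototype-based rule needs only one comparison per class.</p>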
        <p>The classification is done independently for each author demographic trait: gender
(2 classes) and language variety (6 classes for English and 7 classes for
Spanish). The final result is the combination of these two classifications.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <p>
        The initial experiments were performed with the training collection released for the
2017 task. We evaluated the accuracy obtained with 2-fold cross-validation.
In addition, we considered the training collection of the 2015 edition for the
gender and age classes. These collections are described in [18]
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Table 1 reports the values obtained in the tests with the two representation
strategies, the instance-based one and the prototype (centroid) based one, using the 2017
collection. Table 2 presents the results with the 2015 collection.
      </p>
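      <p>The evaluation protocol can be sketched as follows. The random shuffling, the fixed seed and the majority-label predictor are assumptions introduced only to make the example self-contained; any classifier with the same interface could be plugged in.</p>
      <preformat>
```python
import random


def two_fold_accuracy(samples, train_and_predict, seed=0):
    """Mean accuracy over two folds; samples is a list of (document, label) pairs."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    folds = [shuffled[:half], shuffled[half:]]
    accuracies = []
    for i in (0, 1):
        test, train = folds[i], folds[1 - i]
        predictions = train_and_predict(train, [doc for doc, _ in test])
        correct = sum(p == label for p, (_, label) in zip(predictions, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / 2


def majority_label(train, docs):
    """Hypothetical stand-in classifier: always predicts the most frequent training label."""
    labels = [label for _, label in train]
    most_common = max(set(labels), key=labels.count)
    return [most_common for _ in docs]


samples = [("doc%d" % i, "female" if i % 3 else "male") for i in range(12)]
print(two_fold_accuracy(samples, majority_label))
```
      </preformat>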
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>A representation that considers the terms used in tweets is able to differentiate, to a
large extent, the sets of tweets written by authors of different genders. The proposed
subjectivity and opinion features improve the classification, but not
substantially. In the evaluation we made with the 2015 collections, we
verified that each set of features separately allows good identification of
gender and that their combination increases the values obtained. Language variety
classification maintains low results; to a great extent this is due to the small
differences observed between some of these classes and to the fact that many of the terms used by the
authors are universal and standardized in the community.</p>
      <p>We achieved the lowest values of all the teams; considering that a baseline method
using the 1000 most frequent terms in a Bag of Words representation obtained better results,
we need to carry out an exhaustive evaluation of our method.</p>
      <p>We must work on feature selection strategies and on the analysis of the objects that
are representative of each class. We also propose to evaluate classification with rejection or
abstention for those users whose tweets do not contain features characteristic of their
class, for example for language variety, so that possible misclassifications are not
penalized so heavily.</p>
      <p>9. Hernández, L., López-Lopez, A., &amp; Medina-Pagola, J. E. (2009). Recognizing Polarity and
Attitude of Words in Text. In Proc. of the 14th Portuguese Conference on Artificial
Intelligence (EPIA’2009) (pp. 525–536). Aveiro, Portugal.</p>
      <p>10. Hernández, L., López-Lopez, A., &amp; Pagola, J. E. M. (2011). Classification of Attitude Words
for Opinions Mining. International Journal of Computational Linguistics and Applications,
2(1–2), 267–283.</p>
      <p>11. Ismael Díaz Rangel, Grigori Sidorov, Sergio Suárez-Guerra. Creación y evaluación de un
diccionario marcado con emociones y ponderado para el español. Onomazein, 29, 23 p.,
2014, DOI 10.7764/onomazein.29.5</p>
      <p>12. Jose Manuel Yero Moreno, Reynier Ortega Bueno. Método no supervisado para la
clasificación de polaridad en Twitter. VII Conferencia Internacional de Ingeniería Eléctrica, pp.
1-4. Jun, 2014. ISBN: 978-959-207-529-0.</p>
      <p>13. Lluís Padró, Evgeny Stanilovsky. FreeLing 3.0: Towards Wider Multilinguality. Proceedings
of the Language Resources and Evaluation Conference (LREC 2012), ELRA. Istanbul,
Turkey. May, 2012.</p>
      <p>14. Mirco Kocher, Jacques Savoy: UniNE at CLEF 2016: Author Profiling. CLEF (Working
Notes) 2016: 903-911</p>
      <p>15. Maria José Garciarena Ucelay, Maria Paula Villegas, Dario G. Funez, Leticia C. Cagnina,
Marcelo Luis Errecalde, Gabriela Ramírez-de-la-Rosa, Esaú Villatoro-Tello: Profile-based
Approach for Age and Gender Identification. CLEF (Working Notes) 2016: 864-873</p>
      <p>16. Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval,
Volume 1, Issue 3, March 2008.</p>
      <p>17. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC
2001. Mahwah: Lawrence Erlbaum Associates 71 (2001)</p>
      <p>18. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at
PAN 2017: Gender and Language Variety Identification in Twitter. In: Working Notes
Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and
CEUR-WS.org (Sep 2017)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bloom</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Extracting Appraisal Expressions</article-title>
          .
          <source>In Proceedings of NAACL HLT 2007</source>
          .
          <publisher-loc>Rochester, NY</publisher-loc>
          :
          <publisher-name>Association for Computational Linguistics</publisher-name>
          . pp.
          <fpage>308</fpage>
          -
          <lpage>315</lpage>
          .
          <year>2007</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Strapparava</surname>
          </string-name>
          , Valitutti Ro.
          <article-title>WordNet-Affect: an Affective Extension of WordNet</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Language Resources and Evaluation</source>
          .
          <year>2004</year>
          .
          <fpage>1083</fpage>
          --
          <lpage>1086</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Castro</surname>
          </string-name>
          , Yaritza Adame, María Peláez Brioso, Rafael Muñoz:
          <article-title>Authorship Verification, combining Linguistic Features and Different Similarity Functions</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          , Volume
          <volume>60</volume>
          , Issue
          <issue>3</issue>
          , pages
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          , March
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Francisco Manuel Rangel Pardo, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, Benno Stein:
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2016</year>
          :
          <fpage>750</fpage>
          -
          <lpage>784</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Francisco M. Rangel Pardo, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans:
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Francisco Rangel, Paolo Rosso.
          <article-title>On the Impact of Emotions on Author Profiling</article-title>
          .
          <source>In: Information Processing &amp; Management</source>
          , vol.
          <volume>52</volume>
          , issue 1, pp.
          <fpage>73</fpage>
          -
          <lpage>92</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Grigori</given-names>
            <surname>Sidorov</surname>
          </string-name>
          , Sabino Miranda-Jiménez, Francisco Viveros-Jiménez, Alexander Gelbukh, Noé Castro-Sánchez, Francisco Velásquez, Ismael Díaz-Rangel, Sergio Suárez-Guerra, Alejandro Treviño, and Juan Gordon.
          <article-title>Empirical Study of Opinion Mining in Spanish Tweets</article-title>
          .
          <source>LNAI 7629</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>