Author Profiling, instance-based Similarity
                     Classification
                        Notebook for PAN at CLEF 2017


    Yaritza Adame-Arcia1, Daniel Castro-Castro1, Reynier Ortega Bueno1, Rafael Muñoz2
              1Desarrollo de Aplicaciones, Tecnología y Sistemas DATYS, Cuba


                   yaritza.adame@datys.cu, {reynier.ortega,
                         daniel.castro}@cerpamid.co.cu
     2Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, España


                                     rafael@dlsi.ua.es


        Abstract. In the forensic analysis of digital documents, when an anonymous
        document is presented and the available tools cannot determine its true author,
        methods that identify the characteristics of the author profile (gender, age,
        personality, etc.) are of vital importance. We propose a simple classification
        method based on the similarity between objects, considering different features
        for document representation (a document corresponds to the set of tweets of a
        user): the terms used in the tweets, as well as the characteristics of opinion
        and subjectivity present in them. Our goal is to classify, based on the content
        of the tweets, the gender and language variety of an author from an unknown set
        of his or her tweets. In the experiments we observed good results in gender
        classification, but low values in language variety classification. We processed
        only the English dataset.

        Keywords: Author profiling, instance-based classification, gender
        classification of tweets, language variety classification of tweets


1       Introduction

The PAN Profiling task for this edition is as follows: "Gender and language variety
identification in Twitter. Demographics traits such as gender and language have so far
investigated separately. In this task we will have participants with a corpus annotated
with authors' gender and their specific variation of their native language:

• English (Australia, Canada, Great Britain, Ireland, New Zealand, United States)
• Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)
• Portuguese (Brazil, Portugal)
• Arabic (Egypt, Gulf, Levantine, Maghrebi)

Although we suggest to participate in both subtasks (gender and language identification)
and in all languages, it is possible participating only in one of them and in some
of the languages."
   Identifying these demographic traits in tweets implies that the natural language
processing tools widely used to analyze long documents must be adapted to the features
of this textual genre and to the writing characteristics found in tweets. The
complexity lies in the fact that this genre follows no linguistic rules or writing
standards: the language is informal, usually direct and full of emotions.
   In past demographic trait identification tasks in the PAN evaluation framework [5]
[6], the tweet genre was used, and many of the works presented relied on lexical
content (words, informal text, jargon) and on characteristic features of the genre
(URLs, hashtags, mentions, retweets, emoticons, etc.). Most proposals use the classic
Bag of Words representation of documents and employ, in addition to the features
mentioned, n-grams of some of them, for example word n-grams, lemma n-grams and
part-of-speech (POS) tag n-grams. The fundamental difference between this year's task
and previous ones lies in evaluating and classifying by language variety.
   For the classification process, a large number of competitors have used decision
tree-based approaches and SVMs, and a few others have used distance-based approaches
to predict the closest class [14][15].
   We are interested in implementing a distance-based classification strategy and, with
it, reusing previous results presented in the 2015 edition of the Author Identification
task [3]. We combine features of the lexical content of the tweets, the characteristic
features of tweets, and polarity and emotion features that our group used in previous
work on sentence polarity classification. We experimentally evaluate the differences
between an instance-based proposal and a prototype-based proposal within the same
distance-based strategy.


2      Implemented methods

We used two classification strategies, each with its own variant of document
representation. The first is an instance-based representation, where the set of tweets
of an author (for each author, the gender and language variety are available) forms a
document, so that for each class (for example, the female class and the male class) we
have a set of documents. The second variant is a prototype-based representation, where
a single document is formed for each class, built from all the tweets of all the sample
authors of that class.
   Figure 1 shows the architecture of our proposal with the instance-based strategy.
      Fig. 1. Architecture proposed for author profiling. Instance based classification strategy.
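
   The two representations described above can be sketched as follows. This is only a
minimal illustration, not the implementation used in the experiments: the record layout
(`gender`, `tweets`), the function names and the toy data are assumptions introduced
here.

```python
from collections import defaultdict

def build_instance_documents(authors):
    """Instance-based: one (label, document) pair per author."""
    return [(a["gender"], " ".join(a["tweets"])) for a in authors]

def build_prototype_documents(authors):
    """Prototype-based: one document per class, made of all tweets of that class."""
    by_class = defaultdict(list)
    for a in authors:
        by_class[a["gender"]].extend(a["tweets"])
    return {label: " ".join(tweets) for label, tweets in by_class.items()}

# Toy data, invented for illustration only.
authors = [
    {"gender": "female", "tweets": ["good morning", "love this song"]},
    {"gender": "male", "tweets": ["match tonight"]},
    {"gender": "female", "tweets": ["coffee time"]},
]
instances = build_instance_documents(authors)
prototypes = build_prototype_documents(authors)
```

   With this layout, the instance-based variant keeps three documents (one per author),
while the prototype-based variant keeps two (one per gender class).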


2.1       Features and tweets pre-processing stage:
   The first step is to build the documents that will be used as objects for the
similarity calculation in the classification method. For each author, we receive the
set of tweets that she or he wrote, and the concatenation of these tweets forms the
document for this author. Recall that, for each author, we know the gender and the
language variety. We pre-process the document in two stages. In the first stage, we
segment the tweets with a tokenizer offered in FreeLing [13]
[http://nlp.lsi.upc.edu/freeling/], specialized for the processing of tweets.
Subsequently, we expand abbreviated terms and contractions, and we replace
characteristic tweet traits such as hashtags, URLs and mentions with fixed patterns,
since we consider that the content of these traits does not help to differentiate
between tweets of different profiles. After these transformations the tweets are
partially normalized, and we then perform a syntactic analysis with traditional
POS-tagging tools for English or Spanish, according to the language of the tweets.
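
   The normalization step can be illustrated with a short sketch. The placeholder
tokens (`<URL>`, `<MENTION>`, `<HASHTAG>`), the regular expressions and the tiny
expansion table are assumptions made here for illustration; the actual patterns and
resources used are not listed in this paper, and tokenization was done with FreeLing
rather than with whitespace splitting.

```python
import re

# Hypothetical placeholder patterns; the paper only says "fixed patterns".
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Tiny illustrative expansion table; a real resource would be much larger.
EXPANSIONS = {"u": "you", "can't": "cannot", "i'm": "i am"}

def normalize_tweet(text):
    """Replace URLs, mentions and hashtags, then expand short forms."""
    text = URL_RE.sub("<URL>", text)          # URLs first, so '#' or '@' inside them is kept out
    text = MENTION_RE.sub("<MENTION>", text)
    text = HASHTAG_RE.sub("<HASHTAG>", text)
    tokens = [EXPANSIONS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens)

normalized = normalize_tweet("@bob u can't miss this #PAN17 http://t.co/abc")
print(normalized)  # <MENTION> you cannot miss this <HASHTAG> <URL>
```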
   For the representation we use the classic Bag of Words, into which we integrate:

• The lexical terms, their lemmas and their grammatical categories.
• Characteristic features of the tweets.
• Features from subjectivity and opinion mining analysis [7].
    With the lexical terms and lemmas, we hope to differentiate the documents of each
class, because some of these features are specific to their class. For example, for
language variety, some terms are used by Colombians but not by the rest, and similarly
for each variety of Spanish. Considering the frequency of use of grammatical categories
should allow us to differentiate between tweets written by men and those written by
women. For example, [17] reports several differences in the use of words and parts of
speech when analyzing the writing styles of women and men.
    The characteristic features of the tweets that we extract are hashtags, author
mentions, the retweet mark, the use of URLs, the use of intensification (capital
letters, deformation of words by repetition of characters, use of exclamation marks),
the use of laughter expressions, the use of emoticons and the use of informal language.
For each of these traits we consider the position in which it is used, that is, the
number of times it appears at the beginning of the tweet, at the end or elsewhere.
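
    The positional counts described above could be computed along these lines. The
sketch covers only four traits with deliberately simplified detection rules; the
function names, the rules and the example tweets are assumptions for illustration.

```python
from collections import Counter

def trait_of(token):
    """Map a token to a tweet trait name, or None (simplified, hypothetical rules)."""
    if token.startswith("#"):
        return "hashtag"
    if token.startswith("@"):
        return "mention"
    if token.lower().startswith(("http://", "https://")):
        return "url"
    if token == "RT":
        return "retweet"
    return None

def positional_trait_counts(tweets):
    """Count each trait separately at the beginning, end or middle of a tweet."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        for i, tok in enumerate(tokens):
            trait = trait_of(tok)
            if trait is None:
                continue
            if i == 0:
                position = "begin"
            elif i == len(tokens) - 1:
                position = "end"
            else:
                position = "middle"
            counts[(trait, position)] += 1
    return counts

tweets = ["RT @ana great talk #nlp", "@bob see http://t.co/x now", "#monday"]
counts = positional_trait_counts(tweets)
```

    Each `(trait, position)` pair then becomes one feature of the author's document,
for example `("hashtag", "end")`.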
    Additionally, we include the frequency of features that carry subjective
information, for example the number of positive or negative emoticons. The words used
were categorized as Positive (P), High Positive (HP), Negative (N) or High Negative
(HN), and the frequencies of these categories are used as features. We used a word
polarity resource for Spanish and English taken from [12], emotion resources for
Spanish [8] [11] and for English [2], and appraisal resources for Spanish [9] [10] and
for English [1].
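
    As a rough sketch of how such category frequencies can be obtained, consider the
following. The mini-lexicon and emoticon sets are invented for illustration and do not
reproduce the contents of the resources cited above.

```python
# Hypothetical mini-lexicon; entries and categories are invented for illustration.
POLARITY_LEXICON = {"good": "P", "love": "HP", "bad": "N", "hate": "HN"}
POSITIVE_EMOTICONS = {":)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":'("}

def subjectivity_features(tokens):
    """Count polarity categories (P, HP, N, HN) and positive/negative emoticons."""
    feats = dict.fromkeys(["P", "HP", "N", "HN", "emo_pos", "emo_neg"], 0)
    for tok in tokens:
        if tok in POSITIVE_EMOTICONS:
            feats["emo_pos"] += 1
        elif tok in NEGATIVE_EMOTICONS:
            feats["emo_neg"] += 1
        else:
            category = POLARITY_LEXICON.get(tok.lower())
            if category is not None:
                feats[category] += 1
    return feats

feats = subjectivity_features("I love this but hate mondays :( :)".split())
```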


2.2    Classification stage:
   To classify the set of tweets of an author with respect to the demographic traits of
gender and language variety, we tried two strategies. In the first, each document (the
set of tweets of an author) is used as an instance of the class to which it belongs. In
the second, we construct a prototype of each class using the features extracted from
the set of documents belonging to the class. Each strategy was evaluated on the tweet
collections of the training set, and the one that showed more stable results across
different executions was selected for the final evaluation.
   In the instance-based strategy, we calculate the similarity of the new document with
each sample document of a class and then compute the average similarity obtained with
that class. This is done for each class of a demographic trait, and the object is
assigned to the class with which it obtains the greatest average similarity. In the
prototype-based strategy, the similarity of the new document is calculated with the
class prototype. This is done for each class, and the object is assigned to the class
for which the similarity obtained is the highest (1-NN [4][16]).
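
   Both decision rules can be sketched in a few lines. The similarity function is not
specified in this section, so cosine similarity over raw term counts is used here
purely as an assumption; the function names and the toy training data are also
hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (an assumed similarity)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_instance_based(doc, samples_by_class):
    """Assign the class with the highest average similarity to its sample documents."""
    averages = {
        label: sum(cosine(doc, s) for s in samples) / len(samples)
        for label, samples in samples_by_class.items()
    }
    return max(averages, key=averages.get)

def classify_prototype_based(doc, samples_by_class):
    """Assign the class whose single merged prototype is most similar to the document."""
    prototypes = {
        label: sum(samples, Counter())
        for label, samples in samples_by_class.items()
    }
    return max(prototypes, key=lambda label: cosine(doc, prototypes[label]))

def bow(text):
    return Counter(text.split())

# Toy training data, invented for illustration; every class needs at least one sample.
training = {
    "female": [bow("love this song"), bow("love coffee")],
    "male": [bow("match tonight"), bow("great match")],
}
new_doc = bow("love the coffee")
pred_instance = classify_instance_based(new_doc, training)
pred_prototype = classify_prototype_based(new_doc, training)
```

   On this toy input both rules agree, but they can differ in general: averaging over
instances gives every author equal weight, whereas a merged prototype is dominated by
the authors who contribute the most text.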
   The classification is done independently for each demographic trait of the author:
gender (2 classes) and language variety (6 classes for English and 7 classes for
Spanish). The final result is the combination of these two classifications.


3      Experiments and results

The initial experiments were performed with the training collection released for the
2017 task. We evaluated the accuracy obtained with 2-fold cross-validation. In
addition, we considered the training collection of the 2015 edition for the gender and
age classes. These collections are described in [18] [6]. Table 1 gives the values
obtained in the tests with the two representation strategies, the instance-based one
and the prototype (centroid)-based one, using the 2017 collection. Table 2 presents
the results with the 2015 collection.
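
A minimal sketch of this evaluation protocol is given below. It assumes that the
"join" rows of the tables count the authors for whom both traits are predicted
correctly, and it uses a simple random split; how the folds were actually drawn is not
stated here. The `predict` callback and the toy data are hypothetical.

```python
import random

def two_fold_accuracy(authors, predict, seed=0):
    """authors: list of (document, gender, variety).
    predict(train, document) -> (gender, variety). Returns accuracy per trait and joint."""
    shuffled = authors[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    folds = [shuffled[:half], shuffled[half:]]
    hits = {"gender": 0, "variety": 0, "joint": 0}
    for i in (0, 1):
        test, train = folds[i], folds[1 - i]   # each half is used once for testing
        for doc, gender, variety in test:
            pred_gender, pred_variety = predict(train, doc)
            gender_ok = pred_gender == gender
            variety_ok = pred_variety == variety
            hits["gender"] += gender_ok
            hits["variety"] += variety_ok
            hits["joint"] += gender_ok and variety_ok
    return {name: count / len(authors) for name, count in hits.items()}

# Toy data and a trivial constant predictor, for illustration only.
toy_authors = [
    ("d1", "female", "us"), ("d2", "female", "gb"),
    ("d3", "male", "us"), ("d4", "male", "gb"),
]
scores = two_fold_accuracy(toy_authors, lambda train, doc: ("female", "us"))
```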

            Table 1. Accuracy in 2-fold cross-validation on the 2017 training set

            Strategy           Trait               Spanish    English
            Instance-based     Gender              0.6        0.56
                               Language variety    0.2        0.23
                               Joint               0.12       0.14
            Prototype-based    Gender              0.63       0.65
                               Language variety    0.3        0.3
                               Joint               0.19       0.2

            Table 2. Accuracy in 2-fold cross-validation on the 2015 training set

            Strategy           Trait               Spanish    English
            Instance-based     Gender              0.68       0.56
                               Age                 0.46       0.45
                               Joint               0.29       0.21
            Prototype-based    Gender              0.68       0.58
                               Age                 0.21       0.1
                               Joint               0.17       0.1

   Evaluating the results shown in the two tables, we consider that the classification
is more stable with the instance-based strategy, so we decided to use this
configuration in the evaluation of this year's task. The results obtained can be
observed in the summary published by the organizers and in the following table.
   The results with the test dataset are shown in [18] and presented on the PAN web
site. We obtained the lowest values of all the participants, and our software ran
successfully only on the English dataset. Comparing our results with those of the BOW
baseline, which uses the 1000 most frequent terms, we conclude that one of our problems
is that we need to analyze and reduce the features used. We processed the dataset using
the instance-based strategy, and perhaps the results could be better if we used the
prototype-based strategy with feature selection methods.
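
   To make the kind of feature reduction discussed above concrete, the following sketch
restricts a Bag of Words representation to the k most frequent terms. It is not the
organizers' baseline, whose details beyond "the 1000 most frequent terms" are not given
here; the function names and the toy documents are assumptions.

```python
from collections import Counter

def top_k_vocabulary(documents, k=1000):
    """Return the k most frequent terms over all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())
    return [term for term, _ in totals.most_common(k)]

def vectorize(doc, vocabulary):
    """Represent a document by its term counts over the restricted vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

# Toy documents, invented for illustration; k is tiny only to keep the example readable.
docs = ["a b a", "a c", "b a"]
vocabulary = top_k_vocabulary(docs, k=2)
vector = vectorize("b c b a", vocabulary)
```
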


4      Conclusions and future work

A representation that considers the terms used in tweets is able to differentiate, to a
large extent, the sets of tweets written by authors of different genders. The proposed
subjectivity and opinion features improve the classification, but not substantially. In
the evaluation we made with the 2015 collections, we verified that each of the sets of
features separately allows good identification of gender and that their combination
increases the values obtained. Language variety classification maintains low results,
and to a great extent this is due to the small difference observed between some of
these classes and to the fact that many terms used by the authors are universal and
standardized in the community.
   We achieved the lowest values of all the teams, and considering that a baseline
method using the 1000 most frequent terms in a Bag of Words representation obtained
better results, we need to carry out an exhaustive evaluation of our method.
   We must work on feature selection strategies and on the analysis of the objects that
are representative of each class. We propose to evaluate a classification with
rejection or abstention for those users whose tweets do not contain features
characteristic of their class, for example for language variety, so that possible
misclassifications are not penalized so heavily.
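
   One possible form of such an abstention rule, given only as a hypothetical sketch of
this future work, is to reject a document when the two best class similarities are too
close. The margin criterion, the threshold value and the function name are assumptions;
no such rule was evaluated in this paper.

```python
def classify_with_abstention(scores, min_margin=0.05):
    """scores: non-empty dict mapping class -> similarity.
    Return the best class, or None (abstain) when the top two scores are too close."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None
    return ranked[0][0]

# Invented similarity values, for illustration only.
undecided = classify_with_abstention({"us": 0.42, "gb": 0.40, "ie": 0.10})
decided = classify_with_abstention({"us": 0.60, "gb": 0.30})
```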


5      References
 1. Bloom, K., Garg, N., & Argamon, S. Extracting Appraisal Expressions. In Proceedings of
    NAACL HLT 2007. Rochester, NY: Association for Computational Linguistics. pp. 308–
    315. 2007
 2. Carlo Strapparava, Alessandro Valitutti. WordNet-Affect: an Affective Extension of
    WordNet. In Proceedings of the 4th International Conference on Language Resources and
    Evaluation. pp. 1083-1086. 2004
 3. Daniel Castro, Yaritza Adame, María Peláez Brioso, Rafael Muñoz: Authorship Verifica-
    tion, combining Linguistic Features and Different Similarity Functions. CLEF (Working
    Notes) 2015
 4. Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the
    American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-
    556, March 2009.
 5. Francisco Manuel Rangel Pardo, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin
    Potthast, Benno Stein: Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre
    Evaluations. CLEF (Working Notes) 2016: 750-784
 6. Francisco M. Rangel Pardo, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter
    Daelemans: Overview of the 3rd Author Profiling Task at PAN 2015. CLEF (Working
    Notes) 2015
 7. Francisco Rangel, Paolo Rosso. On the Impact of Emotions on Author Profiling. In: Infor-
    mation Processing & Management, vol. 52, issue 1, pp. 73-92
 8. Grigori Sidorov, Sabino Miranda-Jiménez, Francisco Viveros-Jiménez, Alexander Gelbukh,
    Noé Castro-Sánchez, Francisco Velásquez, Ismael Díaz-Rangel, Sergio Suárez-Guerra, Ale-
    jandro Treviño, and Juan Gordon. Empirical Study of Opinion Mining in Spanish Tweets.
    LNAI 7629, 2012, pp. 1-14.
 9. Hernández, L., López-Lopez, A., & Medina-Pagola, J. E. (2009). Recognizing Polarity and
    Attitude of Words in Text. In Proc. of the 14th Portuguese Conference on Artificial
    Intelligence (EPIA'2009) (pp. 525–536). Aveiro, Portugal.
10. Hernández, L., López-Lopez, A., & Pagola, J. E. M. (2011). Classification of Attitude Words
    for Opinions Mining. International Journal of Computational Linguistics and Applications,
    2(1–2), 267–283.
11. Ismael Díaz Rangel, Grigori Sidorov, Sergio Suárez-Guerra. Creación y evaluación de un
    diccionario marcado con emociones y ponderado para el español. Onomazein , 29, 23 p.,
    2014, DOI 10.7764/onomazein.29.5
12. Jose Manuel Yero Moreno, Reynier Ortega Bueno. Método no supervisado para la
    clasificación de polaridad en Twitter. VII Conferencia Internacional de Ingeniería
    Eléctrica. pp. 1-4. Jun, 2014. ISBN: 978-959-207-529-0.
13. Lluís Padró, Evgeny Stanilovsky. FreeLing 3.0: Towards Wider Multilinguality Proceedings
    of the Language Resources and Evaluation Conference (LREC 2012) ELRA. Istanbul, Tur-
    key. May, 2012.
14. Mirco Kocher, Jacques Savoy: UniNE at CLEF 2016: Author Profiling. CLEF (Working
    Notes) 2016: 903-911
15. Maria José Garciarena Ucelay, Maria Paula Villegas, Dario G. Funez, Leticia C. Cagnina,
    Marcelo Luis Errecalde, Gabriela Ramírez-de-la-Rosa, Esaú Villatoro-Tello: Profile-based
    Approach for Age and Gender Identification. CLEF (Working Notes) 2016: 864-873
16. Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval,
    Volume 1, Issue 3, March 2008.
17. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC
    2001. Mahwah: Lawrence Erlbaum Associates 71 (2001)
18. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at
    PAN 2017: Gender and Language Variety Identification in Twitter. In: Working Notes Pa-
    pers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-
    WS.org (Sep 2017)