<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniNE at CLEF 2017: Author Profiling Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mirco Kocher</string-name>
          <email>Mirco.Kocher@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacques Savoy</string-name>
          <email>Jacques.Savoy@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Neuchâtel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes and evaluates a supervised author profiling model. The suggested strategy can be adapted easily to various languages (such as Arabic, English, Spanish, and Portuguese). As features, we suggest using the m most frequent terms of the query text (isolated words and punctuation symbols with m at most 200). Applying a simple distance measure and looking at the nearest text profiles, we can determine the gender (with the nominal values “male” or “female”) and the language variety (e.g., in Spanish the nominal values “Argentina”, “Chile”, “Colombia”, “Mexico”, “Peru”, “Spain”, or “Venezuela”). The training and test data consist of Twitter tweets (PAN AUTHOR PROFILING task at CLEF 2017). An analysis of the top ranked terms from a feature selection method allows a better understanding of the proposed assignments and presents typical writing styles for each category.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Social network applications produce a large amount of information (e.g., texts, pictures,
videos, and links) at an unprecedented scale. Texts shared on sites such as Facebook
and Twitter have their own characteristics, vastly different from essays, literary texts,
or newspaper articles. This is because anybody can publish unrevised content, and because
of the pressure to interact quickly. We can observe a large variability in
spelling and grammar. Moreover, new terms tend to appear, and emoji are frequently
used to denote the author’s emotions or state of mind.</p>
      <p>The central question is whether we can detect the author’s gender from writings in those
sources, and what the significant differences are between men and women in their
writing style. Similarly, can we detect the features that best discriminate between
writings in different language varieties? The spelling difference between British
English and American English is well defined, but can we detect a variation between the
US and Canada, or between Ireland and Great Britain, and can we discriminate between New
Zealand and Australia? Furthermore, since profiling is based on Twitter tweets, the
spelling may not always be perfect, and more sociocultural traits could be detected.
Other interesting problems emerge from blogs and social networks,
such as detecting plagiarism, recognizing stolen identities, or rectifying wrong
information about the writer. Therefore, proposing an effective algorithm for the
profiling problem is of clear interest.</p>
      <p>These author profiling questions can be transformed into authorship attribution
questions with a closed set of possible answers. Determining the gender of an author
can be seen as attributing the text in question to either the male or the female authors.
Similarly, the language variety detection attributes an unknown Spanish text to one
of seven groups.</p>
      <p>This paper is organized as follows. The next section presents the test collections and
the evaluation methodology used in the experiments. The third section explains our
proposed algorithm. Then, we evaluate the proposed scheme and compare it to the best
performing schemes using four different test collections. In the last section, we explain
the decisions taken and extract typical writing styles for each category. A conclusion
draws the main findings of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Test Collections and Evaluation Methodology</title>
      <p>
        The experiments supporting previous studies were usually limited to custom corpora.
To evaluate the effectiveness of different profiling algorithms, the number of tests must
be large and run on a common test set. To create such benchmarks, and to promote
studies in this domain, the PAN CLEF evaluation campaign was launched [6]. Multiple
research groups with different backgrounds from around the world have participated in
the PAN CLEF 2017 campaign. Each team has proposed a profiling strategy that has
been evaluated using the same methodology. The evaluation was performed using the
TIRA platform, which is an automated tool for deployment and evaluation of the
software [2]. The data access is restricted such that during a software run the system is
encapsulated, thus ensuring that there is no data leakage back to the task participants
[
        <xref ref-type="bibr" rid="ref7">5</xref>
        ]. This evaluation procedure also offers a fair evaluation of the time needed to
produce an answer.
      </p>
      <p>During the PAN CLEF 2017 evaluation campaign, four test collections were built.
In this context, a problem is simply defined as:</p>
      <p>Predict an author’s language variety and gender from tweets.</p>
      <p>In each collection, all the texts are written in the same language. The first benchmark is
an Arabic collection with the goal of predicting four language varieties. The
second is an English corpus containing six varieties, the third is written in Spanish and
covers seven different varieties, while the last collection is in Portuguese and based on two
language varieties. In all corpora, the additional task is to determine the author’s
gender. The training data was collected from Twitter. This year, every participant could
run a system on the test data twice. This means we could train and test a basic approach,
improve it, and test it again for the second and final run.</p>
      <p>An overview of these collections is given in Table 1. The number of samples
in the training set is given under the label “Samples” (each sample is a set of tweets)
and the mean number of tokens (isolated words and punctuation symbols) per sample
is indicated under the label “Terms”. A similar test set is then used so that we can
compare our results with those of the PAN CLEF 2017 campaign. That dataset
remained mostly undisclosed due to the TIRA system, so we have no information
about the average number of words per sample, but we expect a similar distribution.</p>
      <p>When considering the four benchmarks, we have 11,400 profiles in total to train our
system. When inspecting the distribution of the answers, we can find the same number
(5,700 in training) of female and male profiles.</p>
      <sec id="sec-2-1">
        <p>In each of the individual test collections, we can also find a balanced number of female and male profiles. The same
is the case for the language varieties, where each group has 600 samples. During the
PAN CLEF 2017 campaign, a system must provide the answer for each problem in an
XML structure.</p>
        <p>The response for the gender is a fixed binary choice and for the
language variety one of the fixed entries is expected.</p>
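        <p>As a minimal sketch of such a response, the following Python fragment serializes one decision as a single XML element. The element name author and the attribute names id, lang, variety, and gender are assumptions made for illustration, as are the example identifier and label values; the campaign's actual schema may differ.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

GENDERS = {"male", "female"}  # fixed binary choice described in the text


def build_response(author_id: str, lang: str, variety: str, gender: str) -> str:
    """Serialize one profiling decision as a single XML element.

    The element and attribute names are hypothetical, chosen for illustration.
    """
    if gender not in GENDERS:
        raise ValueError(f"unexpected gender label: {gender!r}")
    element = ET.Element(
        "author",
        {"id": author_id, "lang": lang, "variety": variety, "gender": gender},
    )
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical author identifier and labels.
    print(build_response("user-0001", "es", "mexico", "female"))
```
]]></preformat>
        <p>Validating the gender label before serialization reflects the fixed binary choice; a similar check against the list of varieties of the corresponding language could be added in the same way.</p>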
        <p>[Equation (1), not recoverable from the source: a term selection score defined as a function of (a, b, c, d) and n, written as a sum of two log2 terms]</p>
        <p>
          where a, b, c, d, and n are used as indicated in Table 2. For instance, a represents the
frequency of a given term ω (e.g., “the” or “people”) in each class Γ (e.g., “female” or
“Mexico”) while d is the sum of all other terms in all other classes.
For determining the number of useful features denoted m, previous studies have shown
that a value between 200 and 300 tends to provide the best performance [
          <xref ref-type="bibr" rid="ref4">1, 7</xref>
          ]. The
Twitter tweets contained many different hashtags (keywords preceded by a number
sign) and numerous unique hyperlinks. To minimize the number of terms with a single
occurrence, we conflated all hashtags into a single feature and combined the
morphological variants of Twitter links into another feature. The effective number of
terms m was set to the 100 highest ranked terms for each gender and the 70 highest ranked
terms for each language variety. In the first run we also included the 10 lowest ranked terms as a
counter-indication for a given category, while this was omitted in the second run. Since
there is some overlap when combining the highest ranked terms of one class with
those of another, the length of the generated feature list was below 400 even for the Spanish
collection containing seven different language classes. With this reduced number, the
justification of the decision is simpler to understand because it is based on
words instead of letters, bigrams of letters, or combinations of several representation
schemes or distance measures.
        </p>
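        <p>The preprocessing and the merging of per-class term lists described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the regular expressions, the lowercasing step, the placeholder tokens __link__ and __hashtag__, and the function names are all hypothetical.</p>
        <preformat><![CDATA[
```python
import re

# Assumed patterns; the paper does not specify how links and hashtags are detected.
LINK = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
TOKEN = re.compile(r"\w+|[^\w\s]")  # a word, or a single punctuation symbol


def tokenize(tweet: str) -> list[str]:
    """Split a tweet into words and punctuation symbols.

    All links collapse into one feature and all hashtags into another.
    Lowercasing is an assumption made for this sketch.
    """
    text = LINK.sub(" __link__ ", tweet.lower())
    text = HASHTAG.sub(" __hashtag__ ", text)
    return TOKEN.findall(text)


def build_feature_list(ranked: dict[str, list[str]], top: int) -> list[str]:
    """Merge the `top` highest ranked terms of every class, dropping duplicates."""
    features: list[str] = []
    seen: set[str] = set()
    for terms in ranked.values():
        for term in terms[:top]:
            if term not in seen:
                seen.add(term)
                features.append(term)
    return features


if __name__ == "__main__":
    print(tokenize("Go #NHL! http://t.co/x1"))
    # Hypothetical rankings; the overlap ("b") shortens the merged list.
    print(build_feature_list({"female": ["a", "b", "c"], "male": ["b", "d", "e"]}, top=2))
```
]]></preformat>
        <p>Because the per-class lists overlap, the merged list is shorter than the sum of the individual list lengths, which is the effect noted for the Spanish collection.</p>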
        <p>In the current study, a profiling problem is defined as a query text, denoted Q,
containing a set of Twitter tweets. We then have multiple authors A with a known
profile. To measure the distance between Q and A, in the first run we used a variant of
the L1-norm called Canberra as shown in Equation 2, while in the second run we used
a variant of the L2 norm called Clark as shown in Equation 3:</p>
        <disp-formula><label>(2)</label><tex-math>\Delta(Q, A) = \sum_{i=1}^{m} \frac{\left| P_Q[t_i] - P_A[t_i] \right|}{P_Q[t_i] + P_A[t_i]}</tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math>\Delta(Q, A) = \sqrt{\sum_{i=1}^{m} \left( \frac{\left| P_Q[t_i] - P_A[t_i] \right|}{P_Q[t_i] + P_A[t_i]} \right)^{2}}</tex-math></disp-formula>
        <p>where m indicates the number of terms (words or punctuation symbols), and PQ[ti] and
PA[ti] represent the estimated occurrence probability of the term ti in the query text Q
or in the author profile A respectively. To estimate these probabilities, we divide the
term occurrence frequency (denoted tfi) by the length in tokens of the corresponding
text (n), Prob[ti] = tfi / n. Due to the simple difference underlying the two equations,
we do not apply any smoothing procedure to our probability estimation.</p>
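        <p>The probability estimate and the two distance measures can be sketched in a few lines of Python. The function names are illustrative, and skipping a term that is absent from both texts (which would otherwise give 0/0) is an assumption of this sketch; the paper does not state how that case is handled.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def relative_frequencies(tokens: list[str], features: list[str]) -> list[float]:
    """Estimate Prob[t_i] = tf_i / n for every feature, without smoothing."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[t] / n if n else 0.0 for t in features]


def canberra(p_q: list[float], p_a: list[float]) -> float:
    """Canberra distance: sum of |q - a| / (q + a) over all features."""
    total = 0.0
    for q, a in zip(p_q, p_a):
        denom = q + a
        if denom > 0:  # assumption: a term missing from both texts contributes 0
            total += abs(q - a) / denom
    return total


def clark(p_q: list[float], p_a: list[float]) -> float:
    """Clark distance: square root of the sum of squared Canberra terms."""
    total = 0.0
    for q, a in zip(p_q, p_a):
        denom = q + a
        if denom > 0:  # same assumption as in canberra()
            total += (abs(q - a) / denom) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    features = ["go", "!", "the"]
    p_q = relative_frequencies(["go", "!", "go", "x"], features)
    p_a = relative_frequencies(["the", "go", "the", "!"], features)
    print(canberra(p_q, p_a), clark(p_q, p_a))
```
]]></preformat>
        <p>Each term contributes a value between 0 and 1, so a term that occurs in only one of the two texts always contributes the maximum of 1, regardless of its frequency.</p>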
        <p>To determine the gender and variety of Q, we take the k nearest neighbors in the
m-dimensional vector space and use majority voting. In case there is a tie between
multiple language varieties, we select the nearest group among them. In the first run,
the parameter k was set to k=9. In the second run, we increased k to k=15 for the two
smaller collections (Arabic and Portuguese) and set k=25 for the two bigger corpora
(English and Spanish). This decision was taken because of the relatively large amount
of data available, and to gain a more stable system less affected by outliers or the
imperfection of Twitter tweets. A summary of all parameters in the two runs is
presented in Table 3.</p>
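        <p>A minimal sketch of the voting step follows. It assumes that the distances between the query and all training profiles have already been computed, and it interprets "the nearest group" as the tied class that owns the single closest neighbor; both the function name and this reading of the tie rule are assumptions.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def knn_vote(neighbours: list[tuple[float, str]], k: int) -> str:
    """Return the majority label among the k nearest (distance, label) pairs.

    Ties are resolved in favour of the tied class that owns the closest neighbour.
    """
    if not neighbours or k < 1:
        raise ValueError("need at least one neighbour and k >= 1")
    nearest = sorted(neighbours, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    best = max(votes.values())
    tied = {label for label, count in votes.items() if count == best}
    # `nearest` is sorted by distance, so the first tied label is the closest one.
    for _, label in nearest:
        if label in tied:
            return label
    raise AssertionError("unreachable: a tied label always occurs in `nearest`")


if __name__ == "__main__":
    # Hypothetical distances between a query text and five training profiles.
    sample = [(0.1, "peru"), (0.2, "chile"), (0.3, "chile"), (0.4, "peru"), (0.9, "spain")]
    print(knn_vote(sample, k=3), knn_vote(sample, k=4))
```
]]></preformat>
        <p>With k=3 the example returns the plain majority, while with k=4 two classes are tied and the class of the closest profile is returned.</p>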
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Evaluation</title>
      <p>Our system is based on a supervised approach, and we can evaluate it using a modified
leave-one-out approach on the training set. Instead of retrieving the k nearest
neighbors, we returned k+1 candidates, but ignored the closest profile. The nearest
sample was in fact the query text with a distance of zero and thus could also serve as a
check of correctness. In Table 4a and Table 4b, we have reported the same
performance measures applied during the PAN 2017 campaign, namely the joint
accuracy of the gender and language variety.</p>
      <p>The algorithm clearly returns the best results for the Portuguese collection as a result
of both the high gender detection accuracy and the high language variety prediction
accuracy. With the leave-one-out approach and with the large size of all collections,
we expect the results to be robust and a good prediction for the test dataset.</p>
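        <p>The modified leave-one-out procedure can be sketched as follows, assuming that every training profile is a feature vector and that a distance function is supplied by the caller. The function names and the toy data in the example are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Callable, Sequence

Vector = Sequence[float]


def leave_one_out_predictions(
    profiles: Sequence[Vector],
    labels: Sequence[str],
    distance: Callable[[Vector, Vector], float],
    k: int,
) -> list[str]:
    """Predict a label for every training profile from its k nearest other profiles."""
    predictions = []
    for query in profiles:
        order = sorted(range(len(profiles)), key=lambda j: distance(query, profiles[j]))
        closest, neighbours = order[0], order[1:k + 1]  # k+1 candidates, first ignored
        # Correctness check from the text: the closest profile lies at distance zero.
        assert distance(query, profiles[closest]) == 0
        votes = Counter(labels[j] for j in neighbours)
        # With equal counts, most_common keeps the label encountered first (the nearer one).
        predictions.append(votes.most_common(1)[0][0])
    return predictions


def accuracy(predicted: Sequence, expected: Sequence) -> float:
    """Share of positions where the prediction equals the expected answer."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


if __name__ == "__main__":
    toy_profiles = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    toy_labels = ["f", "f", "f", "m", "m", "m"]
    guessed = leave_one_out_predictions(
        toy_profiles, toy_labels, lambda a, b: abs(a[0] - b[0]), k=2
    )
    print(accuracy(guessed, toy_labels))
```
]]></preformat>
        <p>A joint accuracy can be obtained with the same helper by pairing the two predictions, for example accuracy(list(zip(gender_pred, variety_pred)), list(zip(gender_true, variety_true))), so that a profile counts as correct only when both answers are right.</p>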
      <p>The test set is then used to rank the performance of all 22 participants in the
competition. Based on the same evaluation methodology, we achieve the results
depicted in Table 5a and Table 5b corresponding to our two runs for all problems
present in the four test collections. As we can see, the joint scores on the test corpus are
very similar to the training results. For the Arabic and English corpora, the test results
closely resemble the corresponding results on the training collections. In the
Spanish collection, the test performance is marginally higher (+3.5% change, +8.4%
difference), while for the Portuguese dataset, the results are slightly lower (-2.8%
change, -3.5% difference). Overall, the system seems to perform stably, independent of
the underlying text collection.</p>
      <p>This year, there were 22 participants and the task organizers provided 3 additional
baselines<sup>1</sup>. To put our achieved performance values from Table 5b in perspective,
Table 6 shows our results in comparison with the best participant, the three
baselines, and the mean performance over all participants' scores. The columns with the
average gender score, the average language variety score, and the average joint score
are each the mean over all four languages. The final overall value for the ranking is the
mean of those three average values. Overall, we are at rank 16<sup>2</sup>, which is above the
average PAN scores and two of the provided baselines.</p>
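        <p>The aggregation used for the ranking can be sketched as below. The accuracy values in the example are invented placeholders and not results from the campaign; the function name and the dictionary layout are likewise assumptions.</p>
        <preformat><![CDATA[
```python
from statistics import mean

TASKS = ("gender", "variety", "joint")


def overall_score(scores: dict[str, dict[str, float]]) -> tuple[dict[str, float], float]:
    """Average each task over all languages, then average the three task means."""
    averages = {task: mean(per_lang[task] for per_lang in scores.values()) for task in TASKS}
    return averages, mean(averages.values())


if __name__ == "__main__":
    # Placeholder accuracies for illustration only.
    placeholder = {
        "ar": {"gender": 0.7, "variety": 0.8, "joint": 0.6},
        "en": {"gender": 0.8, "variety": 0.8, "joint": 0.6},
        "es": {"gender": 0.7, "variety": 0.9, "joint": 0.6},
        "pt": {"gender": 0.8, "variety": 0.9, "joint": 0.8},
    }
    per_task, final = overall_score(placeholder)
    print(per_task, final)
```
]]></preformat>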
      <sec id="sec-3-1">
        <p><sup>1</sup> http://pan.webis.de/clef17/pan17-web/author-profiling.html</p>
        <p><sup>2</sup> http://www.tira.io/task/author-profiling/</p>
        <p>When analyzing the top ranked terms from the feature selection method between the
two genders or the language variety groups, we can obtain a better understanding of the
proposed assignments. The gain ratio selects both features that are overly present in
a category and features whose rarity is a counter-indication of a given
category. Thus, the selected features are usually the same for both gender classes. To
present typical features for each category individually, we use the Mutual Information
for the terms in Table 7. This feature selection method assigns a high value only to the
overused terms, which gives us a clearer differentiation<sup>3</sup>.</p>
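        <p>The contrast between the two kinds of selection functions can be illustrated with textbook definitions computed from a 2x2 contingency table. This sketch uses pointwise mutual information and information gain (the quantity that a gain ratio normalizes); these formulas are standard definitions and are not claimed to match Equation 1. The meaning of b (occurrences of the term in the other classes) and c (occurrences of other terms in the class) is an assumption, since only a and d are described in the text, and the counts in the example are invented.</p>
        <preformat><![CDATA[
```python
import math


def pmi(a: int, b: int, c: int, d: int) -> float:
    """Pointwise mutual information between a term and a class."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the term never occurs in the class
    return math.log2(a * n / ((a + b) * (a + c)))


def information_gain(a: int, b: int, c: int, d: int) -> float:
    """Information gain: the expected PMI over the four cells of the table."""
    n = a + b + c + d
    cells = [
        (a, a + b, a + c),  # term present, inside the class
        (b, a + b, b + d),  # term present, outside the class
        (c, c + d, a + c),  # other terms, inside the class
        (d, c + d, b + d),  # other terms, outside the class
    ]
    total = 0.0
    for count, term_margin, class_margin in cells:
        if count > 0:  # convention: 0 * log(0) = 0
            total += (count / n) * math.log2(count * n / (term_margin * class_margin))
    return total


if __name__ == "__main__":
    overused = (50, 1, 950, 999)   # invented counts: term frequent in the class
    underused = (1, 50, 999, 950)  # mirrored counts: term rare in the class
    print(pmi(*overused), pmi(*underused))
    print(information_gain(*overused), information_gain(*underused))
```
]]></preformat>
        <p>For the mirrored counts, the information gain is the same in both cases, while the pointwise mutual information is positive only for the overused term. This is why a score of the second kind separates the typical terms of each category.</p>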
        <p>In many cases, the different usage of geographical and topical terms can explain the
decision for the classification. Some location related terms are for instance in Arabic
(الكويت = Kuwait, الاردن = Jordan, طرابلس = Tripoli, الجزائر = Algeria, تونس = Tunis, التونسي
= Tunisia), in English (Canberra, Sydney, aust, Adelaide, jp, aus, Vancouver, Toronto,
Edinburgh, Glasgow, Bristol, Dublin, Ireland, Belfast, Wellington, Auckland, nz,
Zealand, Dunedin, DC), in Spanish (chilenos, Bogotá, Cali, Medellín, mx, Monterrey,
Lima, peruano(s), Perú, Peru, Alcalá, Cataluña, Zulia, Caracas, venezolanos), and in
Portuguese (Brasil, Portugal).</p>
        <p>For topical words, we have different examples in Arabic (مدرب = coach; الدوري =
league; #صلاة = #Prayer), in English (NHL; makeup; Microsoft), in Spanish (lagos =
lakes, forestales = forests, incendios = fires, viña = vineyard, medicinas = medicines),
and in Portuguese (campeonato = championship, jogador = player, ranking).</p>
        <p>Additionally, names of famous people in politics, music, and sports appear
frequently, such as in Arabic (عايزة = Aiza), in Spanish (Zidane, Macri, Piñera, Duarte,
Goya, Rajoy), in English (Turnbull, Abbott, Malcolm, Reuters, Jedward, Byrne, Conor,
Ethan), and in Portuguese (Eduardo).</p>
        <p>Very frequent terms such as pronouns and determiners also appear in the top 10
highest ranked terms. There are examples in Arabic (إنى = I am; انتى = you; ده = this),
in Spanish (nosotras = we; vos, os, vosotros = you), and in Portuguese (vc, você = you;
tô = I am).</p>
        <p>
          Furthermore, the frequent appearance of various heart shaped emoji in the female
categories of Table 7 in all four languages confirms previous findings that women tend
to use more expressions related to social and emotion words than men [
          <xref ref-type="bibr" rid="ref6">4</xref>
          ].
        </p>
        <p><sup>3</sup> Some terms depend on the context in which they are used and cannot be translated accurately.</p>
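        <table-wrap>
          <label>Table 7</label>
          <caption><p>Top ranked terms for each category according to the Mutual Information</p></caption>
          <table>
            <thead>
              <tr><th>Language</th><th>Category</th><th>Top terms (space separated)</th></tr>
            </thead>
            <tbody>
              <tr><td>Arabic</td><td>Female</td><td>عارفة عايزة سلمى عايزه ماما حبيبتي عارفه إنى</td></tr>
              <tr><td>Arabic</td><td>Male</td><td>مدرب الدوري حازم تغريدة مدريد ٰ شرح video حان liked</td></tr>
              <tr><td>Arabic</td><td>Egypt</td><td>دى اللى تانى كام يعنى انتى النهاردة دلوقتي بقت ده</td></tr>
              <tr><td>Arabic</td><td>Gulf</td><td>الحين شلون محد الكويت دايم فيني مافيه ٰ بو كفو</td></tr>
              <tr><td>Arabic</td><td>Levantine</td><td>اشي الاردن بدك الأردن هاي هيك حدا هلأ بده منيح</td></tr>
              <tr><td>Arabic</td><td>Maghrebi</td><td>ليبيا طرابلس الجزائر هدا تونس #صلاة التونسي معجزة تاع</td></tr>
              <tr><td>English</td><td>Female</td><td>leo taurus virgo xxx makeup xx bingo</td></tr>
              <tr><td>English</td><td>Male</td><td>)' badge arsenal earned league microsoft wire players developer rangers</td></tr>
              <tr><td>English</td><td>Australia</td><td>canberra turnbull sydney aust abbott malcolm jp adelaide scarlet aus</td></tr>
              <tr><td>English</td><td>Canada</td><td>vancouver toronto canadians canadian 220 nhl txt canvas rsvp</td></tr>
              <tr><td>English</td><td>Great Britain</td><td>edinburgh filthy glasgow factual unlimited reuters mural bristol drafted gems</td></tr>
              <tr><td>English</td><td>Ireland</td><td>dublin ireland commented irish scorpio jedward byrne conor capricorn belfast</td></tr>
              <tr><td>English</td><td>New Zealand</td><td>wellington auckland nz kiwi zealand dunedin earthquake )' roundup</td></tr>
              <tr><td>English</td><td>United States</td><td>gorsuch emerald dems ethan scotus dc aca obamacare infamous nsc</td></tr>
              <tr><td>Spanish</td><td>Female</td><td>♡ orgullosa cansada pedidos nosotras angie dormida siiii celosa</td></tr>
              <tr><td>Spanish</td><td>Male</td><td>dt jugó rival refuerzos delantero clubes colo cont zidane libertadores</td></tr>
              <tr><td>Spanish</td><td>Argentina</td><td>posta hs podes vos orto lpm pelotuda bue pelotudo macri</td></tr>
              <tr><td>Spanish</td><td>Chile</td><td>wn piñera colo lagos incendios po metropolitana forestales viña chilenos</td></tr>
              <tr><td>Spanish</td><td>Colombia</td><td>bogotá bogota uribe corridas boletas falcao lleras cali plebiscito medellín</td></tr>
              <tr><td>Spanish</td><td>Mexico</td><td>neta mx éxico monterrey pinches duarte hidalgo slim pri</td></tr>
              <tr><td>Spanish</td><td>Peru</td><td>ppk lima peruanos soles ptm perú peru oe muni peruano</td></tr>
              <tr><td>Spanish</td><td>Spain</td><td>psoe os vosotros enhorabuena goya pp rajoy vuestro alcalá cataluña</td></tr>
              <tr><td>Spanish</td><td>Venezuela</td><td>mud zulia vzla caracas chavista an venezolanos medicinas chavismo hampa</td></tr>
              <tr><td>Portuguese</td><td>Female</td><td>sozinha cansada obrigada ranking achavam enviadas simpático apaixonada acordada</td></tr>
              <tr><td>Portuguese</td><td>Male</td><td>link eduardo | obrigado milhões by • ): jogador campeonato</td></tr>
              <tr><td>Portuguese</td><td>Brazil</td><td>tô fazendo vc você kkkkk at kkkkkk brasil querendo assistir</td></tr>
              <tr><td>Portuguese</td><td>Portugal</td><td>tou portugal isto cenas crlh gira xd merdas percebo lol</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>This paper proposes a supervised technique to solve the author profiling problem. If a
person’s writing style may reveal his/her demographics, we propose to characterize the
style by considering terms (isolated words and punctuation symbols) selected using the
gain ratio method. To take the profiling decision, we propose using the k nearest
neighbors according to a distance measure based on the L1 or L2 norm.</p>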
        <p>The proposed approach tends to perform very well in Portuguese Twitter tweets for
both gender and language variety prediction. The performance of the gender detection
in Arabic, English, and Spanish was acceptable, while the language variety
classification was good considering the large number of categories. The final results
on the test collections were as expected from the training corpora, indicating that no
over-fitting occurred. Such a classifier strategy can be described as having a high bias
but a low variance [3]. Even if the proposed system cannot capture all possible stylistic
features (bias), changing the available data does not modify significantly the overall
performance (variance).</p>
        <p>Moreover, the proposed profiling can be clearly explained because it is based on a
reduced set of features on the one hand and, on the other, those features are words or
punctuation symbols. Thus, the interpretation for the final user is clearer than when
working with a huge number of features, when dealing with n-grams of letters, or when
combining several similarity measures. The decision can be explained by large
differences in relative frequencies (or probabilities) of frequent words (usually
corresponding to functional terms), topical words, or geographical terms. We were able
to show that there is a difference in writing style between the genders and between the tested
language variety groups.</p>
        <p>To improve the current classifier, we could investigate the effect of other feature
selection strategies. In this case, we want to maintain a reduced number of terms, but
we can take more account of the underlying text genre; for example, the emoji frequently
used in tweets carry more implicit expressions and meanings. Furthermore, we
could use external resources to harvest geographical names related to the different
countries and regions to facilitate the language variety prediction. As another possible
improvement, we can ignore terms appearing only infrequently in a class. One might
also try to exploit PAN-specific properties such as the requirement for equally
distributed male/female problems and language variety groups.</p>
        <p>Acknowledgments. The authors want to thank the task coordinators for their
valuable effort to promote test collections in author profiling. This research was
supported, in part, by the NSF under Grant #200021_149665/1.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Delta: A Measure of Stylistic Difference and a Guide to Likely Authorship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>267</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2012</year>
          . Ousting Ivory Tower Research:
          <article-title>Towards a Web Framework for Providing Experiments as a Service</article-title>
          . In: Hersh,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            , &amp;
            <surname>Sanderson</surname>
          </string-name>
          , M. (eds.)
          <source>SIGIR. The 35th International ACM</source>
          ,
          <volume>1125</volume>
          -
          <fpage>1126</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>The Elements of Statistical Learning</article-title>
          .
          <source>Data Mining, Inference, and Prediction</source>
          . Springer-Verlag: New York (NY).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          7.
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Comparative Evaluation of Term Selection Functions for Authorship Attribution</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>246</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Machine Learning in Automatic Text Categorization</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>The Secret Life of Pronouns. What our Words Say about us</article-title>
          . Bloomsbury Press: New York (NY).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: - Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In: Kanoulas,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , &amp;
            <surname>Toms</surname>
          </string-name>
          ,
          <string-name>
            <surname>E</surname>
          </string-name>
          . (eds.)
          <source>CLEF. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8685</volume>
          ,
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer: Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          . In:
          <article-title>CLEF 2017 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>