Author Profiling Using Semantic and Syntactic Features
                            Notebook for PAN at CLEF 2019

        György Kovács1,2 , Vanda Balogh3 , Purvanshi Mehta4 , Kumar Shridhar4 ,
                        Pedro Alonso1 , and Marcus Liwicki1
       1
         Embedded Internet Systems Lab, Luleå University of Technology, Luleå, Sweden
           2
             MTA-SZTE Research Group on Artificial Intelligence, Szeged, Hungary
                3
                  Institute of Informatics, University of Szeged, Szeged, Hungary
                               4
                                 MindGarage, Kaiserslautern, Germany
        gyorgy.kovacs@ltu.se, bvanda@inf.u-szeged.hu, purvanshi.mehta11@gmail.com,
            shridhar.stark@gmail.com, pedro.alonso@ltu.se, marcus.liwicki@ltu.se



           Abstract In this paper we present our approach to the PAN 2019 Author Pro-
           filing challenge. The task is to detect Twitter bots, and to classify the gender
           of human Twitter users as male or female, based on a hundred tweets selected
           from their profile. Focusing on feature engineering, we explore the semantic
           categories present in tweets. We combine these semantic features with
           part-of-speech tags and other stylistic features – e.g. character floodings and the
           use of capital letters – for our eventual feature set. We experimented with
           different machine learning techniques, including ensemble techniques, and found
           AdaBoost to be the most successful (attaining an F1-score of 0.99 on the devel-
           opment set). Using this technique, we achieved an accuracy of 89.17% for
           English language tweets in the bot detection subtask.


1     Introduction
With the increasing use of social media [5], and its growing effect on our lives, it is
becoming more and more important to provide automatic methods that are capable
of processing social media content. For one, it is paramount for companies interested
in targeted advertisement to automatically identify certain traits of users, such as age,
location, personality, and gender, even if the users do not report these traits themselves
(although this application admittedly raises many ethical concerns and challenges).
More important, however, is the identification of fake news and the detection of social
media bots. With the growing role of social media as a primary news source [1], and
the increasing effect of social media bots on political discourse [9] (in particular, their
ability to effectively spread a large amount of misinformation in critical times [18]), it is
vital to have the ability to monitor or even filter out such accounts. This, however, first
requires the ability to efficiently identify such accounts. For this reason, when working
on the bots and gender profiling PAN challenge [23,22], our main area of focus was the
bot detection task.
    Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons Li-
    cense Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano,
    Switzerland.
1.1   Related Work


Social media analytics has a wide range of applications, from understanding customer
sentiment to determining the political orientation of a crowd. Another rapidly growing
area of application of social media analytics is that of bot identification and fake
news detection. The methods deployed in these tasks range from various classical
machine learning algorithms [8] to more recent deep learning approaches [29].
    Decision trees have been a popular choice in the task of bot vs human classifica-
tion. For example, Botometer [30], a popular bot detection tool, uses random forests to
identify Twitter bots. Hall et al. [13] also apply random forests, in their case to remove
bots from Wikidata. Another good quality of decision trees is that they work well across
languages, as shown by their power to classify stance and gender in Spanish tweets in the
work of Vinayakumar et al. [29] for the IberEval 2017 task [27]. Besides decision trees, other
well-known machine learning algorithms have also been used for the task, namely Sup-
port Vector Machines (SVMs) [7,29], Logistic Regression [7], and K-Nearest Neigh-
bours [10]. Convolutional Neural Networks (CNNs) [14], Recurrent Neural Networks
(RNNs) [7], and combinations of the two [3] have also been used for opinion detection
in social media.



1.2   Bot and Gender Profiling


The research problem undertaken in this work is the PAN 2019 bot and gen-
der profiling task. As the challenge is described in detail in the accompanying overview
papers [6,23], we only give a short description of the task here, and refer the reader
to the aforementioned publications for more detail. In this challenge, each team
performs the task of classifying Twitter profiles, based on a randomly selected set of
a hundred tweets, as bots or humans. Furthermore, in case an author is identified as
human, the additional task is to identify the gender of said human as male or female.
For submissions and evaluation, the PAN task uses TIRA virtual machines, where teams
upload and run their software [21]. The author profiling challenge is organised for both
English and Spanish language tweets, but due to time constraints, here we only tackle
the problem for English. Given sufficient time, however, the methods described in this
paper could be applied to Spanish as well.



Data Partitioning While testing is carried out on a held-out dataset that is not pub-
licly available, the training data of 4120 Twitter profiles was publicly released, and is
available in XML format. The classes here are balanced, which means that half of the
profiles belong to bots, while the other half belong to human Twitter users. Likewise,
half of the human authors are female, and the other half are male. For our experiments
we partition this data into training and validation sets, using a randomly selected 67%
of the data for training purposes, and 33% for meta-parameter optimization, as well as
for validating our trained models.
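
A minimal sketch of this split in scikit-learn is given below; the feature matrix and labels are placeholder data standing in for the extracted profile vectors, and the random seed is illustrative, as the paper does not fix one:

    # A minimal sketch of the 67/33 split described above; X and y are
    # placeholders standing in for the extracted profile vectors and labels.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(4120, 159)          # 159-dimensional profile vectors (see Section 4)
    y = np.repeat(["bot", "human"], 2060)  # balanced classes, as in the training data

    X_train, X_val, y_train, y_val = train_test_split(
        X, y,
        test_size=0.33,   # 33% for meta-parameter optimization and validation
        random_state=42,  # illustrative seed; not reported in the paper
        stratify=y,       # preserve the balanced class distribution
    )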
2     Methods

Motivated by the positive results of the classical machine learning approaches mentioned
in Section 1.1, we explore how these methods fit the task at hand. In our final
submission we rely only on our best performing model (i.e. AdaBoost); however, we
find it important for future research on the topic to share our experiments with other
methods as well. Hence, in this section we discuss three widely-used methods, namely
AdaBoost, Random Forest, and Recurrent Neural Networks.


2.1   AdaBoost

Boosting [25,24] is a popular family of algorithms for ensemble learning. The main idea
behind these algorithms is to combine several "weak learners" (i.e. classifiers that may
perform poorly, but still perform better than random guessing) into a "strong learner",
or in other words, a robust classifier. Here, we use AdaBoost, an early and successful
boosting algorithm published by Freund and Schapire [12]. AdaBoost builds its strong
learner on top of the weak learners by weighting each classifier according to its
performance. To compute these weights, the weak classifiers are trained on the training
set, which allows us to estimate their probability of error; each classifier is then weighted
according to this probability (a weak learner with error rate ε receives the weight
α = ½ ln((1 − ε)/ε)) and included in the AdaBoost model.
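
As an illustration, a minimal AdaBoost sketch in scikit-learn could look as follows; the hyper-parameters are illustrative rather than the ones used in our submission, and the base_estimator argument follows the scikit-learn API of the time (it was renamed in later releases):

    # A minimal AdaBoost sketch over decision stumps; hyper-parameters are illustrative.
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    clf = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=1),  # weak learners: decision stumps
        n_estimators=100,                                    # number of boosting rounds
        learning_rate=1.0,
    )
    clf.fit(X_train, y_train)       # X_train, y_train as in the split above
    print(clf.score(X_val, y_val))  # mean accuracy on the validation set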


2.2   Random Forest

Random Forest [2] is a supervised ensemble classifier in which bootstrapping is used
to draw multiple training subsets from the training data. An individual decision tree
is trained on each such subset, typically considering only a random subset of features
at each split. The final classification is given by collecting the decisions of all trees and
choosing the class with the maximum score; the scoring can be done by assigning equal
votes to the final decisions of all trees, or by adopting a weighted strategy that assigns
unequal weights to the final decisions of the resulting trees.
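
A minimal Random Forest sketch in scikit-learn is given below; the number of trees and the feature-subset rule are illustrative, not the values used in our experiments:

    # A minimal Random Forest sketch with illustrative hyper-parameters.
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=200,     # trees, each trained on a bootstrap sample of the data
        max_features="sqrt",  # random feature subset considered at each split
        n_jobs=-1,            # train the trees in parallel
    )
    forest.fit(X_train, y_train)
    y_pred = forest.predict(X_val)  # equal-vote (majority) aggregation over all trees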


2.3   Recurrent Neural Networks

Deep learning attempts to model high-level abstractions in data. Here, we deploy a pop-
ular deep learning architecture, namely Recurrent Neural Networks (RNNs). RNNs are
particularly suited for tasks where the output depends not only on the present input,
but also on inputs several time steps in the past. The contextual meaning within
a tweet and the order of tweets carry extra information, prompting the need for
methods that have the potential to exploit these dependencies. As these depen-
dencies may be long-term (spanning up to a hundred tweets), a vanilla RNN may face
the problem of vanishing gradients; we therefore use the Long Short-Term Memory
(LSTM) variant in our work.
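
A minimal bidirectional LSTM sketch is given below; the framework (Keras), the input shape, and the 64-dimensional tweet representation are our assumptions for illustration, as the paper does not fix them:

    # A minimal bidirectional LSTM sketch; shapes and framework are assumptions.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(100, 64)),          # 100 tweets per profile, 64-dim tweet vectors
        layers.Bidirectional(layers.LSTM(64)),  # LSTM gating counters vanishing gradients
        layers.Dense(1, activation="sigmoid"),  # bot vs human probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])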
3     Features
Based on the results of preliminary experiments using neural networks, our focus was
on combining classical machine learning algorithms with carefully engineered features.
The same set of features is employed for both the bot detection and the gender predic-
tion tasks. We calculated most of these features for each tweet independently, then aver-
aged them over a profile; when the computations were carried out differently, we state
this explicitly. During our experiments, we noticed that some features share the same
value for all Twitter profiles; these features were subsequently dropped. Lastly, after feature
extraction we scaled our final set of features using scikit-learn’s StandardScaler [19].
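
A minimal sketch of this pipeline is shown below; tweet_features() is a hypothetical stand-in for the extractors of Sections 3.1-3.5:

    # A minimal sketch of the profile-level feature pipeline: per-tweet features
    # are averaged over the profile, constant features are dropped, and the
    # result is standardized with scikit-learn's StandardScaler.
    import numpy as np
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.preprocessing import StandardScaler

    def tweet_features(tweet):
        # Hypothetical per-tweet extractor returning a fixed-length vector.
        return np.array([len(tweet), tweet.count("http")], dtype=float)

    def build_matrix(profiles):
        X = np.stack([np.mean([tweet_features(t) for t in p], axis=0) for p in profiles])
        X = VarianceThreshold(0.0).fit_transform(X)  # drop features constant across profiles
        return StandardScaler().fit_transform(X)     # zero mean, unit variance per feature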

3.1   URL Features
We experimented with several features based on the URLs present in tweets, particu-
larly domain-based features (e.g. the ratio of the most commonly linked domains, the
ratio of links leading to Twitter, the ratio of the most commonly linked Twitter profiles).
However, as the majority of URLs present in the tweets were processed by link-
shortening services, extracting these features required resolving the links over the
Internet, which is not possible in the TIRA virtual machine [20,21]. Hence, in the final
feature set we confine ourselves to the average number of URLs present in a Twitter profile.
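
A minimal sketch of this feature is given below; the regular expression is an assumption, not necessarily the exact pattern used in our code:

    # A minimal sketch of the URL feature: average number of URLs per tweet.
    import re

    URL_RE = re.compile(r"https?://\S+")  # assumed pattern for (shortened) links

    def avg_url_count(tweets):
        return sum(len(URL_RE.findall(t)) for t in tweets) / len(tweets)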

3.2   Emoticon Features
Another feature used in our experiments is the number of emoticons (or emojis) present
in each tweet. For the extraction of this feature we use the freely available emoji for
Python project [16]. Following the work of Chen et al. [4], we also experimented
with higher-level features based on the emoji use of Twitter profiles, including
both the emoji frequency and the emoji preference features of the original publication
(for more details, see [4]). In our preliminary experiments, however, these features did
not significantly improve the results of either task. Thus, in our final submission we
only use the average emoticon count per tweet in our feature set.
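
A minimal sketch of this feature using the emoji package is given below; the emoji_count helper is an assumption about the installed package version:

    # A minimal sketch of the emoticon feature using the emoji package [16].
    import emoji

    def avg_emoji_count(tweets):
        return sum(emoji.emoji_count(t) for t in tweets) / len(tweets)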

3.3   Stylistic Features
For each tweet we detect and count character floodings, capital letters, sentences and
tokens. The average number of capital letters per word is also taken into consideration
alongside the Flesch reading-ease score (FRES) [11], calculated as follows:

    FRES(text) = 206.835 − 1.015 · (#words(text) / #sentences(text)) − 84.6 · (#syllables(text) / #words(text)).
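
As a concrete illustration, a rough implementation of this score could look as follows; the syllable counter is a naive vowel-group heuristic (our assumption), not a pronunciation dictionary:

    # A rough FRES sketch with a naive vowel-group syllable heuristic.
    import re

    def count_syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fres(words, n_sentences):
        n_words = len(words)
        n_syllables = sum(count_syllables(w) for w in words)
        return 206.835 - 1.015 * n_words / n_sentences - 84.6 * n_syllables / n_words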
Furthermore, on the tweet and profile level, we count the number of tokens that are repeated
more than two times, and among these repeated tokens we report the maximum number
of repetitions. For example, in the tweet “Hairy cats like other cats that are
not hairy. However, hairy dogs like cats that are not hairy.”, the tokens repeated
more than two times are hairy and cats, so the number of repeated tokens is 2, and the
token hairy is repeated the most times, namely 4. Altogether, we have 10 stylistic features.
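
A minimal sketch of the flooding and repetition counts is given below; the three-character flooding threshold is our assumption:

    # A minimal sketch of two stylistic features: character floodings and token
    # repetitions, matching the worked example above.
    import re
    from collections import Counter

    FLOOD_RE = re.compile(r"(.)\1{2,}")  # e.g. "soooo" contains one flooding

    def flooding_count(text):
        return len(FLOOD_RE.findall(text))

    def repetition_features(tokens):
        counts = Counter(t.lower() for t in tokens)
        repeated = {t: c for t, c in counts.items() if c > 2}
        # For the example tweet: 2 repeated tokens ("hairy", "cats"), maximum 4.
        return len(repeated), max(repeated.values(), default=0)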
Figure 1: Histograms on the average use of a certain type of POS per Twitter profile,
comparing bots to humans and males to females, respectively. (a) Average number of
adverbs (e.g. very, tomorrow, up, who, there) used among bot and human profiles.
(b) Average number of pronouns (e.g. I, you, he, myself, themselves, someone) used
among bot and human profiles. (c) Average number of adjectives (e.g. big, nice, green,
last) used among male and female profiles. (d) Average number of nouns (e.g. girl,
dog, book, beauty) used among male and female profiles.


3.4           POS Tags

We count the POS tags for each tweet using spaCy’s POS tagger [15], covering a total
of 19 POS tags. The average number of POS tags per profile can indeed be
informative: Figures 1a and 1b illustrate that humans tend to use more pronouns and
adverbs in their tweets than bots do. Furthermore, as Figures 1c and 1d indicate, females
on average include more adjectives and nouns in their tweets than males do.
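
A minimal sketch of the POS counting with spaCy is given below; the model name is an assumption:

    # A minimal sketch of the POS features using spaCy [15].
    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed English model

    def pos_counts(tweet):
        # Coarse POS tag counts for one tweet, e.g. {"NOUN": 3, "ADV": 1};
        # these are then averaged over the hundred tweets of a profile.
        return Counter(token.pos_ for token in nlp(tweet))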


3.5           Topic Features

Our motivation is to explore the semantic topics and categories an author tends to tweet
about. For this reason, we employ the SEMCAT [26] and the SemCor [17] datasets on
lemmatized words. The SEMCAT (SEMantic CATegories) dataset contains more than
6,500 English words grouped under 110 semantic categories describing diverse types of
relations. SemCor is a WordNet-annotated corpus that captures, among others, seman-
tic category annotations for verbs and nouns. We use the SemCor dataset constructed
by Tsvetkov et al. [28], where words appearing less than 5 times are omitted. This
leaves us with more than 4,000 words and 41 categories. Table 1 shows representative
SEMCAT and SemCor categories and their sample words. The categories (and their
words) are not differentiated based on their source datasets, which means that we work
with a total of 133 topic features. As illustrated in Figure 2a, more bot profiles than
human profiles use a high average number of computer-related words, whereas, as
Figure 2b shows, humans tend to tweet more about emotions.

    category   sample words
    car        auto buggy car hybrid jeep limo
    clothes    apparel bikini fashion fur jeans ring
    family     children engaged engagement family love wife
    food       breakfast carbohydrate chocolate cook hungry restaurant
    money      atm bank currency euro investor withdraw
    weather    biosphere cyclone degree humidity meteorology unstable
                              (a) SEMCAT

    category        sample words
    animal          cow dog eggs fur horn tail
    body            artery bathe neck nucleus relax shave
    communication   counsel debate description horn interview session
    food            beer honey lamb leg produce ration
    location        aegean area baltimore china location neighborhood
    time            0 acceleration calendar future youth yr
                              (b) SemCor

Table 1: Representative categories and their sample words from the two datasets


Figure 2: Histograms on the average number of words related to a certain semantic
category per Twitter profile, comparing bots with humans and males with females.
(a) Average use of computer-related words for bot and human profiles. (b) Average use
of emotion-related words for bot and human profiles. (c) Average use of Christmas-
related words for male and female profiles. (d) Average use of baseball-related words
for male and female profiles.
Comparing males with females, Figure 2c indicates that females use more Christmas-
related words in their tweets, while males tweet more about baseball, as shown in Figure 2d.
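
A minimal sketch of the topic feature extraction is given below; category_words is a hypothetical dictionary mapping the 133 SEMCAT/SemCor category names to their word sets:

    # A minimal sketch of the topic features: each lemmatized token found in a
    # semantic category increments that category's counter.
    from collections import Counter

    def topic_counts(lemmas, category_words):
        counts = Counter()
        for lemma in lemmas:
            for category, words in category_words.items():
                if lemma in words:
                    counts[category] += 1
        return counts  # e.g. {"computer": 2, "emotion": 1}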

4      Results and Discussion
After concatenating all features, each Twitter profile was described by a 159-dimensional
feature vector. As discussed in Section 1.2, to carry out our experiments we first split the
dataset into a training and a validation set in a 2:1 ratio. We thus created a training set
with 2760 examples, and a validation set with 1360 examples. We split these data sets
further to create separate training and validation sets for the two sub-tasks, namely bot
detection (a two-class classification task with bot and human labels) and gender classi-
fication (a two-class classification task with male and female labels). Lastly, we combined
the two models to perform a three-class classification task with bot, male, and female
labels (a sketch of this combination is given below). In this section we discuss our
experimental results in this order: first, the results of the bot vs human classification
task; then the results of the gender classification task and of the three-class classification
task; and lastly, the results we attained on the held-out official test set. It should also
be noted that the results reported here, as well as the code for our experiments, are
available on GitHub1 .
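
A minimal sketch of this combination is given below; bot_clf and gender_clf are hypothetical names for the two trained binary classifiers:

    # A minimal sketch of combining the two binary models into the three-class
    # prediction: gender is only predicted for profiles classified as human.
    def predict_three_class(bot_clf, gender_clf, X):
        labels = []
        for x in X:
            if bot_clf.predict([x])[0] == "bot":
                labels.append("bot")
            else:
                labels.append(gender_clf.predict([x])[0])  # "male" or "female"
        return labels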

4.1     Bot vs Human Classification
We benchmarked the bot vs human classification task using six popular classification
methods. The resulting precision, recall, and F1 scores are listed in Table 2. As can
be seen, classical machine learning algorithms performed much better than
bi-directional LSTMs. Furthermore, all ensemble methods – random forest, AdaBoost,
bagging classifier, gradient boost classifier – resulted in higher scores than those at-
tained using SVMs. Table 2 also shows that the best performance was achieved when
using one of the two boosting methods, with AdaBoost performing slightly better. For
this reason, in the remaining tasks our focus was on ensemble methods, and we did not
carry out further experiments with LSTMs or Support Vector Machines.
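
A minimal sketch of how per-class precision, recall and F1 scores such as those in Table 2 can be obtained with scikit-learn is given below:

    # A minimal evaluation sketch on the validation set.
    from sklearn.metrics import classification_report

    y_pred = clf.predict(X_val)  # clf: any of the fitted classifiers above
    print(classification_report(y_val, y_pred, digits=2))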


                             Classifiers        Precision Recall F1 Score
                           Random Forest           97      97       97
                              AdaBoost             99      99       99
                         Bagging Classifier        97      97       97
                      Gradient Boost Classifier    98      98       98
                      Support Vector Machines      94      94       94
                        Bi-directional LSTM         -       -       83
Table 2: Average precision, recall and F1-score (in percent) on the bot vs human clas-
sification task on the validation set, using various classification methods.



 1
     https://github.com/purvanshi/Gender-and-bot-detection
     Class              Precision   Recall   F1-score
     female                83         84        83
     male                  81         87        84
     weighted average      83         85        83
                (a) Random Forest Classifier

     Class              Precision   Recall   F1-score
     female                88         92        90
     male                  90         89        89
     weighted average      88         91        89
                  (b) AdaBoost Classifier

     Class              Precision   Recall   F1-score
     female                84         85        84
     male                  80         87        83
     weighted average      83         86        83
                   (c) Bagging Classifier

     Class              Precision   Recall   F1-score
     female                82         85        84
     male                  83         84        84
     weighted average      82         85        84
                 (d) Gradient Tree Boosting

Table 3: Precision, Recall and F1-score (in percent) of the gender classification task on
the validation set using various ensemble methods for classification.




4.2   Gender Classification Task (Male vs Female)



The markedly higher performance resulting from the use of decision trees on the initial
bot detection task supported our earlier decision to focus on classical machine
learning algorithms. Thus, for later tasks we only carried out experiments using the
four ensemble methods that provided higher scores. In these further experiments we
first examined the capability of these ensemble methods to differentiate between Twitter
profiles that belong to male and female users. The resulting precision, recall, and F1
scores are listed in Table 3.
    When comparing the resulting scores in Table 3 to those in Table 2 we see that all
algorithms result in markedly higher scores when applied for bot detection than when
the same algorithms are applied for gender classification. This may suggest that the task
of gender classification is more difficult than that of bot detection. It can also signify,
however, that the two tasks require a different set of features, or different machine learn-
ing methods. Another possible explanation for this phenomenon may be that we have
twice as much data available for the task of bot detection as we do for the task of
gender classification. A more thorough investigation of this question is left for future work,
as the present experimental results are not sufficient to provide a definitive answer.
    Table 3 also shows that with each classifier we have similar scores – at most 1% F1-
score difference – for the male and the female class. It can also be observed that recall
scores tend to be slightly higher than precision scores, with the exception of AdaBoost,
where the precision score for the male class is slightly higher than the recall score for
the same class. Lastly, we can also notice that while the weighted average of F1-scores
is very similar for three of the methods, it is significantly higher for AdaBoost. We
reported higher scores for AdaBoost on the bot detection task as well; the difference
here, however, is much more pronounced.
     Class              Precision   Recall   F1-score
     bot                   98         95        96
     female                83         83        83
     male                  81         84        82
     weighted average      90         89        89
                (a) Random Forest Classifier

     Class              Precision   Recall   F1-score
     bot                  100         98        99
     female                88         92        90
     male                  90         88        89
     weighted average      94         94        94
                  (b) AdaBoost Classifier

     Class              Precision   Recall   F1-score
     bot                   99         94        97
     female                84         85        84
     male                  80         86        83
     weighted average      90         90        90
                   (c) Bagging Classifier

     Class              Precision   Recall   F1-score
     bot                   98         97        98
     female                82         84        83
     male                  83         83        83
     weighted average      90         90        90
                 (d) Gradient Tree Boosting

Table 4: Classification results on the three-class classification task.

4.3   Three Class Classification Task (Bot vs Male vs Female)
As a final experiment on the validation set, we evaluated the performance of the decision
tree based classifiers on the three-class classification task (bot vs male vs female). The
resulting scores are listed in Table 4. For each classifier, the bot class has significantly
higher scores (above 90%), while the male and female classes have scores around
80-85%, which may indicate that male vs female classification is more difficult than
bot detection. The scores in Table 4 also show that AdaBoost attains a markedly higher
performance than the other three decision tree based classification methods used in our
experiments.

4.4   Discussion
In all three experiments, we found the F1 scores provided by the AdaBoost classifier to
be the highest (on average 3% higher than the scores attained using the other decision
tree based classifiers). Another interesting observation is the similar performance of the
other three decision tree based methods, which we suspect may indicate that none of
these methods is generally better than the others on this task. We also found that deep
learning based methods (bidirectional LSTMs, in particular) did not perform well on
the task. This might be due to the limited amount of available data, an issue accentuated
by the rules of the competition limiting the use of extra data, which prevented the use
of transfer learning that might have alleviated the problem of data scarcity.

4.5   TIRA Evaluation
Lastly, we evaluated our best performing method on the official test set of the
competition. According to the regulations of the competition, the results of only one
(the last) run were to be shared by the organisers; for this reason we submitted AdaBoost
only, as it was the best performing method in our preliminary experiments on the
development set.
                     Task                Validation    Test
                 Bot detection             99.04%     89.17%
                 Gender classification     93.75%     35.87%
Table 5: Accuracy scores obtained using AdaBoost for the bot detection and gender
classification tasks, on our development set and on the official test set.


    The resulting accuracy scores are listed in Table 5, which indicates a marked drop
in performance from the validation set to the test set. This drop in performance is
less pronounced on the task of bot detection, as the performance of AdaBoost on the
test set is still close to 90%. One possible explanation for this is that the bots in the
two sets come from different domains. In Section 3.5, for example, we discuss the prevalence
of computer-related topic words in the tweets of bot profiles; this, however, may be due
to the overrepresentation in the training set of bots that advertised positions in the IT
industry. The drop is much more striking in the case of gender classification. We should
note here, however, that due to an error in the process of generating output (the algorithm
mistakenly outputs a male or female label for the gender task, even if it has already
identified the profile as a bot), our ceiling here is only 50%, and thus we do not think this
score is representative of the generalisation capabilities of our model. Overall, we can
say that as far as the generalisation ability of our model is concerned, there is still much
room for improvement.


5   Conclusions and Future Work

In this paper we proposed an efficient way to extract semantic and syntactic features
from Twitter profiles. For this we make use of the URLs, emoticons, tokens, and capital
letters in the tweets as different features. The syntactic features were extracted using
POS tags, while the semantic categories were taken from the SEMCAT and SemCor
datasets, which altogether capture 133 categories. We presented results on binary (hu-
man vs bot, male vs female) and three-class (bot, male, female) classification tasks using
various machine learning and deep learning techniques. In future work, the languages
used in tweets could be analyzed or added as another feature. Moreover, while in this
work we used the same features for bot and gender detection, different semantic features
could be employed for the two tasks, and the topic modelling could also be combined
with the emotions expressed in the tweets.


6   Acknowledgements

This work was supported by the National Research, Development and Innovation Of-
fice of Hungary through the Artificial Intelligence National Excellence Program (grant
no.: 2018-1.2.1-NKP-2018-00008). Furthermore, this research was also supported by
the project "Integrated program for training new generation of scientists in the fields of
computer science", no. EFOP-3.6.3-VEKOP-16-2017-0002. The project has been sup-
ported by the European Union and co-funded by the European Social Fund.
References
 1. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. Journal of
    Economic Perspectives 31(2), 211–236 (2017)
 2. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (Oct 2001),
    https://doi.org/10.1023/A:1010933404324
 3. Cai, C., Li, L., Zeng, D.: Behavior enhanced deep bot detection in social media. In: 2017
    IEEE International Conference on Intelligence and Security Informatics (ISI). pp. 128–130.
    IEEE (2017)
 4. Chen, Z., Lu, X., Ai, W., Li, H., Mei, Q., Liu, X.: Through a gender lens: Learning usage
    patterns of emojis from large-scale android users. In: Proceedings of the 2018 World Wide
    Web Conference. pp. 763–772. WWW ’18 (2018)
 5. Chou, W.y.S., Hunt, Y.M., Beckjord, E.B., Moser, R.P., Hesse, B.W.: Social media use in the
    United States: Implications for health communication. J Med Internet Res 11(4) (Nov 2009)
 6. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht,
    G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of
    PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and
    Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H.,
    Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth Interna-
    tional Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
 7. Daneshvar, S., Inkpen, D.: Gender Identification in Twitter using N-grams and LSA: Note-
    book for PAN at CLEF 2018. In: CEUR Workshop Proceedings. vol. 2125 (2018)
 8. Dickerson, J.P., Kagan, V., Subrahmanian, V.S.: Using sentiment to detect bots on twitter:
    Are humans more opinionated than bots? In: Proceedings of the 2014 IEEE/ACM Inter-
    national Conference on Advances in Social Networks Analysis and Mining. pp. 620–627.
    ASONAM ’14, IEEE Press, Piscataway, NJ, USA (2014)
 9. Ferrara, E.: Disinformation and social bot operations in the run up to the 2017 french presi-
    dential election. First Monday 22 (06 2017)
10. Ferrara, E., Varol, O., Menczer, F., Flammini, A.: Detection of promoted social media cam-
    paigns. In: tenth international AAAI conference on web and social media (2016)
11. Flesch, R.: A new readability yardstick. Journal of Applied Psychology 32(3), 221–233
    (1948)
12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
    and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (Aug 1997),
    http://dx.doi.org/10.1006/jcss.1997.1504
13. Hall, A., Terveen, L., Halfaker, A.: Bot detection in wikidata using behavioral and other
    informal cues. Proc. ACM Hum.-Comput. Interact. 2(CSCW), 64:1–64:18 (Nov 2018)
14. el Hjouji, Z., Hunter, D.S., des Mesnards, N.G., Zaman, T.: The impact of bots on opinions
    in social networks. CoRR abs/1810.12398 (2018), http://arxiv.org/abs/1810.12398
15. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings,
    convolutional neural networks and incremental parsing. To appear (2017)
16. Kim, T., Wurster, K.: Emoji for python. https://pypi.org/project/emoji/ (2019)
17. Miller, G.A., Leacock, C., Tengi, R., Bunker, R.T.: A semantic concordance. In: Proceedings
    of the Workshop on Human Language Technology. pp. 303–308. HLT ’93, Association for
    Computational Linguistics, Stroudsburg, PA, USA (1993)
18. Howard, P.N., Kollanyi, B.: Bots, #StrongerIn, and #Brexit: Computational propaganda dur-
    ing the UK-EU referendum. SSRN Electronic Journal (06 2016)
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
    Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
    M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine
    Learning Research 12, 2825–2830 (2011)
20. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the re-
    producibility of PAN's shared tasks. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson,
    M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation. Multilinguality,
    Multimodality, and Interaction. pp. 268–299. Springer International Publishing (2014)
21. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architec-
    ture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World -
    Lessons Learned from 20 Years of CLEF. Springer (2019)
22. Rangel, F., Rosso, P., Franco, M.: A low dimensionality representation for language variety
    identification. In: Proceedings of the 17th International Conference on Intelligent Text Pro-
    cessing and Computational Linguistics (CICLing ’16), Springer-Verlag, LNCS(9624). pp.
    156–169 (2018)
23. Rangel, F., Rosso, P.: Overview of the 7th author profiling task at PAN 2019: Bots and gender
    profiling. In: Cappellato, L., Ferro, N., Müller, H., Losada, D. (eds.) CLEF 2019 Labs and
    Workshops, Notebook Papers (2019)
24. Rokach, L.: Ensemble-based classifiers. Artificial Intelligence Review 33(1), 1–39 (Feb
    2010)
25. Schapire, R.E.: The strength of weak learnability. Mach. Learn. 5(2), 197–227 (Jul 1990),
    https://doi.org/10.1023/A:1022648800760
26. Senel, L.K., Utlu, I., Yücesoy, V., Koç, A., Çukur, T.: Semantic structure and interpretability
    of word embeddings. CoRR abs/1711.00331 (2017), http://arxiv.org/abs/1711.00331
27. Taulé, M., Martí, M.A., Pardo, F.M.R., Rosso, P., Bosco, C., Patti, V.: Overview of the task
    on stance and gender detection in tweets on catalan independence. In: Proceedings of the
    Second Workshop on Evaluation of Human Language Technologies for Iberian Languages
    (IberEval 2017) co-located with 33th Conference of the Spanish Society for Natural Lan-
    guage Processing (SEPLN 2017), Murcia, Spain, September 19, 2017. pp. 157–177 (2017)
28. Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., Dyer, C.: Evaluation of word vector repre-
    sentations by subspace alignment. In: Proc. of EMNLP. pp. 2049–2054 (2015)
29. Vinayakumar, R., Kumar, S.S., Premjith, B., Poornachandran, P., Padannayil, S.K.: Deep
    stance and gender detection in tweets on catalan independence@ibereval 2017. In: Proceed-
    ings of the Second Workshop on Evaluation of Human Language Technologies for Iberian
    Languages. pp. 222–229 (09 2017)
30. Yang, K.C., Varol, O., Davis, C., Ferrara, E., Flammini, A., Menczer, F.: Arming the public
    with Artificial Intelligence to counter social bots. Human Behavior and Emerging Technolo-
    gies p. e115 (02 2019)