
UNED at CLEF RepLab 2014: Author Profiling

Jacinto Jesus Mena Lomeña, Fernando Lopez Ostenero

jmena@alumno.uned.es, flopez@lsi.uned.es

UNED NLP & IR Group, Juan del Rosal, 16, 28040 Madrid, Spain


This paper describes a learning system developed for the RepLab 2014 author profiling task at UNED. The system uses a voting model, which employs a small set of features based mainly on the tweet text information such as POS tags, number of hashtags or number of links. In the unofficial run, the feature set was extended with Twitter metadata such as number of followers or retweet speed. The system achieved good results in author categorisation, although its performance in author ranking was low.

1 Introduction

This paper describes the participation of UNED in RepLab 2014, where we tackled the author profiling task, focused on classifying and ranking Twitter profiles using their tweet streams.

Twitter constitutes one of the main sources of data relevant for online reputation management because of its spontaneity and immediacy. However, not all tweets have the same impact: the way in which a post may affect the reputation of a company often depends on who published it. The author profiling task aims at classifying authors by type of activity and identifying the influential ones, those whose tweets are more likely to propagate quickly and widely through the network and to produce a greater effect. The final goal is to build a ranked list of the selected Twitter profiles.

The paper is organised as follows. The applied approach is introduced in Section 2, which briefly describes the features considered and the learning process. Section 3 explains the configurations of the model for author categorisation and author ranking. In Section 4, we report the results obtained for each subtask. Finally, in Section 5, we conclude and outline possible improvements of the system in the future.

2.1 Features

The model uses the following set of features:

Bag of Words: A feature set was built using a Weka filter called StringToWordVector. It contains a vector of word occurrences in a document. We used the default configuration of this Weka filter.

This feature helps to determine which words are most decisive for classifying Twitter profiles in the Author Categorisation subtask. It could be more discriminative if domain information were taken into consideration to split the classification algorithm by domain.

Number of sentences: The system used GATE [1] with the SentenceSplitter resource to obtain a feature with the number of sentences. We used a specific SentenceSplitter for each language, one for English and another for Spanish.

POS information: Seven features were built based on the POS tags. We used the GATE POS Tagger with the OpenNLP framework and different models for each language. Before running the POS tagging, we preprocessed the tweet contents to remove hashtags, mentions, and URLs, using regular expressions. After obtaining the POS tags, we built features that count the number of adverbs, verbs, adjectives, nouns, pronouns, foreign words, and abbreviations. This follows the previous work by [7], where the number of POS elements was considered for measuring polarity. These features, in our opinion, could characterise the author's writing style and could be useful in author categorisation.

Number of links: We have built a regular expression method to count the number of links in the tweet.

Similar to the point above, we consider this feature useful for the author categorisation subtask, because it reflects stylistic characteristics of the user's writing.

Number of hashtags: Following [3,4], we included a process based on regular expressions to count the number of hashtags.

The hypothesis is that the number of hashtags could be indicative of the relevancy of a tweet, as the more hashtags there are, the more topics will be involved.

Number of mentions: Again, based on the work in [3], we included a count of the number of explicit mentions of users of the form @user.

For instance, the following tweet would generate a value of 6 mentions for this feature:

still waiting on @MeganBerry's #fbumpf contribution :) kevinGEEdavis @MerlinUWard @MimiOrtega @jeremarketer @AmyVernon @IAmMrSid

Number of smileys: The system considered the number of smileys, based on the experience of [2]. In order to count smileys, we manually built a dictionary using information extracted from Wikipedia.

For example, the following tweet contains two smileys:

Buenos días :) A por un fin de semana increíble lleno de color amigs ;) http://ow.ly/i/2EXp7

Language: We used the language label provided by the RepLab 2014 organisers as a feature of the classifier.

This feature is used mainly to determine the set of words to be considered as Bag of Words.
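The count-based features above can be illustrated with a minimal Python sketch. The regular expressions and the small smiley tuple below are assumptions made for illustration only: the paper does not list its exact patterns, and the original smiley dictionary was built manually from Wikipedia. The function names `count_features` and `strip_for_pos_tagging` are likewise hypothetical.

```python
import re

# Hypothetical patterns; the exact expressions used by the system are not given.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Tiny illustrative smiley list, standing in for the manually built dictionary.
SMILEYS = (":)", ";)", ":(", ":D", ":-)", ";-)")


def count_features(tweet: str) -> dict:
    """Return the link, hashtag, mention and smiley counts of one tweet."""
    # Drop URLs before counting the rest, so that characters inside a link
    # are not mistaken for hashtags, mentions or smileys.
    text = URL_RE.sub(" ", tweet)
    return {
        "links": len(URL_RE.findall(tweet)),
        "hashtags": len(HASHTAG_RE.findall(text)),
        "mentions": len(MENTION_RE.findall(text)),
        "smileys": sum(text.count(s) for s in SMILEYS),
    }


def strip_for_pos_tagging(tweet: str) -> str:
    """Remove URLs, hashtags and mentions, as done before POS tagging."""
    text = URL_RE.sub(" ", tweet)
    text = HASHTAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


english = ("still waiting on @MeganBerry's #fbumpf contribution :) kevinGEEdavis "
           "@MerlinUWard @MimiOrtega @jeremarketer @AmyVernon @IAmMrSid")
spanish = ("Buenos días :) A por un fin de semana increíble lleno de color "
           "amigs ;) http://ow.ly/i/2EXp7")

print(count_features(english))
print(count_features(spanish))
print(strip_for_pos_tagging(english))
```

With these assumed patterns, the first example tweet yields 6 mentions and 1 hashtag, and the second yields 2 smileys and 1 link. A real dictionary-based smiley counter would need care with overlapping entries, which this sketch avoids by keeping the list small.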

In the unofficial run, we included two new features based on Twitter metadata. For that, we used Twitter4J, a Java wrapper for the Twitter REST API. We built the following new features:

Number of followers: We queried Twitter for the number of followers of every profile in the training and test data sets.

The idea was to use this feature in the Author Ranking subtask to generate weight values, whose application is described below.

Retweet speed: We examined the last retweet of each author. The retweet speed was calculated as follows, using the creation date of the tweet, its number of retweets and the creation date of the last retweet:

$$\mathit{avgTime} = \frac{\mathit{LastRTCreationTime} - \mathit{TweetCreationTime}}{\mathit{NumberOfRT}} \tag{1}$$

In order to sort elements, we built a weight measure, which was calculated using the following formula:

$$\mathit{weight} = \frac{\mathit{NumberOfFollowers}}{\mathit{AverageRTSpeed}} \tag{2}$$

This formula relates the retweet speed to the number of followers. The aim is to capture those cases in which, given two profiles, for instance one with 1,500 followers and the other with 1,600, the former shows more activity in terms of tweet propagation and retweet speed than the latter. The underlying hypothesis is that a profile with fewer followers and a higher retweet speed is more relevant than a profile with more followers and a lower retweet speed. One run was configured with this weight parameter. The bigger the weight value, the more important the profile.
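The following Python sketch works through the 1,500 versus 1,600 followers example. It rests on two assumptions that the paper does not state: times are measured in seconds, and the denominator of the weight formula is the average time per retweet from the first formula. The retweet counts and timestamps are invented for illustration, and the function names are hypothetical.

```python
from datetime import datetime


def avg_retweet_time(tweet_created: datetime,
                     last_rt_created: datetime,
                     n_retweets: int) -> float:
    """Average time per retweet, in seconds (assumed unit)."""
    if n_retweets <= 0:
        raise ValueError("n_retweets must be positive")
    return (last_rt_created - tweet_created).total_seconds() / n_retweets


def weight(n_followers: int, avg_rt_time: float) -> float:
    """Followers divided by average retweet time (assumed reading of the weight)."""
    if avg_rt_time <= 0:
        raise ValueError("avg_rt_time must be positive")
    return n_followers / avg_rt_time


def rank_profiles(weights: dict) -> list:
    """Profile names sorted by decreasing weight."""
    return sorted(weights, key=weights.get, reverse=True)


# Invented data: both tweets received 20 retweets, but profile_a's last
# retweet arrived after 10 minutes and profile_b's after 60 minutes.
created = datetime(2014, 1, 1, 12, 0)
time_a = avg_retweet_time(created, datetime(2014, 1, 1, 12, 10), 20)
time_b = avg_retweet_time(created, datetime(2014, 1, 1, 13, 0), 20)

weights = {
    "profile_a": weight(1500, time_a),
    "profile_b": weight(1600, time_b),
}
print(weights)
print(rank_profiles(weights))
```

Under these assumptions the profile with 1,500 followers is ranked first, because its tweets are retweeted six times faster, which matches the hypothesis stated above.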

Due to rate limiting, we only managed to obtain retweet speed information for about 50% of the profiles. In order to use it as a feature, an empty value was used for the remaining profiles when building the classifier for the Author Categorisation subtask. For Author Ranking, an average speed was assigned, multiplied by the number of followers.

2.2 Learning Process and Confidence Methods

The learning process of our system is composed of a voting system, a set of classifiers and a method to resolve ties by means of confidence scores.

We divided the training data set into 5 subsets, each containing 20% of the data. The 601 tweets provided by the organisers with each profile were also split into five parts. The classifiers were trained considering each tweet as an instance, instead of grouping all the data related to one profile in one instance. Four of the subsets were used to train the system employing the following Weka algorithms:

- ZeroR Algorithm
- RandomTree Algorithm [5]
- RandomForests Algorithm [8,6]
- Naive Bayes Algorithm

These four algorithms were thus trained on 80% of the data set. The remaining 20% was used to create a confidence score table.

That training set partition had nearly 300,000 tweets. We iterated tweet by tweet and stored (in a relational database) 4 rows per tweet as confidence information. As a result, we had a table with close to 1,200,000 rows (per RepLab 2014 subtask) from which to query confidence information. The following formula was used to solve those cases when at least three classifiers decided the same:

$$\mathit{confidence}(cat, algs) = \sum_{alg \in algs} \frac{\mathit{nRightClassification}(cat, alg)}{\mathit{nClassifications}(cat, alg)} \tag{3}$$

Here, cat is the category for which the confidence value has to be calculated and algs is the set of algorithms whose result was the category cat. nRightClassification is the number of correct classifications for this category produced by this algorithm, and nClassifications is the number of classifications for that category.
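A minimal Python sketch of the confidence computation follows. It assumes that the confidence table can be reduced to rows of the form (algorithm, predicted category, true category), and that nClassifications counts how often a given algorithm predicted the category; both are interpretations, since the table schema is not given. The category name "Journalist" and the row data are placeholders.

```python
from collections import defaultdict


def build_confidence_table(rows):
    """Count right and total predictions per (category, algorithm) pair.

    rows: iterable of (algorithm, predicted_category, true_category).
    """
    right = defaultdict(int)
    total = defaultdict(int)
    for alg, predicted, actual in rows:
        total[(predicted, alg)] += 1
        if predicted == actual:
            right[(predicted, alg)] += 1
    return right, total


def confidence(cat, algs, right, total):
    """Sum, over the algorithms that voted for cat, of their hit ratio on cat."""
    score = 0.0
    for alg in algs:
        n = total.get((cat, alg), 0)
        if n:  # skip algorithms that never predicted this category
            score += right.get((cat, alg), 0) / n
    return score


# Placeholder held-out predictions.
rows = [
    ("NaiveBayes", "Journalist", "Journalist"),
    ("NaiveBayes", "Journalist", "Undecidable"),
    ("RandomForest", "Journalist", "Journalist"),
    ("RandomForest", "Undecidable", "Undecidable"),
    ("RandomTree", "Undecidable", "Journalist"),
    ("RandomTree", "Undecidable", "Journalist"),
    ("RandomTree", "Undecidable", "Undecidable"),
]
right, total = build_confidence_table(rows)

# Two algorithms voted for each category: pick the more trustworthy side.
score_journalist = confidence("Journalist", ["NaiveBayes", "RandomForest"], right, total)
score_undecidable = confidence("Undecidable", ["RandomForest", "RandomTree"], right, total)
winner = "Journalist" if score_journalist > score_undecidable else "Undecidable"
print(score_journalist, score_undecidable, winner)
```

In this toy table the two algorithms voting "Journalist" accumulate 1/2 + 1/1 = 1.5, against 1/1 + 1/3 for the other side, so the tie would be resolved in favour of "Journalist".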

The confidence scores are used to decide which category is more plausible after training. Figure 1 reproduces the architecture of the confidence score component. This figure shows how the confidence scores table is populated with the outcomes of the algorithms, based on the training data.

Figure 2 illustrates how the confidence score information is used to disambiguate the results and decide which class value should be assigned to a profile.

3 Algorithms

In this section, we describe the algorithm configurations. Table 1 provides an overview of the Author Categorisation algorithms, specifying the kind of data used in each of them. "_AC" in the run identifiers indicates the Author Categorisation task, while "_AR" stands for Author Ranking.

3.1 Author Categorisation

Basic configuration. This is the first and simplest system configuration (ORM_UNED_AC_1) for author categorisation. It consists only of classification algorithms, without taking into account information about the domain. The 4 classifiers were fed with a small set of features which included BoW, POS, hashtags, mentions, links, smileys, and language. The classification result was obtained by applying a basic voting algorithm using majority rule.

In order to avoid the bias towards the most frequent class (Undecidable), a threshold was applied. The majority class label (Undecidable) was assigned only if it was supported by 80% or more of the votes. Below that threshold (80%), we classified the profile as another majority class, distinct from Undecidable.

We used 4 classifiers which classified a profile tweet by tweet. For each profile, we generated 4 class values per tweet, producing nearly 2,400 class values per profile. This information was used to obtain the majority result of the voting algorithm.
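The thresholded majority vote described above can be sketched in Python as follows. The function name `vote_profile` and the category name "Journalist" are placeholders, and breaking ties among the remaining classes by first occurrence is a simplification made for this sketch.

```python
from collections import Counter


def vote_profile(votes, majority_label="Undecidable", threshold=0.8):
    """Pick a class for one profile from its per-tweet, per-classifier votes.

    The majority label wins only when its share of the votes reaches the
    threshold; otherwise the most voted remaining class is returned.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    if counts[majority_label] / len(votes) >= threshold:
        return majority_label
    others = [label for label, _ in counts.most_common() if label != majority_label]
    # Fall back to the majority label if no other class received any vote.
    return others[0] if others else majority_label


# 70% of the votes are Undecidable: below the threshold, so the runner-up wins.
below = ["Undecidable"] * 70 + ["Journalist"] * 20 + ["Company"] * 10
# Exactly 80% of the votes are Undecidable: the threshold is met.
at_threshold = ["Undecidable"] * 80 + ["Journalist"] * 20

print(vote_profile(below))
print(vote_profile(at_threshold))
```

In the actual system each profile contributes close to 2,400 votes (4 classifiers over roughly 600 tweets), but the decision rule is the same for any number of votes.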

Basic configuration with domain features. This configuration (ORM_UNED_AC_2) includes information about the profile domain. Algorithms were defined to consider the domain element and decide which algorithm should be used. It uses the same set of features as the basic configuration, but chooses different classifiers depending on the domain.

As before, we used a threshold to avoid the bias towards the most frequent class (Undecidable), setting it at the same value. This configuration produced 8 classifiers.

Confidence scores model. This configuration used information about the confidence of the classification algorithms when their results are close to a tie. We submitted the results of this configuration as ORM_UNED_AC_3.

The confidence information was used to decide the outcome of the classification. In case of a tie, we calculated confidence scores using Equation 3. We used the same feature set and threshold as in the basic configuration.

Confidence scores and social information model. We built a last configuration using a new kind of information, social information (ORM_UNED_AC_4), after the official deadline for submitting results.

This configuration, for which we can report an unofficial result, is similar to the simple confidence score model described above, but uses two new features: number of followers and retweet speed.

We applied the same threshold as in the basic configuration to the annotations with the Undecidable class.

3.2 Author Ranking

For the Author Ranking subtask, we submitted one official run: ORM_UNED_AR_3 (see Table 1). The developed algorithm is described below.

Basic configuration. We used the following features:

- Class value of opinion maker/non opinion maker
- Number of followers
- Retweet speed

The weight function defined in Equation 2 was used to sort the ranking results.

4 Results

The test set contained three domains. We employed two domains and, in order to assign a value to the third class, we selected one of the classifiers built using the training dataset. Tables 3, 4 and 5 report the scores obtained for the evaluation metrics used in the author categorisation subtask: Reliability (R), Sensitivity (S) and F1(R, S) for each domain. For the automotive and banking domains we also include the scores of the baselines for reference.

We have described the algorithms submitted to the RepLab 2014 Author Profiling task, where we tackled both author categorisation and author ranking.

Author categorisation was our main focus at RepLab 2014. We submitted three official runs and one unofficial run. Our proposal was based on a voting system featuring a method to calculate confidence scores to solve ties in votes. However, the results obtained with the confidence method were not as good as we expected, as they were surpassed by the basic configuration. Nevertheless, although the confidence method got the worst results in Average Accuracy, it turned out to be the best in F-measure, not only among our runs but also considering the rest of the Author Categorisation task participants.

Future work in author categorisation will focus on selecting new features and improving the whole system in order to make processing more efficient. Furthermore, we will have to refine the confidence formula to avoid setting a threshold for the majority "Undecidable" class.

Regarding author ranking, the poor results can be partly explained by the lack of information for building the ranking. Due to the Twitter rate limit, we failed to obtain the necessary information about followers and retweet speed for all the profiles. Profiles without this information were assigned an average value. This distortion might have affected the system's outcome.

For author ranking, future work will focus on getting more information from Twitter, although the first step, of course, will be to improve the query process to cope with the Twitter rate limit.

1. Cunningham, H., Maynard, D., Bontcheva, K.: Text Processing with GATE. Gateway Press CA (2011)

2. Filgueiras, J., Amir, S.: Popstar at RepLab 2013: Polarity for reputation classification

3. Greenwood, M.A., Aswani, N., Bontcheva, K.: Reputation profiling with GATE. In: CLEF (Online Working Notes/Labs/Workshop) (2012)

4. Martín, T., Spina, D., Amigo, E., Gonzalo, J.: UNED at RepLab 2012: Monitoring task

5. Meina, M., Brodzinska, K., Celmer, B., Czokow, M., Patera, M., Pezacki, J., Wilk, M.: Ensemble-based classification for author profiling using various features

6. Mosquera, A., Fernandez, J., Gomez, J.M., Martínez-Barco, P., Moreda, P.: Dlsivolvam at RepLab 2013: Polarity classification on Twitter data. In: Working Notes of CLEF 2013 Evaluation Labs and Workshop (2013)

7. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1-135 (2008)

8. Saleiro, P., Rei, L., Pasquali, A., Soares, C., Teixeira, J., Pinto, F., Nozari, M., Felix, C., Strecht, P.: Popstar at RepLab 2013: Name ambiguity resolution on Twitter