<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tweets Classification Using Corpus Dependent Tags, Character and POS N-grams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlos E. González-Gallardo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azucena Montes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerardo Sierra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. Antonio Núñez-Juárez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adolfo Jonathan Salinas-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Ek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Grupo de Ingeniería Lingüística, Instituto de Ingeniería</institution>
          ,
          <addr-line>UNAM</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the Author Profiling task at the PAN 2015 contest, in which participants had to predict the gender, age and personality traits of Twitter users in four different languages (Spanish, English, Italian and Dutch). Our approach takes into account stylistic features represented by character N-grams and POS N-grams to classify tweets. The main idea of using character N-grams is to extract as much as possible of the information encoded inside the tweet (emoticons, character flooding, use of capital letters, etc.). POS N-grams were obtained using Freeling, and certain tokens were relabeled with Twitter-dependent tags. The results were very satisfactory; our global ranking score was 83.46%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Author Profiling focused on Internet texts has grown in recent years, partly because of the vast amount of information produced every minute in social networks and blogs. These Internet texts have characteristics of their own that make them hardly comparable with literary texts, documentaries or essays, a consequence of the need for quick communication and the freedom to publish unrevised content. As of March 2015, Facebook reported an average of about 936 million daily active users [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        As part of the PAN 2015 contest, this year's Author Profiling (AP) task dealt with tweets in Spanish, English, Italian and Dutch [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; for Spanish and English, the objective was to predict the gender, age and personality traits of a Twitter [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] user, while for Italian and Dutch only gender and personality traits had to be predicted. Twitter has its own rules and characteristics that users exploit to express themselves and communicate with each other. These rules can be leveraged to create corpus-dependent tags that help the classifier improve its performance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>The dataset provided this year consisted of tweets in Spanish, English, Italian and Dutch. Regarding gender, the corpus was balanced in all four languages (50% of the tweets were labeled as “female” and the other half as “male”).</p>
      <sec id="sec-2-5">
        <title>Classes and Features</title>
        <p>
          For Spanish and English, the age classes were defined in four groups (18-24, 25-34, 35-49 and 50-xx). Here the corpus was not balanced, with many “25-34” samples (around 40%) and just a few “50-xx” samples (around 10%). There were five personality traits to predict: extroverted, stable, open, conscientious and agreeable, each with a possible value between -0.5 and +0.5. It is important to mention that the samples for the personality traits were heavily imbalanced. For example, in Italian, the conscientious trait used only 5 of the 11 possible labels (-0.5, -0.4, … , +0.4, +0.5), and the number of samples per existing label varied widely.
        </p>
        <p>
          One of the main characteristics of the AP task at PAN 2015 was that it was not about classifying tweets but about classifying Twitter users based on a group of their tweets; let us call each one of these groups a document. Consequently, the vectors for the training algorithm and for each test were document-vectors formed from a group of individual tweets. Considering that a tweet can have a maximum length of 140 characters and each document contained on average about 100 tweets, the length of each document could reach about 14,000 characters, quite acceptable for extracting a good number of features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
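        <p>The grouping of tweets into author documents described above can be sketched as follows (a minimal illustration; the function name and the newline-joining convention are our own assumptions, not the authors' exact implementation):</p>
        <preformat>
```python
# Group each author's tweets into a single "document" string, since the
# task classifies Twitter users rather than individual tweets.
def build_documents(tweets_by_author):
    # tweets_by_author: dict mapping an author id to a list of tweet strings
    return {author: "\n".join(tweets)
            for author, tweets in tweets_by_author.items()}

docs = build_documents({"user1": ["hola!!!", "jajaja :)"],
                        "user2": ["good morning"]})
```
        </preformat>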
        <p>Because the task involved four different languages and we wanted the algorithm to be as language-independent as possible, content-based features would have been difficult and impractical to use; we therefore opted for stylistic features and kept things simple. The features we used can be divided into two groups: character N-grams and POS N-grams.</p>
        <p>
          Using character N-grams may seem basic and naive, but it has proven very useful and practical in previous experiments [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Beyond that demonstrated usefulness, character N-grams implicitly extract a great number of stylistic features; for example, if 3-grams are used within a document, the frequencies of all punctuation marks, character flooding (!!!, ???, …, etc.), word inflection and derivation, diminutives, superlatives and prefixes are being extracted [6,7].
        </p>
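        <p>This character N-gram extraction can be sketched as follows (a minimal illustration, not the authors' exact code); note how a punctuation run such as “!!!” becomes a 3-gram of its own:</p>
        <preformat>
```python
from collections import Counter

# Frequencies of overlapping character n-grams in a document. Punctuation
# runs ("!!!"), emoticons and capitalisation are captured implicitly.
def char_ngrams(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

grams = char_ngrams("Wow!!! :)", 3)
```
        </preformat>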
        <p>With POS N-grams it is possible to capture the writer's grammatical sequences. For POS tagging we used Freeling [8] with its corresponding configuration for Spanish, English and Italian. Freeling has no module for Dutch, so in this case we made a basic assumption: running the English Freeling module will produce errors, but if those errors follow a reasonably stable pattern, it is possible that some grammatical information is still extracted from the tweets. Depending on the input language, our software selects the configuration of extraction parameters (based on a series of tests) that maximizes its performance. These configurations take into account the value of N for the character N-grams (num_gramas) and the POS N-grams (num_POS), whether they are retroactive or not (retro_gramas and retro_POS), the N-gram representation (modo) and whether the N-grams should be represented on a logarithmic scale (frec_log).</p>
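        <p>The per-language selection of extraction parameters can be sketched as follows (a minimal sketch; only the parameter names come from the paper, while the dictionary values, language keys and fallback behaviour are illustrative assumptions):</p>
        <preformat>
```python
# Illustrative per-language extraction parameters (not the paper's tuned
# values); "modo" and "frec_log" are omitted for brevity.
CONFIGS = {
    "es": {"num_gramas": 3, "num_POS": 3, "retro_gramas": 1, "retro_POS": 1},
    "nl": {"num_gramas": 3, "num_POS": 3, "retro_gramas": 1, "retro_POS": 1},
}

def select_config(lang):
    # Fall back to a default configuration for languages without an entry.
    return CONFIGS.get(lang, CONFIGS["es"])
```
        </preformat>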
        <p>
          Task  num_gramas  num_POS  retro_gramas  retro_POS
          GA    3           3        1             1
          P     3           3        1             1
          GA    2           3        1             1
          P     3           3        1             1
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>Training and Testing</title>
        <p>For Spanish and English the goal was to predict the age and gender of a Twitter user, so we decided to classify both characteristics at the same time, yielding 8 possible classes: _F_20s, _F_30s, _F_40s, _F_50s, _M_20s, _M_30s, _M_40s and _M_50s. Age groups are explained in the next table. For Italian and Dutch just two classes were created: _F and _M.</p>
        <p>The training phase is divided into five stages: extraction, labeling, POS generation, vector creation and training.</p>
        <p>Extraction: The truth file (truth.txt) is read and analyzed, and one file is created for each possible class (a class file). The tweets are preprocessed based on the substitution rules shown in Table 7; hashtags are not preprocessed because we consider they can provide stylistic information. Each class file is separated into documents (the group of tweets of an author).</p>
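        <p>A sketch of the preprocessing done in this extraction stage; since Table 7 is not reproduced here, the substitution rules below (URLs and @mentions replaced by placeholder tokens, hashtags left untouched) are hypothetical examples in the same spirit, not the paper's actual rules:</p>
        <preformat>
```python
import re

# Hypothetical substitution rules: normalise URLs and @mentions to
# placeholder tokens; hashtags are deliberately left untouched because
# they carry stylistic information.
def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", "_URL_", tweet)
    tweet = re.sub(r"@\w+", "_MENTION_", tweet)
    return tweet

clean = preprocess("@ana check https://t.co/abc #cool")
```
        </preformat>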
        <p>Labeling: For each class file created by the extraction stage, a local instance of Freeling is called to obtain a JSON [9] list1 that contains each tweet of each document.</p>
        <p>POS generation: Once the JSON list is obtained, the POS generation stage creates a POS file with the same structure as the class file and relabels certain tokens (adding our own tags) that are needed to extract extra grammatical information.</p>
        <p>Vector creation: Character N-grams and POS N-grams are extracted from each document of each class file and POS file, respectively, based on the extraction parameters (Table 5) to produce the document-vectors. Once all the document-vectors are created, a general features-vector is generated and each document-vector is expanded to the features-vector length, obtaining the features-matrix that will be used to train the system.</p>
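        <p>The expansion of document-vectors to a common features-matrix can be sketched as follows (a standard-library illustration; the paper's actual representation, including the modo and frec_log options, is not reproduced):</p>
        <preformat>
```python
# The union of all observed n-grams becomes the general features-vector;
# each document-vector is then expanded with zeros for missing features,
# yielding the features-matrix.
def build_matrix(doc_vectors):
    features = sorted(set().union(*[set(v) for v in doc_vectors]))
    matrix = [[v.get(f, 0) for f in features] for v in doc_vectors]
    return features, matrix

feats, X = build_matrix([{"ab": 2, "bc": 1}, {"bc": 3, "cd": 1}])
```
        </preformat>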
        <p>Training: The features-matrix is passed to the learning algorithm to train the system. The algorithm we used to classify age, gender and personality traits is an implementation of a Support Vector Machine (SVM) with a linear kernel called LinearSVC [10]. Once the system has been trained, the learning model and the features-vector are serialized and saved to disk for later use.</p>
        <p>The test phase is also divided into five stages: extraction, labeling, POS generation, vector creation and testing.</p>
        <p>1 The Freeling instance is called through an interface that converts the output of Freeling into a JSON string. This interface was developed by Grupo de Ingeniería Lingüística, Instituto de Ingeniería, UNAM.</p>
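        <p>A minimal sketch of this training and serialization step, using scikit-learn's LinearSVC as cited above; the toy features-matrix, labels and file name are illustrative assumptions, not the paper's data:</p>
        <preformat>
```python
import pickle
from sklearn.svm import LinearSVC

# Toy features-matrix (rows are document-vectors) and class labels.
X = [[2, 1, 0], [0, 3, 1], [2, 0, 0], [0, 2, 2]]
y = ["_F", "_M", "_F", "_M"]

# Train the linear-kernel SVM and serialize the model to disk for later use.
model = LinearSVC()
model.fit(X, y)
with open("model.pkl", "wb") as fh:
    pickle.dump(model, fh)
```
        </preformat>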
        <p>Extraction: The XML file of the Twitter user is read and processed based on the substitution rules mentioned in Table 7, obtaining a preprocessed text file; hashtags are not preprocessed because we consider they can provide stylistic information.</p>
        <p>Labeling: The preprocessed file produced by the extraction stage is passed to a local instance of Freeling to obtain a JSON string.</p>
        <p>POS generation: Once the JSON string is obtained, the POS generation stage creates a POS file with the same structure as the preprocessed file and relabels certain tokens (adding our own tags) that are needed to extract extra grammatical information (Table 8).</p>
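        <p>The relabeling step can be sketched as follows; the concrete tag inventory of Table 8 is not reproduced here, so the Twitter-dependent tags below are illustrative assumptions:</p>
        <preformat>
```python
# After POS tagging, override the tagger's label for Twitter-specific
# tokens with corpus-dependent tags (illustrative inventory).
def relabel(tagged_tokens):
    # tagged_tokens: list of (token, pos_tag) pairs from the POS tagger
    out = []
    for token, tag in tagged_tokens:
        if token.startswith("#"):
            tag = "HASHTAG"
        elif token.startswith("@"):
            tag = "MENTION"
        elif token.startswith("http"):
            tag = "URL"
        out.append((token, tag))
    return out

tags = relabel([("#cool", "NC"), ("hola", "I")])
```
        </preformat>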
        <p>Vector creation: Character N-grams and POS N-grams are extracted from the preprocessed file and the POS file, respectively, based on the extraction parameters (Table 4) to produce a document-vector. The features-vector is then loaded to expand the document-vector.</p>
        <p>Testing: The learning model created by the training phase is loaded so that the document-vector can be tested. Once all 6 classifications (gender/gender_age, extroverted, stable, agreeable, conscientious and open) are done, the output XML file is created and written to disk.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Two measures were used to evaluate the submissions: accuracy for age and gender, and Root Mean Squared Error (RMSE) for the personality traits. For Spanish and English, an average between age and gender was also computed (both). If we compare our approach against the other two best participants (Tables 7.1-7.4), ours is slightly slower because of the grammatical analysis made by Freeling. Each corpus language had a different number of samples and a different data distribution, so computing a global average is perhaps not entirely objective; we nevertheless think it is important to do so because the task involved analyzing all four languages, and in this way we obtained a global score of 83.46%. Some extra information related to our results is presented in the Extended Results section.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The Internet has made communication very quick and fluid; an example of this is the social network Twitter, in which users have to transmit a complete message in just 140 characters. To accomplish this, it is necessary to compact as much information as possible, making each tweet a dense text (a short text with a lot of information).</p>
      <p>The use of character N-grams and POS N-grams, as shown in the results, is a good approach: with character N-grams it was possible to extract emoticons, exaggerated punctuation marks (character flooding), use of capital letters and all kinds of emotional information encoded in the tweet. With POS N-grams, in Spanish and English we were able to capture the most representative sequences of two and three grammatical elements; in Italian and Dutch we were able to capture the most frequent grammatical elements.</p>
      <p>Our approach proved to be good for the gender classification task but not as good for age classification. We will focus on finding the characteristics we are probably missing and try to give them more emphasis so that they become more representative. It is probably a good idea to separate the classification and classify age and gender independently.</p>
    </sec>
    <sec id="sec-5">
      <title>Extended Results</title>
      <p>
        Place  Participant         GLOBAL  Runtime   Age     Runtime
        1      alvarezcarmona15    0.7906  00:00:59  0.8658  00:01:31
        2      gonzalesgallardo15  0.7740  00:06:29  0.8295  00:00:29
        3      teisseyre15         0.7489  00:03:15  0.8260  00:00:01
      </p>
      <p>Acknowledgements: This work was funded by project CONACyT-México No. 215179 “Caracterización de huellas textuales para el análisis forense” (characterization of textual traces for forensic analysis).</p>
      <p>6. Stamatatos, E. (2006, August). Ensemble-based author identification using character n-grams. In Proceedings of the 3rd International Workshop on Text-based Information Retrieval (pp. 41-46).</p>
      <p>7. Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.</p>
      <p>8. Freeling, http://nlp.lsi.upc.edu/freeling/</p>
      <p>9. JSON (JavaScript Object Notation), http://JSON.org/</p>
      <p>10. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &amp; Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Facebook. «Facebook newsroom», http://newsroom.fb.com/company-info</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at pan 2015</article-title>
          . In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gareth</surname>
            <given-names>J.</given-names>
          </string-name>
          and San Juan, E. (Eds.)
          <article-title>CLEF 2015 Labs and Workshops, Notebook Papers</article-title>
          . CEURWS.org, (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Twitter</surname>
          </string-name>
          , Inc., https://twitter.com
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsur</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rappoport</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Authorship Attribution of Micro-Messages</article-title>
          . In EMNLP (pp.
          <fpage>1880</fpage>
          -
          <lpage>1891</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Doyle</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kešelj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Automatic Categorization of Author Gender via NGram Analysis</article-title>
          .
          <source>In The 6th Symposium on Natural Language Processing</source>
          , SNLP.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>