Syntactic N-grams as Features for the Author Profiling Task
Notebook for PAN at CLEF 2015

Juan-Pablo Posadas-Durán, Ilia Markov, Helena Gómez-Adorno, Grigori Sidorov, Ildar Batyrshin, Alexander Gelbukh, and Obdulia Pichardo-Lagunas
Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
http://www.cic.ipn.mx/~sidorov

Abstract. This paper describes our approach to the Author Profiling task at PAN 2015. Our method relies on syntactic features, namely syntactic n-grams of various types, in order to predict the age, gender, and personality traits of the author of a given text. We describe the features used, the classification algorithm employed, and other general ideas concerning the experiments we conducted.

1 Introduction

The Author Profiling task consists in identifying an author's personal characteristics based on a sample of the author's writing. This challenging task is of growing importance in several applications related to forensics, security, and terrorism prevention, such as identifying the author of a suspicious text. For marketing purposes, the identification of an author's profile has also proved useful for better market segmentation.

This year, at PAN 2015, the task consisted in predicting the age, gender, and personality traits of authors based on their published tweets. The participants were provided with tweets in English, Spanish, Italian, and Dutch from which to extract information about the authors' personal characteristics.

To perform the task, we used syntactic n-grams (the concept is introduced in detail in [12,9,10]) of various types (words, POS tags, syntactic relations, etc.) along with other features such as the frequencies of emoticons, hashtags, and retweets. Syntactic n-grams differ from traditional n-grams in that their elements are taken by following syntactic relations in syntactic trees, whereas in traditional n-grams the words are taken from the surface string, as they appear in the text [12,9,10]. The application of syntactic n-grams gives better results than traditional n-grams for the authorship attribution task [12,7], which makes it important to study their impact on the author profiling task.

The paper is structured as follows: Section 2 introduces the proposed approach, Section 3 presents the results of our work, and Section 4 draws conclusions and points to future work.

2 Methodology

The presented approach treats the Author Profiling task as a multilabel classification problem, where an instance is associated with seven labels. The set of labels was defined by the PAN 2015 committee and consists of features related to the personal traits of an author. Two of these labels, gender and age, were already used at PAN 2014, while the remaining labels (open, agreeable, conscientious, extroverted, and stable) were added in this new edition of the competition. They measure aspects of an author's behavior with a value ranging from −0.5 to +0.5, where the positive extreme means a very strong presence of the trait in the author's behavior, while the negative extreme implies the opposite.

Our method uses a supervised machine learning approach, where a classifier is trained independently for each label. In this way, the prediction for an instance is the union of the outputs of the individual classifiers. We use the vector space model to represent the tweets of an author and introduce syntactic n-grams as markers of personality, together with traditional SVM classifiers.
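A minimal sketch of this per-label scheme is given below; it assumes scikit-learn, and the feature matrix X and the label dictionary are illustrative placeholders rather than our exact implementation:

# Minimal sketch of the per-label classification scheme, assuming scikit-learn.
# X is a feature matrix; labels_by_name maps each label ("gender", "age", ...)
# to its array of training values. Names and structures are illustrative only.
from sklearn.svm import SVC


def train_per_label(X, labels_by_name):
    """Train one independent SVM classifier per profile label."""
    classifiers = {}
    for name, y in labels_by_name.items():
        clf = SVC(kernel="rbf")  # RBF kernel, as used in our approach
        clf.fit(X, y)
        classifiers[name] = clf
    return classifiers


def predict_profile(classifiers, x):
    """The prediction for an instance is the union of all per-label outputs."""
    return {name: clf.predict([x])[0] for name, clf in classifiers.items()}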
Data representation and feature selection details are presented in the following subsections.

2.1 Syntactic N-grams

The main motivation behind our approach is the use of syntactic n-grams as markers to model an author's features. There are different types of syntactic n-grams depending on the information used for their construction (lemmas, words, relations, and POS tags); all of them are obtained from the dependency tree but capture different linguistic aspects of a sentence. We use the ten different types proposed in [7], so that as much information from the dependency tree as possible is exploited.

A syntactic parser is required for our approach, since it allows constructing syntactic n-grams from dependency trees. Different syntactic analyzers were used: Stanford CoreNLP [5] for the English dataset, FreeLing [6,1] for the Spanish dataset, and Alpino (http://www.let.rug.nl/vannoord/alp/Alpino/) for the Dutch one. We do not present results for the Italian dataset, since we were not able to find a publicly available syntactic parser for this language.

The size of the n-grams is another important aspect to be considered. In this proposal, we use sizes in the range from 3 to 5, because several studies on the use of traditional n-grams in authorship attribution have shown that these sizes correspond to the most representative features [13,2,3].

We perform standard preprocessing on each dataset before it is parsed by the respective parser. In the preprocessing phase, we also extract several tweet-specific characteristics, such as the number of retweets, the frequency of hashtags, the frequency of emoticons, and the usage of URLs, and treat them as features. In the preprocessing phase, the sentences to be parsed are also selected according to their length, so that the criteria concerning the size limits of syntactic n-grams are satisfied. Sentences shorter than 5 tokens are treated separately, since they provide only a few syntactic n-grams and generally correspond to expressions that parsers do not process well.

2.2 Data Representation

In the vector space model approach, an instance is represented as a vector in which each dimension corresponds to a specific syntactic n-gram and its value is the n-gram's frequency. Suppose that $\{d_1, \ldots, d_n\}$ are the instances in the training corpus and $\{s_1, \ldots, s_m\}$ are the different syntactic n-grams. We build the vectors $v_j = \langle f_{1j}, \ldots, f_{mj} \rangle$, where $f_{ij}$ represents the frequency of the syntactic n-gram $s_i$ in the instance $d_j$.

As in the case of traditional n-grams, syntactic n-grams also suffer from noise, since many of them appear only once, and such rare features may not be useful for building authors' profiles.
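A simplified sketch of how word-level syntactic n-grams can be collected from a dependency tree and counted into a frequency vector is shown below; the tree encoding (token list plus head indices) and the helper names are assumptions made for this example, while the actual system uses the ten n-gram types of [7] produced by the parsers listed above:

# Illustrative sketch: collect word-level syntactic n-grams (sizes 3 to 5)
# by following head-to-dependent paths in a dependency tree, then count
# their frequencies. The tree encoding (tokens plus head indices, -1 for
# the root) and the function names are assumptions for this example.
from collections import Counter


def syntactic_ngrams(tokens, heads, n_min=3, n_max=5):
    """Count descending head-to-dependent paths of length n_min..n_max."""
    children = {i: [] for i in range(len(tokens))}
    for i, head in enumerate(heads):
        if head >= 0:
            children[head].append(i)

    counts = Counter()

    def walk(path):
        if n_min <= len(path) <= n_max:
            counts[" ".join(tokens[i] for i in path)] += 1
        if len(path) < n_max:
            for child in children[path[-1]]:
                walk(path + [child])

    for node in range(len(tokens)):
        walk([node])
    return counts


def to_vector(counts, vocabulary):
    """Frequency vector over a fixed list of syntactic n-grams (the vocabulary)."""
    return [counts.get(sngram, 0) for sngram in vocabulary]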
In order to reduce the noise in the training dataset, we apply the chi-square test as a feature selection strategy, which has proved to give good results for information retrieval tasks [14,8]. The chi-square test measures the importance of a feature for a specific class. Suppose that $s = \{s_1, \ldots, s_m\}$ are the different syntactic n-grams and $c = \{c_1, \ldots, c_k\}$ are all possible classes for a specific label. The chi-square test with one degree of freedom assigns a score to a syntactic n-gram according to Equation (1) [4]:

$$\chi^2(s_i, c_j) = \frac{(N_{11} + N_{10} + N_{01} + N_{00}) \, (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11} + N_{01}) \, (N_{11} + N_{10}) \, (N_{10} + N_{00}) \, (N_{01} + N_{00})}, \qquad (1)$$

where $N_{11}$ is the number of instances of class $c_j$ in which $s_i$ occurs; $N_{01}$ is the number of instances of class $c_j$ in which $s_i$ does not occur; $N_{10}$ is the number of instances outside class $c_j$ in which $s_i$ occurs; and $N_{00}$ is the number of instances in which neither $s_i$ nor $c_j$ occurs.

The chi-square test with one degree of freedom operates on a binary class space. Thus, for this task, where the number of classes can be greater than two, we take the maximum of $\chi^2$ over the different classes and select those syntactic n-grams whose score is greater than a certain threshold $\theta$. The final set of features for a specific label is the union of all the features selected via the chi-square test. Based on these features, we train an SVM classifier with an RBF kernel and standard vector normalization. The procedure is repeated for each label, so that one classifier is trained for each label.
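A minimal sketch of this feature selection step, implementing Equation (1) directly, is given below; the document-count inputs and the function names are assumptions for the example:

# Sketch of the chi-square feature selection step based on Equation (1).
# doc_freq[c][s] is the number of training instances of class c containing
# the syntactic n-gram s; class_size[c] is the number of instances of class c.
# These input structures and the function names are illustrative assumptions.
def chi_square(n11, n10, n01, n00):
    numerator = (n11 + n10 + n01 + n00) * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def select_features(doc_freq, class_size, threshold):
    """Keep n-grams whose maximum chi-square over all classes exceeds the threshold."""
    total = sum(class_size.values())
    all_ngrams = {s for freqs in doc_freq.values() for s in freqs}
    selected = set()
    for s in all_ngrams:
        best = 0.0
        for c in class_size:
            n11 = doc_freq[c].get(s, 0)                 # s occurs, class is c
            n01 = class_size[c] - n11                   # s absent, class is c
            n10 = sum(doc_freq[o].get(s, 0)
                      for o in class_size if o != c)    # s occurs, class is not c
            n00 = total - n11 - n01 - n10               # s absent, class is not c
            best = max(best, chi_square(n11, n10, n01, n00))
        if best > threshold:
            selected.add(s)
    return selected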
3 Results

Our approach heavily depends on syntactic parsers that construct dependency trees. While implementing our proposal, we could only find suitable parsers for English, Spanish, and Dutch; therefore, results were obtained only for these three languages (Table 1). Our approach showed relatively good performance for the Dutch language; however, the results for English and Spanish are not as high. Our global results are not as high as those of the other systems. Analyzing the reasons for this performance, we found that the main problem lies in predicting age and gender, while for the personality traits (RMSE) the results are comparable with those of the other competitors.

Table 1. Results of our approach at the PAN 2015 competition

Language   GLOBAL   Age      Gender   RMSE
English    0.5890   0.5845   0.5915   0.1882
Spanish    0.5874   0.5114   0.6591   0.2116
Dutch      0.6798   –        0.5313   0.1716

4 Conclusion

In this paper, we presented our approach for the Author Profiling task at PAN 2015. The main contribution of the approach is that it shows that syntactic n-grams can be used as features to model author characteristics such as gender, age, and personality traits. Considering syntactic n-grams as dimensions in a vector space model and using a supervised machine learning approach, it is possible to tackle the Author Profiling problem.

The preliminary results show that the use of syntactic n-grams along with other tweet-specific features (such as the number of retweets, the frequency of hashtags, the frequency of emoticons, and the usage of URLs) gives good results when predicting personality traits; however, it is not as successful when predicting age and gender.

As our approach exploits information contained in dependency trees, its performance is influenced by the use of external syntactic parsers. Although most syntactic parsers have recently undergone important improvements, they still have several problems when analyzing noisy data. The use of these external tools adds noise to the data, and this is one of the reasons why our approach did not show very good results when processing tweets.

In order to improve the approach, we propose the following steps: (1) to add new heuristics to handle grammatical mistakes in tweets instead of ignoring them, (2) to use a weighting scheme that helps the approach handle imbalanced training data, (3) to combine the proposed features with other features of a different nature (semantic features, lexical features, among others), and (4) to use the soft cosine measure [11] in order to take into account the similarity between pairs of syntactic n-grams.

Acknowledgments. This work was supported by Conacyt project 240844 and SIP-IPN projects 20151406 and 20144274.

References

1. Carrera, J., Castellón, I., Lloberes, M., Padró, L., Tinkova, N.: Dependency grammars in FreeLing. Procesamiento del Lenguaje Natural (41), 21–28 (September 2008)
2. Escalante, H.J., Solorio, T., Montes-y-Gómez, M.: Local histograms of character n-grams for authorship attribution. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 288–298. Association for Computational Linguistics (2011)
3. Kešelj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: Proceedings of the Conference of the Pacific Association for Computational Linguistics, PACLING, vol. 3, pp. 255–264 (2003)
4. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)
5. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014), http://www.aclweb.org/anthology/P/P14/P14-5010
6. Padró, L., Stanilovsky, E.: FreeLing 3.0: Towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012). ELRA, Istanbul, Turkey (May 2012)
7. Posadas-Durán, J.P., Sidorov, G., Batyrshin, I.: Complete syntactic n-grams as style markers for authorship attribution. In: LNAI, vol. 8856, pp. 9–17. Springer (2014)
8. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34(1), 1–47 (2002)
9. Sidorov, G.: Non-continuous syntactic n-grams. Polibits 48(1), 67–75 (2013)
10. Sidorov, G.: Should syntactic n-grams contain names of syntactic relations? International Journal of Computational Linguistics and Applications 5(1), 139–158 (2014)
11. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas 18(3), 491–504 (2014)
12. Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., Chanona-Hernández, L.: Syntactic n-grams as machine learning features for natural language processing. Expert Systems with Applications 41(3), 853–860 (2014)
13. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)
14. Zheng, Z., Wu, X., Srihari, R.: Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter 6(1), 80–89 (2004)