=Paper=
{{Paper
|id=Vol-2696/paper_230
|storemode=property
|title=Celebrity Profiling using Twitter Follower Feeds: Notebook for PAN at CLEF 2020
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_230.pdf
|volume=Vol-2696
|authors=Samantha Price,Abigail Hodge
|dblpUrl=https://dblp.org/rec/conf/clef/PriceH20
}}
==Celebrity Profiling using Twitter Follower Feeds: Notebook for PAN at CLEF 2020==
Celebrity Profiling using Twitter Follower Feeds
Notebook for PAN at CLEF 2020

Abigail Hodge and Samantha Price
Northeastern University
hodge.ab@northeastern.edu, price.sam@northeastern.edu

Abstract. This paper describes our approach to completing the Celebrity Profiling shared task set forth by PAN at CLEF 2020. We discuss the features selected (including part-of-speech tags, named entity types, and word vectors) as well as the logistic regression, random forest, and support-vector models we tested for this task. The resulting confusion matrices and evaluation scores are provided.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

In this paper, we attempted to solve a natural language processing problem set forth by PAN as part of the organization's 2020 competition. At a basic level, the problem required a celebrity profiling ML model that could estimate the age, occupation, and gender of a given celebrity based upon the tweets of their Twitter followers [20]. More specifically, the test input was a list of JSON objects representing the tweets of followers, and the output was a list of JSON objects where each object contained the ID of a given celebrity, their predicted occupation (among a possible list of 'sports', 'performer', 'creator', and 'politics'), their predicted birth year (between 1940 and 1999), and their predicted gender ('male' or 'female'). The most distinctive aspect of this problem was that no information about a celebrity from their own Twitter account or other sources was provided as input for the test dataset; the celebrities had to be profiled solely on the basis of their followers' tweets.

First, we discuss some related work in this field, namely PAN's 2019 celebrity profiling task, which required competitors to profile celebrities based on their own tweets. Next, we discuss our methodology for feature extraction and model building. Finally, we explain the results of our models for age, gender, and occupation. An earlier version of this paper was submitted for an academic project at Northeastern University [7].
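As a concrete illustration of this setup, the following minimal sketch reads one follower feed per line and writes one prediction per celebrity. The newline-delimited JSON layout, the field names, and the predict callable are assumptions made for illustration, not details taken from the task materials.

```python
import json

# Hypothetical illustration of the task input/output described above: each
# input line is assumed to be one JSON object holding a celebrity id and the
# tweets of that celebrity's followers, and each output line is one JSON
# object with the predicted labels. Field names are assumptions.

def profile_feeds(input_path, output_path, predict):
    """Read one follower feed per line and write one prediction per celebrity."""
    with open(input_path, encoding="utf-8") as infile, \
         open(output_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            feed = json.loads(line)                # e.g. {"id": ..., "text": [...tweets...]}
            occupation, birthyear, gender = predict(feed)
            record = {
                "id": feed["id"],
                "occupation": occupation,          # 'sports', 'performer', 'creator', or 'politics'
                "birthyear": birthyear,            # integer between 1940 and 1999
                "gender": gender,                  # 'male' or 'female'
            }
            outfile.write(json.dumps(record) + "\n")
```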
2 Related Work

This task is, on the surface at least, very similar to the 2019 PAN Celebrity Profiling task [19], which required competitors to classify celebrities by their birth year (between 1940 and 2012), gender (male, female, or non-binary), fame (rising, star, or superstar), and occupation (sports, performer, creator, politics, manager, science, professional, religious) based on their own tweets. The 2020 task differed from the 2019 task in a few key ways. First, the fame classification task was not present. Second, the remaining three tasks had fewer categories: birth year was restricted to 1940-1999, the non-binary class was no longer present for the gender category, and the manager, science, professional, and religious classes were no longer present for the occupation category. However, we also had significantly fewer data points to work with. The 2019 task had 48,335 celebrities to use for training; the 2020 task had only 1,920. Furthermore, for the 2020 task, we used data from each celebrity's followers, not the celebrities themselves.

For our feature extraction, we built off of work done by Argamon et al. [1], who examined which features are generally most useful for profiling the author of an anonymous text and had a good deal of success with POS tags. We also built off of our own success with word embeddings [6] for an age profiling task. This will be discussed in greater depth in the next section. Generally, most of the submissions for the 2019 task seemed to find success using classical natural language processing and machine learning techniques. In fact, the three competitors in 2019 who attempted to use deep learning techniques reported that these techniques were not well suited for the task [19]. Therefore, we chose to focus on models that do seem well suited: SVM, logistic regression, and random forest (discussed further in the Algorithms section). Our decisions about which algorithms to select were also influenced by our work on author and time period classification in another course [6].

3 Approach Description

3.1 Feature Extraction

For feature extraction, we decided to utilize features that have proven useful for author profiling problems in the past. Because we were not trying to profile the authors of the tweets, but rather the person whom all of these authors followed in common, we were required to make certain assumptions about the follower/followee relationship. We assumed that the followers of a celebrity might have interests similar to that celebrity's: a follower of a politician might post a lot about politics, a follower of a performer might post a lot about music and concerts, and so on. We also assumed that celebrities might attract followers who are largely of a similar age and gender. Essentially, we decided to treat the aggregate group of tweets as though they were authored by the person we were trying to profile.

A few simple features have proven highly effective for author profiling tasks, namely stop words and part-of-speech (POS) tags. For example, Argamon et al. [1] found that men tend to use more determiners and prepositions, while women tend to use more pronouns, to the degree that these features are given significant weight in a machine learning model. Another effective feature appears to be n-grams [16], but we decided not to use n-grams because our dataset has a large vocabulary, which would require a large number of features to represent. We had relative success with word embeddings on an author profiling task involving age, albeit on books rather than tweets [6]. Therefore, we decided to utilize them again for this project, averaging together the word embeddings for all (in-vocabulary) words in a given celebrity's tweets. Finally, looking at last year's celebrity profiling task results, occupation was generally the lowest-scoring classifier [19], so we added features specifically to improve this classifier: named entity types. This was based on the logic that, for example, politicians would be more likely to talk about countries or organizations, creators would be more likely to talk about art, and so on. Adding named entity recognition to our pipeline significantly slowed down our feature extraction code (it took several minutes to process a single celebrity), but we decided that the boost to classification was worth the extra runtime.

Ultimately, we decided on the following features: POS tags, stop-word count, named entity types, average word vectors, tweet length (in characters), number of links, number of hashtags, number of mentions, and number of emoji [10]. These were all normalized by the total number of words in a celebrity's tweets (with the exception of the word vectors and average tweet length, since those were already averages). Feature extraction of POS tags, stop words, NER types, and word vectors was done using the spaCy library [8]; additional logic was implemented using NumPy [12][17].
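As a rough illustration of this pipeline, the sketch below computes normalized POS-tag, entity-type, and stop-word counts plus an averaged word vector with spaCy, assuming an English model with vectors such as en_core_web_md. The tag and entity subsets are illustrative, and the link, hashtag, mention, and emoji counts described above are omitted for brevity.

```python
import numpy as np
import spacy

# Minimal sketch of the feature extraction described above; the specific tag
# and entity inventories below are assumptions chosen for illustration.
nlp = spacy.load("en_core_web_md")

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "PROPN"]
ENT_TYPES = ["PERSON", "ORG", "GPE", "NORP", "WORK_OF_ART", "EVENT"]

def extract_features(tweets):
    """Return one feature vector for the aggregated tweets of one celebrity's followers."""
    pos_counts = {tag: 0 for tag in POS_TAGS}
    ent_counts = {label: 0 for label in ENT_TYPES}
    stop_words, total_tokens, total_chars = 0, 0, 0
    vectors = []

    for doc in nlp.pipe(tweets):
        total_chars += len(doc.text)
        for token in doc:
            total_tokens += 1
            if token.pos_ in pos_counts:
                pos_counts[token.pos_] += 1
            if token.is_stop:
                stop_words += 1
            if token.has_vector:
                vectors.append(token.vector)
        for ent in doc.ents:
            if ent.label_ in ent_counts:
                ent_counts[ent.label_] += 1

    n = max(total_tokens, 1)
    avg_vector = np.mean(vectors, axis=0) if vectors else np.zeros(nlp.vocab.vectors_length)
    counts = [pos_counts[t] / n for t in POS_TAGS] + \
             [ent_counts[e] / n for e in ENT_TYPES] + [stop_words / n]
    avg_tweet_length = total_chars / max(len(tweets), 1)
    return np.concatenate([np.array(counts), np.array([avg_tweet_length]), avg_vector])
```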
3.2 Algorithms

After extracting features, we chose three machine learning algorithms to train on these features and compared the resulting metrics: logistic regression, random forest, and support-vector machine (SVM). The models were constructed with scikit-learn [13][3]. Each algorithm was implemented as three separate models: one for occupation, one for gender, and one for birth year. Hyperparameter grids were defined for each model, and the optimized parameters were selected through 5-fold cross validation with scoring based on the macro F1 score (functionality also provided by scikit-learn [13][3]). Hyperparameter tuning resulted in higher F1 scores for all three chosen algorithms.

For the logistic regression model, the parameters chosen for tuning were the type of regularization (L1 or L2), the strength of the regularization penalty (0.01, 0.1, or 1), and the type of solver (liblinear or saga). An article by Qiao (2019) [15] inspired the choice of hyperparameters for tuning. The optimized parameters for occupation were (L2, 0.1, saga), the parameters for gender were (L2, 0.1, liblinear), and the parameters for birth year were (L1, 1, saga). The selection of random forest and support-vector machine was inspired by the PAN 2019 celebrity profiling task, where these two models proved successful [19]. The candidate parameters for the random forest classifiers (based on Koehrsen, 2018 [11]) were the number of estimators (50, 100, 500), max depth (None, 5, 10), and max features (auto or log2). Ultimately, the chosen parameters for training the random forest classifiers were (500, None, auto) for occupation, (500, None, auto) for gender, and (50, 5, log2) for birth year. Finally, the regularization penalty (0.01, 0.1, 1), gamma value (0.1, 1, 10), and kernel type (linear, poly, rbf) (Fraj, 2018 [4]) were chosen as the adjustable hyperparameters for the support-vector machine model. The best parameters were determined to be (0.1, 0.1, linear) for occupation, (0.1, 0.1, linear) for gender, and (0.01, 0.1, linear) for birth year.

After running cross-validation and training all classifiers with the extracted features and optimized hyperparameters, metrics were computed for the different classifiers on a portion of the training data (20%) set aside for testing. Additionally, an alternative F1 score (besides the one from scikit-learn [3]) was calculated for birth year, as PAN [20] dictated in the task description that any predicted year within a specific range of the true year would be considered correct (true birth year - m < predicted birth year < true birth year + m); this alternative F1 score takes that window of error into account. Finally, PAN [20] also specified that submissions to the competition would be judged on a special "cRank" metric that combines the F1 scores for occupation, gender, and birth year. Thus, the cRanks for logistic regression, random forest, and SVM were also calculated in this project (results below).
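As a rough illustration of this tuning setup, the sketch below uses scikit-learn's GridSearchCV for the logistic regression grid and shows one possible way to compute the lenient birth-year score. The use of GridSearchCV, the max_iter setting, and the window size m are assumptions for illustration, not details taken from our submission.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# Grid mirroring the logistic regression hyperparameters listed above:
# regularization type, penalty strength, and solver. Grids for the random
# forest and SVM models would be defined analogously.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1],
    "solver": ["liblinear", "saga"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # max_iter raised for convergence (assumption)
    param_grid,
    scoring="f1_macro",  # macro-averaged F1, as used for model selection
    cv=5,                # 5-fold cross validation
)
# search.fit(X_train, y_train)  # X_train, y_train: extracted features and labels

def lenient_birthyear_f1(y_true, y_pred, m=5):
    """Macro F1 where a prediction within +/- m years of the true birth year
    counts as correct. The window size m here is a placeholder, not the value
    actually used by PAN."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    adjusted = np.where(np.abs(y_pred - y_true) <= m, y_true, y_pred)
    return f1_score(y_true, adjusted, average="macro")
```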
4 Results

We defined baseline models using scikit-learn's DummyClassifier class to compare against our trained models [3]. Figure 1 displays the results from the baseline occupation classification model, Figure 2 shows gender classification, and Figure 3 shows birth year classification. Each classification report is a heat map showing the precision, recall, and F1 score for every possible class; lighter colors indicate lower scores and darker colors indicate higher scores. Due to the large number of birth years as possible classes, a visualization of those metrics could not be produced, but a textual description is provided. All of the subsequent visualizations were constructed with Yellowbrick [2] and Matplotlib [9].

Figure 1: Metrics for Baseline Occupation Classifier
Figure 2: Metrics for Baseline Gender Classifier
Figure 3: Metrics for Baseline Birth Year Classifier

4.1 Logistic Regression

This section contains classification reports and confusion matrices for the occupation and gender logistic regression classifiers (Figures 4-7), as well as a textual description of metrics for the birth year classifier. The confusion matrices are also heat maps, where a darker square indicates more predictions and a lighter square indicates fewer predictions. These classifiers were trained with the hyperparameters mentioned in Section 3.2. As can be seen in the figures, the logistic regression occupation classifier significantly outperformed the baseline occupation classifier. Its highest F1 score was 0.832, for the politics class.

Figure 4: Metrics for Logistic Regression Occupation Classifier
Figure 5: Confusion Matrix for Logistic Regression Occupation Classifier

The gender classifier was also more successful than the baseline, with a maximum F1 score of 0.792 (for the male label).

Figure 6: Metrics for Logistic Regression Gender Classifier
Figure 7: Confusion Matrix for Logistic Regression Gender Classifier

Finally, the birth year classifier showed some improvement compared to the baseline classifier, although the number of possible classes and the unbalanced training set made this classifier difficult to train. Its custom F1 score (taking the aforementioned window of error into account) was 0.346.

Figure 8: Metrics for Logistic Regression Birth Year Classifier

4.2 Random Forest

Figures 9-13 display the metric information derived from the occupation, gender, and birth year random forest classifiers. In this case, the metrics are again clearly superior to those from the baseline models. In comparison to the logistic regression occupation model, the random forest occupation model had slightly worse accuracy and a comparable maximum F1 score (0.830, again for politics). The random forest gender classifier had worse accuracy (0.70) and a lower maximum F1 score (again for the male label) than the logistic regression algorithm. The custom F1 score for birth year was essentially the same for both algorithms, with a slight edge for logistic regression.

Figure 9: Metrics for Random Forest Occupation Classifier
Figure 10: Confusion Matrix for Random Forest Occupation Classifier
Figure 11: Metrics for Random Forest Gender Classifier
Figure 12: Confusion Matrix for Random Forest Gender Classifier
Figure 13: Metrics for Random Forest Birth Year Classifier

4.3 Support-Vector Classifier

A support-vector classifier (SVC) was the last algorithm we experimented with. Figures 14-18 display the results of this experimentation. Similar to the random forest classifier, no major improvements in metrics could be seen with the SVC, although it was still more successful than the baseline classifiers.
The occupation classifier achieved accuracy comparable to the random forest classifier, although its highest F1 score (0.798) was the lowest of the three algorithms. The algorithm's performance on gender classification and birth year classification was very close to that of the random forest and logistic regression algorithms.

Figure 14: Metrics for SV Occupation Classifier
Figure 15: Confusion Matrix for SV Occupation Classifier
Figure 16: Metrics for SV Gender Classifier
Figure 17: Confusion Matrix for SV Gender Classifier
Figure 18: Metrics for SV Birth Year Classifier

4.4 Submission

Based on the results described above, it was difficult to choose which combination of models was most effective, as the scores for each category were similar. Table 1 displays the (rounded) F1 score that each classifier produced for each category. Ultimately, the final metric of evaluation was cRank [20]. The cRank value for logistic regression was 0.541, the value for random forest was 0.522, and the value for SVM was 0.535. This result, paired with the relative consistency of the metrics produced by the logistic regression models, indicated that a combination of three logistic regression models was the preferable approach to classify occupation, gender, and birth year. Thus, the software submitted to TIRA [14] contained three logistic regression models to predict labels (utilizing additional Python modules [18][5]) from the tweets of celebrity followers. The final results were: a cRank of 0.577, a birth year F1 score of 0.432, a gender F1 score of 0.681, and an occupation F1 score of 0.707 [20].

                     Occupation  Gender  Birth Year
Logistic Regression  0.754       0.735   0.346
Random Forest        0.744       0.689   0.333
Support-Vector       0.738       0.706   0.349

Table 1: F1 Scores per Category for every Classifier

5 Conclusions

Overall, the three ML algorithms that were chosen produced relatively similar results across the classification categories (occupation, gender, and birth year). The classification of occupation and gender was the most successful, as the final metrics of the trained models were clearly superior to those of the baseline models. Classifying birth year was the most difficult task and yielded lower scores due to the number of possible classes. Comparing the algorithms by cRank, logistic regression appeared to be the most effective. In the future we would like to experiment with a recurrent neural network to determine whether it offers better results than the algorithms we utilized.

References

1. Argamon, S., Koppel, M., Pennebaker, J., Schler, J.: Automatically profiling the author of an anonymous text. Commun. ACM 52, 119–123 (02 2009). https://doi.org/10.1145/1461928.1461959
2. Bengfort, B., Bilbro, R., Danielsen, N., Gray, L., McIntyre, K., Roman, P., Poh, Z., et al.: Yellowbrick (2018). https://doi.org/10.5281/zenodo.1206264, http://www.scikit-yb.org/en/latest/
3. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., Varoquaux, G.: API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning. pp. 108–122 (2013)
4. Fraj, M.B.: In depth: Parameter tuning for SVC (2018), https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769
5. Grant, R.: ndjson 0.3.1. Python module (2018), Python ndjson support
6. Hodge, A., Huang, Z., Price, S.: Classification of time period and author age in fiction (2019), student project at Northeastern University
7. Hodge, A., Price, S.: Artificial intelligence final report: Celebrity profiling 2020 (2020), student project at Northeastern University
8. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)
9. Hunter, J.D.: Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9(3), 90–95 (2007). https://doi.org/10.1109/MCSE.2007.55
10. Kim, T., Wurster, K.: emoji 0.5.4. Python module (2014), identifies emojis in text
11. Koehrsen, W.: Hyperparameter tuning the random forest in Python (2018), https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
12. Oliphant, T.E.: A guide to NumPy, vol. 1. Trelgol Publishing USA (2006)
13. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
14. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
15. Qiao, F.: Logistic regression model tuning with scikit-learn - part 1 (2019), https://towardsdatascience.com/logistic-regression-model-tuning-with-scikit-learn-part-1-425142e01af5
16. Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) Working Notes Papers of the CLEF 2014 Evaluation Labs. CEUR-WS.org (Sep 2014), http://ceur-ws.org/Vol-1180/
17. Van Der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13(2), 22 (2011)
18. Varoquaux, G.: Joblib 0.16.0. Python module (2010), Python parallel computing
19. Wiegmann, M., Stein, B., Potthast, M.: Celebrity Profiling. In: 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Association for Computational Linguistics (Jul 2019)
20. Wiegmann, M., Stein, B., Potthast, M.: Overview of the Celebrity Profiling Task at PAN 2020. In: Working Notes Papers of the CLEF 2020 Evaluation Labs. CLEF and CEUR-WS.org (Sep 2020)