Author Profiling From Images Using 3D Convolutional Neural Networks*

Eduardo Valdez-Rodríguez, Hiram Calvo, and Edgardo Felipe-Riverón
Centro de Investigación en Computación, Instituto Politécnico Nacional
J.D. Bátiz e/ M.O. de Mendizábal, 07738, Ciudad de México, México
jvaldezr1000@alumno.ipn.mx, hcalvo@cic.ipn.mx, edgardo@cic.ipn.mx

Abstract. With this work we participate in the author profiling competition of the MEX-A3T track at IberLEF 2019. The author profiling task aims to identify the gender, occupation and location of Mexican Twitter users from their images or texts. We propose a 3D Convolutional Neural Network that solves this task using visual information: the images extracted from each user's profile are grouped to create a single input per user.

Keywords: Author profiling · 3D CNN · Visual information

* We thank Instituto Politécnico Nacional (SIP, COFAA, EDD, EDI and BEIFI) and CONACyT for their support of this research. Copyright © 2019 for this paper by its authors.

1 Introduction

Author Profiling (AP) consists of identifying demographic information such as age, gender, location, occupation, native language, personality traits, etc. [8], in order to obtain a description of a particular user. This information can be drawn from several sources; in this case, from the texts and images of the user's Twitter profile, given the large quantity of information available in this medium. In this work we focus on obtaining the profile of an author using images only. Relying on this source alone is challenging because of the way users interact on social media: a group of friends, for example, may share images of the same nature on their Twitter accounts. Another issue is related to gender equality: nowadays a man and a woman may share exactly the same images, which makes gender identification complicated. That is why in this work we propose to work with groups of images instead of single images from a user's profile. With these groups of images we aim to capture the history of each user's profile, so that it matters less whether another user posts the same individual image. To classify these groups of images we propose 3D Convolutional Neural Network (CNN) models capable of extracting features from them.

This work is structured as follows: in Section 2 we describe the previous work related to this task; in Section 3 we describe our proposal; in Section 4 we show the results obtained, and finally in Section 5 we draw our conclusions.

2 Previous work

In this work we focus on Mexican Twitter users and on three classes of demographic information: gender, location and occupation. Most previous work on AP has been devoted to the analysis of textual information, disregarding information from other modalities, such as visual information, that could be potentially useful for improving the performance of AP methods.

Some of the first related works, devoted to identifying gender from textual information, were developed by Argamon et al. [3, 4]. They proposed combinations of textual attributes based on lexical features such as content words and function words. Another work, by Ortega-Mendoza et al. [12], uses syntactic features such as personal phrases to detect gender in texts. Merler et al. [11] use images extracted from the user profile and categorize each one with independent classifiers to obtain gender from those categories. Taniguchi et al. [13] categorize
each image of the profile using CNNs and then perform classification over those categories to predict gender. Although Merler et al. [11] and Taniguchi et al. [13] work with images, they rely on an intermediate stage that recognizes certain features in the images, and thus an additional process of labeling and training is required. Similar work was done by Álvarez-Carmona et al. [1]; they use a multimodal approach to identify the gender and age of Twitter users from both texts and single images. To our knowledge, there are no works devoted to author profiling that use images directly, without depending on a categorization of the images prior to training. In the next sections we give a detailed description of our proposal.

3 Proposal

In this section we describe our method, which uses visual information from Mexican Twitter users. We use 3D CNN models and images to identify gender (male and female), location (north, northwest, northeast, south, southeast and center) and occupation (arts, student, social, sciences, administrative, health, sports and others). It is worth mentioning that we are working with Mexican Twitter users, so their native language is Spanish, whereas all the previously mentioned works focused on English-speaking users. Another important point is that, as far as we know, there are no other works trying to predict location and occupation from both textual and visual information. Finally, in this work we use only the visual information provided by the users, that is, the images taken from their Twitter profiles.

From the aforementioned works we take the idea that CNNs are capable of extracting features from single images and classifying them into different classes. We propose a 3D CNN capable of classifying gender, location and occupation at the same time from groups of images instead of single images.

3.1 Dataset

The dataset was provided by the AP task of the MEX-A3T track at IberLEF 2019 [2]. It consists of images and texts extracted from Mexican Twitter users, intended for identifying their gender, occupation and location. To solve this task we use the images only. The dataset contains 11 images per user (in a few cases fewer, depending on the Twitter user); the full dataset consists of 3,500 users, resulting in approximately 38,500 images. In Fig. 1 we show sample images from this dataset.

Fig. 1. Sample images from a Twitter user.

By analyzing the images posted by the users we found that using a single image, as in Álvarez-Carmona et al. [1], is not enough to determine all the labels, because different users, for example a man and a woman, can share the same image. This is why, instead of using a single image, we decided to use the group of images shared by a user. As we are using a 3D CNN, we adapted the dataset by grouping several images into a single volume for each user, as shown in Fig. 2.

Fig. 2. Group of images from a Twitter user as input for the 3D CNN, where p is the number of images per user.
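To make the grouping of Fig. 2 concrete, the following is a minimal sketch of how one input volume per user can be built by stacking that user's images along a new depth axis. The value p = 11 comes from the dataset description; the directory layout, the 64x64 patch size and the padding of profiles with fewer than 11 images are illustrative assumptions of ours and are not specified in the paper.

```python
# Minimal sketch (not the authors' exact pipeline): stack a user's profile images
# into a single volume of shape (p, h, w, 3) so it can be fed to a 3D CNN.
import glob
import numpy as np
from PIL import Image

IMAGES_PER_USER = 11        # p in Fig. 2, as provided by the MEX-A3T dataset
PATCH_H, PATCH_W = 64, 64   # h, w: illustrative values, not reported in the paper

def load_user_volume(user_dir: str) -> np.ndarray:
    """Stack the images found in user_dir into a (p, h, w, 3) float32 volume."""
    paths = sorted(glob.glob(f"{user_dir}/*.jpg"))[:IMAGES_PER_USER]
    frames = [
        np.asarray(Image.open(path).convert("RGB").resize((PATCH_W, PATCH_H)),
                   dtype=np.float32) / 255.0
        for path in paths
    ]
    # Padding short profiles by repeating the last image is our own assumption;
    # the paper only notes that users with fewer images are harder to profile.
    while len(frames) < IMAGES_PER_USER:
        frames.append(frames[-1])
    return np.stack(frames, axis=0)  # (p, h, w, 3)

# A batch of n users then becomes a 5D tensor of shape (n, p, h, w, 3).
# volume = load_user_volume("data/user_0001")
```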
3.2 3D CNN

We use two 3D CNN models based on [5] and [6], which employ 3D CNNs to classify human actions in videos. This kind of CNN works with temporal information: a video can be seen as a sequence of images in time. We do not possess such a temporal relationship, but we have groups of images that can be processed as sequences by the 3D CNN. The representation of a convolutional layer and a fully connected layer is shown in Fig. 3.

Fig. 3. Single-layer representation of the convolutional layer and a fully connected layer.

The 3D CNN Model 1 is shown in Figure 4 and is based on [6], where h is the height of the patch, w is the width of the patch, p is the number of images per user (in this case we decided to use 11 images per group) and n is the batch size. It consists of four convolutional layers to extract features; in this CNN we use 3D kernels of size 2x3x3 and Max-pooling of size 2x2x2; finally, we add a classification stage.

Fig. 4. 3D CNN Model 1.

The 3D CNN Model 2 is shown in Figure 5 and is based on [5], where h is the height of the patch, w is the width of the patch, p is the number of images per user (again 11 images per group) and n is the batch size. It consists of seven convolutional layers to extract features; in this CNN we also use 3D kernels of size 2x3x3 and Max-pooling of size 2x2x2; finally, we add a classification stage.

Fig. 5. 3D CNN Model 2.

The main differences between the two models are the Max-pooling operation performed on the last layers of Model 2, which reduces the input volume only in the h and w dimensions in order to extract features at a lower resolution, and the number of layers used to extract features from the input volume.

Classification stage. We needed to adapt the output of the models so that each profiling setting is classified separately; we use fully connected layers to perform the classification, all of them with ReLU as activation function. As loss function we use the categorical cross-entropy [9], given by Eq. 1:

L(y, \hat{y}) = -\sum_{j=0}^{M} \sum_{i=0}^{N} y_{ij} \log(\hat{y}_{ij})    (1)

where ŷ is the predicted value and y the target value. Categorical cross-entropy compares the distribution of the predictions (the neurons in the output layer, one for each class) with the true distribution, in which the probability of the true class is set to 1 and that of the other classes to 0. The output layer consists of three stages: the first classifies gender (male and female) using 2 output neurons, the second classifies location (north, northwest, northeast, south, southeast and center) using 6 output neurons, and the third classifies occupation (arts, student, social, sciences, administrative, health, sports and others) using 8 output neurons. In Figure 6 we show how each stage is divided.

Fig. 6. Classification stage, where r is the number of output neurons for each profiling task and r1 = 2, r2 = 6 and r3 = 8.

4 Experiments and results

The dataset provided by [2] consists of a training and a test set: the training set contains 3,500 groups of images (one group per user, with 11 images per group) and the test set contains 1,500 groups of images with the same number of images per group. We trained both models on the training set using the Backpropagation (BP) method [10] with the ADAM optimizer [7] to update the weights, with a learning rate of 0.05 and a decay of 0.7; we used 100 iterations and a batch size of 13. The models were trained on two NVIDIA GTX 1080Ti GPUs; training took five hours, and testing a single group of images takes less than a second.
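The paper does not include code; the following Keras sketch only illustrates Model 1 together with the three-headed classification stage and the training settings reported above. The kernel size (2x3x3), pooling size (2x2x2), four convolutional layers, the three output sizes (2, 6, 8), the optimizer, learning rate, number of iterations and batch size come from the paper; the filter counts, the dense-layer width, the 64x64 input resolution, the pooling placement (the last pooling is spatial only, so the depth of 11 images is not exhausted) and the softmax output activations (used instead of the ReLU stated in the paper, since softmax pairs correctly with categorical cross-entropy) are our assumptions.

```python
# Minimal sketch of 3D CNN Model 1 with the three-headed classification stage
# (gender / location / occupation). Filter counts, dense width, 64x64 input,
# pooling placement and softmax outputs are assumptions, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

P, H, W = 11, 64, 64  # p images per user; patch height/width h, w (assumed values)

inputs = layers.Input(shape=(P, H, W, 3))
x = inputs
for filters, pool in [(32, (2, 2, 2)),
                      (64, (2, 2, 2)),
                      (128, (2, 2, 2)),
                      (128, (1, 2, 2))]:  # last pool spatial only, so depth is not exhausted
    x = layers.Conv3D(filters, kernel_size=(2, 3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=pool)(x)

x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)

# Three classification stages (Fig. 6): r1 = 2, r2 = 6, r3 = 8 output neurons.
gender = layers.Dense(2, activation="softmax", name="gender")(x)
location = layers.Dense(6, activation="softmax", name="location")(x)
occupation = layers.Dense(8, activation="softmax", name="occupation")(x)

model = models.Model(inputs, [gender, location, occupation])
model.compile(
    # The paper also reports a decay of 0.7; the exact schedule is unspecified, so it is omitted.
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
    loss={"gender": "categorical_crossentropy",
          "location": "categorical_crossentropy",
          "occupation": "categorical_crossentropy"},
    metrics=["accuracy"],
)

# Training as reported in the paper: 100 iterations (epochs) with a batch size of 13.
# model.fit(x_train, {"gender": y_g, "location": y_l, "occupation": y_o},
#           epochs=100, batch_size=13)
```

Model 2 would differ only in using seven convolutional layers of the same kernel size and additional pooling in its last layers.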
As mentioned before, we participated in the AP task of the MEX-A3T track at IberLEF 2019. We compare our method with that of CerpamidTeam, whose results were also made public for this contest; they use textual information to make their predictions. In Tables 1 to 5 we report the evaluation of our 3D CNN models and present the comparison with CerpamidTeam. We use the metrics given by the organizers of the contest: F-measure, Accuracy (Acc), Precision (P) and Recall (R).

Table 1. Comparison between our method and CerpamidTeam on gender.

Method            F(P, R)   Acc      P        R
CerpamidTeam 1    0.8366    0.8347   0.8412   0.8347
CerpamidTeam 2    0.8331    0.8313   0.8392   0.8313
3D CNN, Model 2   0.5227    0.5227   0.5230   0.5227
3D CNN, Model 1   0.4711    0.4840   0.4823   0.4840

Table 2. Comparison between our method and CerpamidTeam on location.

Method            F(P, R)   Acc      P        R
CerpamidTeam 1    0.5020    0.6307   0.5214   0.4934
CerpamidTeam 2    0.4844    0.6153   0.4988   0.4766
3D CNN, Model 1   0.1482    0.2393   0.1645   0.1573
3D CNN, Model 2   0.1098    0.1687   0.1674   0.1846

Table 3. Accuracy comparison between our method and CerpamidTeam on each location.

Method            center   southeast  northwest  north    northeast  west
CerpamidTeam 1    0.6889   0.3864     0.6667     0.1667   0.7091     0.2890
CerpamidTeam 2    0.7021   0.3984     0.6651     0.2535   0.7297     0.2634
3D CNN, Model 1   0.4177   0.0430     0.0952     0.0260   0.2066     0.1008
3D CNN, Model 2   0.3606   0.0000     0.0867     0.0491   0.1377     0.0250

Table 4. Comparison between our method and CerpamidTeam on occupation.

Method            F(P, R)   Acc      P        R
CerpamidTeam 1    0.3957    0.6587   0.4331   0.3744
CerpamidTeam 2    0.3824    0.6613   0.4201   0.3619
3D CNN, Model 1   0.1196    0.2693   0.1235   0.1284
3D CNN, Model 2   0.0951    0.2300   0.1402   0.1260

Table 5. Accuracy comparison between our method and CerpamidTeam on each occupation.

Method            others   arts     student  social   sciences  sports   admin    health
CerpamidTeam 1    0.1026   0.2514   0.8597   0.5695   0.1982    0.3077   0.5134   0.2564
CerpamidTeam 2    0.1333   0.3333   0.8648   0.5508   0.2075    0.3500   0.4785   0.2469
3D CNN, Model 1   0.0000   0.0902   0.4495   0.1390   0.0654    0.0000   0.2131   0.0000
3D CNN, Model 2   0.0000   0.1181   0.4394   0.0701   0.0408    0.0000   0.0926   0.0000

4.1 Discussion

As the previous tables show, predicting the profile of a Twitter user using only images is a difficult task, owing to the general-purpose nature of images on this platform; even when inspecting the images directly it is hard to classify them. Nevertheless, we obtain better results with 3D CNN Model 2 on gender and with 3D CNN Model 1 on location and occupation. As far as we know, there are no other participants who solved the task using images, so our results are not directly comparable with the methods that use text. Also, both models work with volumes of images instead of the single images used in previous work. One disadvantage of our method is that it requires the same number of images per user; if a user has fewer images, we are not able to determine his or her profile precisely.

5 Conclusions and future work

AP using only visual information, such as images from Twitter, is a difficult task due to the ambiguity of the images and their source. Although we did not obtain high scores, we developed a new framework that obtains the author profile using images only. We also created a framework that uses groups of images instead of single images, avoiding the problem of images shared between users; additionally, our models classify three different aspects of the profile simultaneously and directly from the images.
As future work, we plan to create a multimodal 3D CNN model able to combine visual and textual information.

References

1. Álvarez-Carmona, M.A., Pellegrin, L., Montes-y Gómez, M., Sánchez-Vega, F., Escalante, H.J., López-Monroy, A.P., Villaseñor-Pineda, L., Villatoro-Tello, E.: A visual approach for age and gender identification on Twitter. Journal of Intelligent & Fuzzy Systems 34(5), 3133–3145 (2018)
2. Aragón, M.E., Álvarez-Carmona, M.Á., Montes-y Gómez, M., Escalante, H.J., Villaseñor-Pineda, L., Moctezuma, D.: Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets. In: Notebook Papers of the 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Bilbao, Spain, September (2019)
3. Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, genre, and writing style in formal written texts. Text - Interdisciplinary Journal for the Study of Discourse 23(3), 321–346 (2006)
4. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Mining the blogosphere: Age, gender and the varieties of self-expression. First Monday 12(9) (2007)
5. Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (T-CNN) for action detection in videos. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5822–5831 (2017)
6. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 221–231 (2010)
7. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), 401–412 (2002)
9. LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In: Advances in Neural Information Processing Systems. pp. 396–404 (1990)
10. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.R.: Efficient backprop. In: Neural Networks: Tricks of the Trade, pp. 9–48. Springer (2012)
11. Merler, M., Cao, L., Smith, J.R.: You are what you tweet...pic! Gender prediction based on semantic analysis of social media images. In: 2015 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2015)
12. Ortega-Mendoza, R.M., Franco-Arcega, A., López-Monroy, A.P., Montes-y Gómez, M.: I, me, mine: The role of personal phrases in author profiling. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 110–122. Springer (2016)
13. Taniguchi, T., Sakaki, S., Shigenaka, R., Tsuboshita, Y., Ohkuma, T.: A weighted combination of text and image classifiers for user gender inference. In: Proceedings of the Fourth Workshop on Vision and Language. pp. 87–93 (2015)