Multi Regressor Based User Rating Predictor for ImageCLEF Aware 2022 Aarthi Suresh Kumar1 , Anirudh A1 , Jeet Golecha M1 , Karthik Raja A1 , Bhuvana Jayaraman 1 and Mirnalinee T T 1 1 Sri Sivasubramaniya Nadar College of Engineering, Chennai, Tamil Nadu, India Abstract Every one of the public nowadays have their presence in social media networks. The profile information of the social media account helps to understand nature of the user. Images, that are part of the profile information mostly characterizes the user and reveals much more about the user than the textual information. Such information extracted are used in many applications namely the employers, credit scoring, etc. This work has proposed Random forest regressor, Extra tree regressor and a dense neural network model for online user data scoring. Three submission using these models were made to the ImageClef Aware 2022 [1] task and has obtained 0.139 as Pearson Correlation Coefficient for testing. Keywords Multi Output Regressor, Random Forest, Extra Trees, Neural Network, User Rating 1. Introduction According to a recent report, people are uploading data online at the rate of 1.8 billion images per day. This statistic adds up to around 657 billion photos [2] every year [3]. Most of these image files are in social networking platforms which can be accessed publicly. However, the owners of these digital images are often unaware of the fact that third parties could access them for a plethora of unethical reasons. Examples include the practice of obtaining information of potential employees by employers and using a user’s online data to obtain an automatic credit score. Existing methods rate the information a user uploads online. For instance, Bargh et. al. [4] explored the implications of public user data in the area of user privacy. The paper outlined how user data could be used to derive sensitive information about a user. It also introduced a feedback system from the data recipients to the data disseminators to curb the issue of leaking private information. Other similar approaches focus on inferring user characteristics and their practical utility is rather limited. This paper aims to develop a more data-centric approach to solving the problem of online user CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy $ aarthi19003@cse.ssn.edu.in (A. S. Kumar); anirudh19015@cse.ssn.edu.in (A. A); jeetgolecha19043@cse.ssn.edu.in (J. G. M); karthikraja19048@cse.ssn.edu.in (K. R. A); bhuvanaj@ssn.edu.in (B. J. ); mirnalineett@ssn.edu.in (M. T. T. ) € https://www.ssn.edu.in/staff-members/dr-j-bhuvana/ (B. J. ); https://www.ssn.edu.in/staff-members/dr-t-t-mirnalinee/ (M. T. T. )  0000-0002-9328-6989 (B. J. ); 0000-0001-6403-3520 (M. T. T. ) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) data scoring. It explores the efficacy of two classes of models, namely, regression models and deep learning models to predict the pertinence of a user’s data to the following situations [5]: 1. Bank Loan 2. Accommodation 3. Applying to a job as a waitress/waiter 4. Applying to a job in IT The regression based models include the Random Forest Regressor, Extra Trees Regressor and the Mutli-Output Regressor. A dense neural network was the deep learning model used for the user data feedback system. Of these models, the Random Forest Regressor performed the best, with a validation error of 0.49. The regression class on models performed better than the deep learning model. 2. Task and Dataset ImageCLEF Aware 2022 [6]deals with developing model to predict the user ratings [7] for four distinct situations given the scores of different visual concepts. The models are expected to provide rankings for user test profiles that are as close as possible to the human rankings. The dataset has 1000 user profiles, each having 100 photos that were annotated along with an appeal score via crowd sourcing for the real life scenarios listed earlier. Each profile is rated globally [8] for every situation using a Likert scale of 7 that ranges from strongly unappealing to strongly appealing. Ground truth was created after averaging and normalizing the appeal score, which was then used for ranking the users in situation that are modeled. Prediction files, which contain visual concepts associated with each user, constitute the training data. Gt_files, which contain the the appeal score for each user for each real-life situation. A file with the score for each visual concept was provided as well. Incorporating the scores of each visual concept did not change the result. 3. Methodologies 3.1. Data Preprocessing Prior to applying the machine learning and deep learning techniques, some preprocessing techniques were applied. The location of the visual concepts and the scores for each real-life situation were concatenated and made into a stacked matrix for each user. The cases involved not adding some of this features to reduce diverging, but all patterns gave similar results on training accuracy. 3.2. Regression Models 3.2.1. Random Forest Regressor Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training Figure 1: Random Forest time. For classification tasks, The class chosen by majority of the trees will be the output class. A Random Forest [9] as an ensemble approach of decision trees, constructs as many trees in a random fashion as shown in in Figure 1. Each and every tree is constructed with different feature samples for splitting and at each node with different set of rows. Predictions are made by each tree which are combined / averaged together to give a single prediction for classification. 3.2.2. Extra Trees Regressor Extra Trees is an ensemble machine learning algorithm that combines the predictions from several decision trees. It is a commonly used random forest algorithm. Although it uses a simpler approach where the invidual members are the decision trees, it can often yield similar or better results than the random forest algorithm. Both the Random Forest Regressor and the Extra Trees Regressor are tree algorithms. The difference is that the Random Forest Regressor uses resampling and the Extra Trees Regressor uses original data to create the random forest of decision trees. 3.2.3. Multi Output Regressor The two models discussed above output only a single real value. Hence, they are modified to produce multiple outputs using the Multi Output Regressor(MOR) function. The MOR function runs the Random Forest and Extra Tree Regresors 4 times to get the values for each of the real-life situations. 3.3. Neural network Model A dense neural network was also explored. A dense neural network consists of dense layers. A dense layer is one that is connected to every neuron of its preceding layer. The dense neural network used for this task consists of 7 dense layers. The input is flattened into a 3000 point vector before passing it into the first layer of the dense neural network. The output of the dense neural network is a 4-point vector. The architecture of the deep learning model is shown in Figure 2. Figure 2: Deep Neural Learning Model 3.4. Training and Validation Set In this section, we present a concise anaylsis of the two best models: Random Forest Regressor and Extra Trees Regressor. The training accuracy of the former was less than the latter. This can be attributed to the fact that the Random Forest Regressor uses the concept of bootstrap re-sampling, bringing in new data that can diverge from actual data for training. On the other hand, the validation accuracy of the Random Forest Regressor was better than that of the Extra Trees regressor by approximately 0.01%. 4. Tested Models The regression and deep learning models were tested. The regression class of models performed better than the dense neural network model. The 7 layer dense neural network had a validation accuracy of 0.15. It followed the same preprocessing techniques as the regression models. We suspect that lack of data can attributed to this poor accuracy. Hence, we had to alter our model to a much simpler neural network that can work with smaller amount of data. 5. Hardware used A Google Colab notebook was used to train the model. A general purpose RAM size of 8GB was alloted with a 2.3GHz Intel Xenon CPU. 6. Code The resources used by JBTTM for CLEF aware, including the research papers, exploratory data analysis, and code can be found here: https://github.com/AAnirudh07/CLEF-2022 7. Result The Pearson Correlation Coefficient is a measure of linear correlation between two sets of data. The formula is provided Equation 1. Σ(𝑥𝑖 − 𝑥 ¯ )(𝑦𝑖 − 𝑦¯) 𝑟 = √︀ (1) ¯ )2 Σ(𝑦𝑖 − 𝑦¯)2 Σ(𝑥𝑖 − 𝑥 Team JBTTM had a maximum observed Pearson Correlation Coefficient of 0.139 for the two out of three submissions made. This correlation with Random Forest Regressors and the dense neural network. 8. Conclusion An attempt was made to score the user profile using the visual contents availble in the social network account. Two regressor methods and one dense neural network were used for this purpose . Of the 3 submissions that team JBTTM made, the regression model had the best accuracy based on the metrics proposed by CLEF. The accuracy of the 7 layer dense neural network model was inferior to the machine learning models and, no further improvements were made to it. In conclusion, machine learning models are more suitable for the task of user data rating than deep learning models. These results may also be attributed to the lack of training data. References [1] B. Ionescu, H. Müller, R. Peteri, J. Rückert, A. Ben Abacha, A. G. S. de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, S. Kozlovski, Y. D. Cid, V. Kovalev, L.-D. Ştefan, M. G. Constantin, M. Dogariu, A. Popescu, J. Deshayes-Chossart, H. Schindler, J. Chamberlain, A. Campello, A. Clark, Overview of the ImageCLEF 2022: Multimedia retrieval in medical, social media and nature applications, in: Experimental IR Meets Multi- linguality, Multimodality, and Interaction, Proceedings of the 13th International Conference of the CLEF Association (CLEF 2022), LNCS Lecture Notes in Computer Science, Springer, Bologna, Italy, 2022. [2] V.-K. Nguyen, A. Popescu, J. Deshayes-Chossart, Unveiling real-life effects of online photo sharing, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2898–2908. [3] T. Atlantic, How many photographs of you are out there in the world?, 2014. URL: https://www.theatlantic.com/technology/archive/2015/11/ how-many-photographs-of-you-are-out-there-in-the-world/413389/. [4] M. Bargh, P. Conradie, S. Choenni, R. Meijer, Privacy protection in data sharing: Towards feedback based solutions, volume 2014, 2014. doi:10.1145/2691195.2691279. [5] P. Li, Z. Wang, Z. Ren, L. Bing, W. Lam, Neural rating regression with abstractive tips generation for recommendation, in: Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, 2017, pp. 345–354. [6] A. Popescu, J. Deshayes-Chossart, H. Schindler, B. Ionescu, Overview of the imageclef 2022 aware task, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 13th International Conference of the CLEF Association (CLEF 2022), LNCS Lecture Notes in Computer Science, Springer, Bologna, Italy, 2022. [7] P.-Y. Hsu, Y.-H. Shen, X.-A. Xie, Predicting movies user ratings with imdb attributes, in: International Conference on Rough Sets and Knowledge Technology, Springer, 2014, pp. 444–453. [8] N. Armstrong, K. Yoon, Movie rating prediction, Technical Report, Citeseer, 1995. [9] A. Ajesh, J. Nair, P. Jijin, A random forest approach for rating-based recommender system, in: 2016 International conference on advances in computing, communications and informatics (ICACCI), IEEE, 2016, pp. 1293–1297.