=Paper= {{Paper |id=Vol-3180/paper-114 |storemode=property |title=SSN CSE at ImageCLEFaware 2022: Contextual Job Search Feedback Score based on Photographic Profile using a Random Forest Regression Technique |pdfUrl=https://ceur-ws.org/Vol-3180/paper-114.pdf |volume=Vol-3180 |authors=Aarthi Nunna,Aravind Kannan Rathinasapabathi,Chirag Bheemaiah P K,Kavitha Srinivasan |dblpUrl=https://dblp.org/rec/conf/clef/NunnaRKS22 }} ==SSN CSE at ImageCLEFaware 2022: Contextual Job Search Feedback Score based on Photographic Profile using a Random Forest Regression Technique== https://ceur-ws.org/Vol-3180/paper-114.pdf
          SSN CSE at ImageCLEFaware 2022: Contextual Job Search
        Feedback Score based on Photographic Profile using a Random
                        Forest Regression Technique
    Aarthi Nunna1, Aravind Kannan Rathinasapabathi2, Chirag Bheemaiah P K3 and Kavitha Srinivasan4
    1,2,4 Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam – 603110, India.
    3 Department of IT, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam – 603110, India.


                  Abstract
                  Social networks have become increasingly popular, with millions of users, and a person’s
                  digital presence has become crucial to how their character is judged. Employers tend to screen
                  their candidates’ profiles on social media to understand their personalities and to infer the
                  candidates’ eligibility for a specific job. To address this issue, the ImageCLEF forum has been
                  conducting a task since 2021 to quantify the effect of the photographic profile, and we
                  participated this year. We developed an algorithm that scores the images of a user and
                  provides comprehensive feedback on the consequences of those images for selected
                  professions. The approach uses Random Forest Regression and resulted in a Pearson
                  Correlation Coefficient of 0.544.

                  Keywords 1
                  Supervised learning, Random Forest Regression Algorithm, Social photographic profile-based
                  score, Pearson Correlation Coefficient

1. Introduction

    In today’s world, the digital presence of humans has become more pivotal than their physical presence.
With the ease of internet access, everybody is more digitally conscious than social. As a result, social
networking is undoubtedly an integral part of life. Every day, millions of users upload content such as
images, posts, and stories on platforms like Instagram, Twitter, Facebook, etc. The actions and posts on
social media can have real-time effects on the physical world; they can even affect a person’s ability to acquire
a job. So, it has become crucial to learn the effects of visual media uploaded on various social platforms, as
explained by Van-Khoa Nguyen et al. [1]. If users are digitally responsible and disciplined when uploading
visual media, society as a whole benefits, because their digital presence will not have
adverse effects on their career and its growth.
    ImageCLEFaware 2022 is the second edition of the aware task conducted by the CLEF Initiative.
The task asked participants to provide a global rating of each profile in each situation using a Likert scale.
The 2021 edition [2] provided 500 user profiles in its dataset, as opposed to 1000 user profiles
in this edition. The forum has taken up this socially significant issue for the second
time and put together an anonymized dataset of users along with the pictures they posted,
for us to analyze the real-world effect in four selected situations, namely bank loan,
accommodation, jobs as a waitress/waiter, and jobs in IT. The final objective of the task would be to integrate
the model with a mobile application so that users can obtain their feedback efficiently and easily.
    Our team strived to develop an algorithm that provides feedback to users resembling the feedback
given by humans. The dataset used is a subset of the YFCC100M [3] dataset. It comprises various user
profiles, each containing a maximum of one hundred images. A thousand user profiles were used to
train our model. The objects present in the images are initially detected by a Faster-RCNN model, resulting
in a confidence score for each object detected. Thus, our models take as input a JSON file that
comprises the objects detected along with their confidence scores and bounding boxes. We experimented with
the algorithms explained in Sections 3 and 4 and found Random Forest Regression to be the best-performing
algorithm.

1
  CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
EMAIL: aarthi19002@cse.ssn.edu.in (A. 1); aravindkannan19022@cse.ssn.edu.in (A. 2); chiragbheemaiahpk19025@it.ssn.edu.in (A. 3);
kavithas@ssn.edu.in (A. 4);
ORCID: 0000-0002-5944-5712(A. 1); 0000-0003-3783-372X (A. 2); 0000-0003-3315-2994 (A. 3); 0000-0003-3439-2383 (A. 4);
                 ©️ 2022 Copyright for this paper by its authors.
                 Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                 CEUR Workshop Proceedings (CEUR-WS.org)
      The remainder of the paper is organized as follows: Section 2 describes the dataset provided by
ImageCLEF, Section 3 presents the various models that were used to obtain the required outputs, and
Section 4 compares the models and discusses the inferences obtained. Finally, the findings are summarized
in the conclusion and future work section.

2. Dataset
   The dataset was provided by the ImageCLEF forum, split into three categories, namely training,
validation, and testing data. The testing data was used to obtain the final output file, which was submitted for
evaluation. Table 1 describes the files in the dataset and what was observed in each of them.

Table 1
Dataset description

   File Provided                   Data Provided                             Observed

   Class_scores.json               Each visual concept detected has a        Scores for visual concepts
                                   score depicting its influence on the      80 and 215 are unavailable.
                                   four professions.

   Prediction_train.json,          The three files are the input files       Each file contains the users’
   Prediction_val.json,            for the three respective dataset          photographic profiles, comprising
   Prediction_test.json            categories. The train and validation      each user, their respective
                                   input files are used to train our         images and the objects detected.
                                   various models. The test input file
                                   was used to obtain the output file
                                   submitted for evaluation.

   Gt_train.json, Gt_val.json      The final output ranks of each user’s     Each file comprises each user and
                                   profile with respect to the four          four values that determine how the
                                   professions chosen.                       social profile of the user would
                                                                             affect his/her career choice.



3. System Design

   The system design of the developed model is visually represented in Figure 1. As observed, there are
multiple components that serve as the input for the model, which are obtained from the dataset explained
in Table 1.
Figure 1: System Design

    Figure 1 depicts a high-level abstraction of the proposed system design. The system takes the dataset as
input, pre-processes it, and then trains a model on it. The model is then evaluated against the ground
truth values using a set of performance metrics: the Mean Absolute Error (MAE), Mean Squared Error
(MSE) and Root Mean Squared Error (RMSE). Based on these values, the model’s performance was
improved by varying its inputs and parameters.
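The three error metrics named above can be written out in a few lines. The following is a minimal sketch in plain Python; the function names and the sample values are illustrative assumptions and are not taken from the task data or from our implementation.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Illustrative values only, not taken from the task data.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 5.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE is by definition the square root of MSE, the two rows of a metrics table can be cross-checked against each other; for instance, an MSE of 0.20515 corresponds to an RMSE of about 0.45294.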
    The input data consists, first, of the class scores, which depict the influence of an object
on each profession and may be either positive or negative. Secondly, the user profiles, along with
their images and the objects within them, are fed as input. The data was pre-processed to fit the
various machine learning models: the JSON files are read into data frames for easy processing. The final
inputs to the machine learning model are varied, and the resulting performance is observed and tabulated,
as explained in the following subsections.
    The output JSON file contains the model’s predictions of the scores associated with each
user’s photographic profile, considering the user’s images and the objects within them. The expected
outputs given in the ground truth files are used to calculate the error of the trained model, on the basis of
which changes to the model are made and analyzed.

3.1     Random Forest Regression

   The Random Forest algorithm [4] uses the bagging model, as visualized in Figure 2. This means
that subsets of the dataset, sampled with replacement, are used to train various decision trees, and the
final output is obtained by aggregating the trees’ predictions: by majority voting for classification and,
as in this regression task, by averaging. With this bootstrap aggregation, Random Forest was observed
to be the best algorithm amongst its competitors, XGBoost [5] and ANN.
   Described below are the different versions of the Random Forest model, which differ in the
input features that are fed to them. The models’ accuracies are measured using the error values
obtained.




Figure 2: Random Forest Regression
     Figure 2 explains the working of Random Forest Regression, which
 utilizes the concept of bagging. As depicted in the figure, subsets of the input data are used to train
 different decision trees, whose predictions are averaged to obtain the final prediction.
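The bagging idea described above can be illustrated with a small, dependency-free sketch. It is not our implementation: it uses one-split trees (stumps) on a single input feature and omits the random feature selection that a full Random Forest adds, so it shows only bootstrap sampling with replacement followed by averaging. All function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D inputs by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample is identical: predict the mean
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged_trees(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_bagged(trees, x):
    """Aggregate by averaging the individual tree predictions."""
    return mean(tree(x) for tree in trees)

xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5  # toy step function, not task data
trees = fit_bagged_trees(xs, ys)
print(predict_bagged(trees, 0), predict_bagged(trees, 9))
```

Each tree sees a slightly different sample, so individual trees may disagree; averaging their outputs reduces the variance of the final prediction.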

3.1.1. Model 1
    In this approach, the input was defined as the ‘average confidence score’ along with the ‘average
 impact scores for each of the classes’ for a given user. A random forest regressor was defined with the
 ‘number of estimators’ parameter varied from 10 to 1000. This regressor was fit on 80% of the training
 data, with the remaining 20% reserved for testing. ‘Mean Squared Error’ was used as the loss function to
 grade the performance of the result. The same procedure was adopted for the validation data. All the
 models were evaluated using the Pearson Correlation Coefficient [7]. Figure 3 illustrates the
 model’s performance on the training data.
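One possible way to build the Model 1 input is sketched below. The data layout is an assumption made for illustration: a profile is taken to be a list of images, each image a list of detections with hypothetical keys `class_id` and `confidence`, and the class scores a mapping from class id to four impact values (one per situation). The actual JSON files may be organized differently.

```python
from statistics import mean

def model1_features(profile, class_scores, n_situations=4):
    """Return [mean confidence, mean impact for each situation] for one user.

    profile      -- list of images; each image is a list of detections shaped
                    like {"class_id": int, "confidence": float} (assumed layout)
    class_scores -- dict: class_id -> list of n_situations impact scores
                    (assumed layout)
    """
    detections = [det for image in profile for det in image]
    if not detections:
        return [0.0] * (1 + n_situations)
    features = [mean(det["confidence"] for det in detections)]
    # Concepts without a score (Table 1 notes 80 and 215 are unavailable) are skipped.
    scored = [class_scores[det["class_id"]] for det in detections
              if det["class_id"] in class_scores]
    for k in range(n_situations):
        features.append(mean(s[k] for s in scored) if scored else 0.0)
    return features

# Toy data, not taken from the task files.
class_scores = {1: [0.5, -0.5, 1.0, 0.0], 2: [-0.5, 0.5, 0.0, 1.0]}
profile = [
    [{"class_id": 1, "confidence": 0.75}, {"class_id": 80, "confidence": 0.25}],
    [{"class_id": 2, "confidence": 0.5}],
]
print(model1_features(profile, class_scores))
```

The resulting five-element vector per user (one confidence value plus four impact values) is what a regressor with four target outputs would be fit on under these assumptions.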




                          Figure 3: Training Data

    The number of estimators for the regressor was set to 650. This regressor was then applied
 to the testing data, and the resulting metrics are given in Table 2. The Pearson correlation
 coefficient was calculated to be 0.288.
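For reference, the Pearson correlation coefficient between predicted and reference scores can be computed as follows. This is a plain-Python sketch with a hypothetical function name and toy values; it is not the official evaluation script of the task.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Toy values, not task data.
print(pearson([1, 2, 3, 4], [1, 3, 2, 4]))
```

A value of 1 indicates a perfect increasing linear relation, 0 no linear relation, and -1 a perfect decreasing one, so higher values mean the predicted ratings track the reference ratings more closely.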

Table 2
Model 1 Metrics

              Metrics                         Training Dataset               Validation Dataset
      Mean Absolute Error (MAE)                   0.36241                         0.37192
      Mean Squared Error (MSE)                    0.20515                         0.23345
      Root Mean Squared Error (RMSE)              0.45294                         0.48317



3.1.2. Model 2

    In addition to the confidence score and the average impact scores, the objects detected were also
 provided as input. They were represented using a matrix whose index represents the object and whose
 value at that index represents the count of the object.
    To find the optimal number of estimators, an approach similar to that of the previous model was adopted.
 Figure 4 illustrates the model’s performance over the training and validation phases.
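The object-count representation described for Model 2 can be sketched as a fixed-length vector per user. The detection layout (`class_id` key), the assumption that class ids run from 0 to `num_classes - 1`, and the function name are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def object_count_vector(profile, num_classes):
    """Count how often each object class occurs across all of a user's images."""
    counts = Counter(det["class_id"] for image in profile for det in image)
    return [counts.get(class_id, 0) for class_id in range(num_classes)]

# Toy profile with two images, not task data.
profile = [
    [{"class_id": 1}, {"class_id": 3}],
    [{"class_id": 1}],
]
print(object_count_vector(profile, num_classes=5))
```

Stacking one such vector per user yields a users-by-objects matrix, which can be concatenated with the averaged scores to form the full input.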
                           Figure 4: Training Data


    The number of estimators for the regressor was set to 650. This regressor was then applied
 to the testing data, and the resulting metrics are given in Table 3. The Pearson correlation
 coefficient was calculated to be 0.544.

Table 3
Model 2 Metrics

              Metrics                         Training Dataset                Validation Dataset
      Mean Absolute Error (MAE)                   0.371921                         0.328819
      Mean Squared Error (MSE)                    0.166127                         0.176763
      Root Mean Squared Error (RMSE)              0.407587                         0.420433



3.1.3. Model 3

     Having ascertained that Random Forest Regression is the best fit model for the problem, parameter
 tuning was performed to optimize the model. The following parameters were considered and set with
 different options using grid search [8] as follows:
      1. 'bootstrap': [True, False],
      2. 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
      3. 'max_features': ['auto', 'sqrt'],
      4. 'min_samples_leaf': [1, 2, 4],
      5. 'min_samples_split': [2, 5, 10],
      6. 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
     The options were then validated using a 3-fold cross-validation approach to determine the leading
 model. The optimal parameters retrieved were: 'bootstrap': True, 'max_depth': 50, 'max_features': 'auto',
 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 2000. The Pearson correlation coefficient
 was calculated to be 0.542, and Table 4 tabulates the corresponding metrics.
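A grid search of this kind can be sketched with scikit-learn as shown below. The sketch makes several assumptions: the data is synthetic, the search is run on a much smaller grid so that it finishes in seconds, the scoring function is chosen for illustration, and 'auto' is written as 1.0 because recent scikit-learn releases no longer accept 'auto' for `max_features` (for a regressor both select all features). It is not the exact script used for Model 3.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Grid as listed in the paper, with 'auto' replaced by 1.0 (see the note above).
paper_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
    "max_features": [1.0, "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": list(range(200, 2001, 200)),
}
n_candidates = len(ParameterGrid(paper_grid))
print(n_candidates, "candidates,", 3 * n_candidates, "fits with 3-fold CV")

# Reduced grid and synthetic four-target data (assumptions) to keep the run short.
small_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 2],
    "n_estimators": [10, 20],
}
X, y = make_regression(n_samples=60, n_features=5, n_targets=4,
                       noise=0.1, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,                               # 3-fold cross-validation
    scoring="neg_mean_squared_error",   # higher (less negative) is better
)
search.fit(X, y)
print(search.best_params_)
```

The full grid contains 2 x 12 x 2 x 3 x 3 x 10 = 4320 parameter combinations, i.e. 12960 model fits under 3-fold cross-validation, which explains why an exhaustive search of this size is computationally expensive.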

Table 4
Model 3 Metrics

              Metrics                         Training Dataset                Validation Dataset
      Mean Absolute Error (MAE)                   0.333413                         0.339431
      Mean Squared Error (MSE)                    0.185843                         0.193637
      Root Mean Squared Error (RMSE)              0.431096                         0.440042

3.1.4. Model 4

     As in the above versions, an additional feature was added in order to improve the model’s accuracy.
 In the dataset used for training, the objects in the images are identified along with their confidence
 scores and the coordinates of their bounding boxes. Given the coordinates, the area of each
 bounding box was calculated using simple geometry, as the boxes are always rectangular. The area
 of the bounding box was used because, along with the confidence score, it accounts for the importance
 (weight) of the object in the image.
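The area computation can be sketched as follows. The corner-coordinate box format `(x_min, y_min, x_max, y_max)` is an assumption about the detection output, and the area-times-confidence weight is only one hypothetical way of combining the two quantities; the exact combination used for Model 4 is not prescribed here.

```python
def bbox_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    # Clamp at zero so a degenerate or inverted box never yields a negative area.
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)

def weighted_importance(box, confidence):
    """Hypothetical importance weight: box area scaled by detection confidence."""
    return bbox_area(box) * confidence

# Toy detection, not task data: a 40 x 60 pixel box detected with confidence 0.5.
print(bbox_area((10, 20, 50, 80)), weighted_importance((10, 20, 50, 80), 0.5))
```

Larger boxes then contribute more than small background objects detected with the same confidence.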
     Figure 5 illustrates the model’s performance on the training data. The Pearson correlation coefficient
 was 0.519, and Table 5 tabulates the corresponding metrics.




                           Figure 5: Training Data


Table 5
Model 4 Metrics

              Metrics                          Training Dataset                Validation Dataset
      Mean Absolute Error (MAE)                    0.343690                         0.337131
      Mean Squared Error (MSE)                     0.198958                         0.189130
      Root Mean Squared Error (RMSE)               0.446047                         0.420433



4. Implementation and Results

    In this section, the aforementioned machine learning models are trained, and the corresponding
 results are compared and analyzed using performance metrics, namely the Mean Squared Error, Mean
 Absolute Error and Root Mean Squared Error.

4.1     System Specification

    The hardware and software used for the implementation of the machine learning
 models were: an Intel i7 processor with an NVIDIA MX100 2GB graphics card, 8GB RAM, 1TB disk
 space, Windows 11 OS, Jupyter Notebook, and Python 3.7 with the required libraries such as Sklearn,
 Tensorflow, Numpy and Pandas.

4.2     Results of Machine Learning Models

    Since the problem warrants a multivariate regression approach with multiple target variables,
 XGBoost, Artificial Neural Network and Random Forest Regression models were considered. Thorough
 experiments were conducted, and the results are given in Table 6.

 Table 6
 Model accuracy comparisons

                ML Models                 Training Dataset (MSE)          Validation Dataset (MSE)
        Random Forest Regression                    0.1661                         0.1767
               XGBoost                              0.1870                         0.1966
         Artificial Neural Network                  0.1761                         0.2160


    From the above data it can be inferred that the Random Forest Regression model is well suited for
 the problem statement and hence was chosen as the baseline model for the given task.

    The accuracies of the regression models were evaluated on the test set based on the Mean Squared
 Error (MSE) metric. In Run 1, ‘Model 1’ was trained with the number of estimators set to 650,
 after iterating over values in the range [10, 1000]. This model took as input the
 ‘average confidence score’ and the ‘average impact scores for each of the classes’. In Run 2, ‘Model 2’ took
 the same inputs but additionally accounted for the objects detected in a user’s profile. In Run 3,
 the hyperparameters of ‘Model 2’ were tuned to achieve better performance, giving ‘Model 3’. In Run 4,
 besides the inputs given to ‘Model 2’, the area of the bounding boxes of the objects detected was
 calculated and used as an input, giving ‘Model 4’.

 Table 7
 Brief description about each run

      Run Number        Approach        MSE-Training         MSE-Validation     Pearson’s Correlation Coefficient

            1            Model 1           0.20515              0.23345                 0.288
            2            Model 2           0.166127             0.176763                0.544
            3            Model 3           0.185843             0.193637                0.542
            4            Model 4           0.198958             0.189130                0.519



    From the results tabulated in Table 7, it is evident that the inclusion of the objects
 detected per user was a key factor in improving the prediction performance. Furthermore, it was
 observed that hyperparameter tuning did not affect the performance of the model substantially.

5. Conclusion and Future Works

    This paper presents a solution for the ImageCLEFaware 2022 task, which aims to generate
 contextual feedback scores for a user’s social profile and its influence on their job
 prospects. The paper describes the various models that were implemented and compares their
performance. It was observed that the Random Forest Regression model performed better than
the other models, XGBoost and ANN. Following this observation, the inputs given to the Random
Forest Regression model were varied and different Random Forest models were developed. These
models were compared based on Mean Absolute Error, Mean Squared Error and Root Mean Squared
Error. Model 2 fared better than the other models as per the Pearson Correlation Coefficient value.
This could be a direct consequence of feeding the objects detected as an input
feature. Hence, this can be worked on further to enhance its correlation with the required output.

    In the future, the dataset can be made more diverse to cover more edge cases and thus aid in developing
a more robust algorithm. Other ensemble learning algorithms can be experimented with to arrive at
firmer conclusions that will help improve the model’s performance. In addition, fine-tuning the
hyperparameters of the algorithms, such as epochs and learning parameters, together with cross-validation,
can increase the efficiency and improve the results that are currently obtained.

6. Acknowledgements

   We express our deep gratitude to the CLEF Initiative labs for coming up with the problem
statement for us to work on and for giving us timely assistance. We learnt a lot during the contest
thanks to ImageCLEF 2022 Aware [9,10], and we are indebted to its organizers. We thank
AI4Media for supporting this task. We are grateful to the YDSYO Team for sharing with us the anonymized
dataset. We would also like to take this opportunity to thank our college, Sri Sivasubramaniya Nadar
College of Engineering, Department of Computer Science and Engineering, for motivating us with the
opportunity to work on this task.

7. References

[1] Van-Khoa Nguyen, Adrian Popescu, and Jérôme Deshayes-Chossart. "Unveiling Real-Life Effects
    of Online Photo Sharing." IEEE WACV 2022.
[2] Bogdan Ionescu, Henning Müller, Renaud Péteri, Asma Ben Abacha, Mourad Sarrouti, Dina
    Demner-Fushman, Sadid A. Hasan, Serge Kozlovski, Vitali Liauchuk, Yashin Dicente, Vassili
    Kovalev, Obioma Pelka, Alba García Seco de Herrera, Janadhip Jacutprakart, Christoph M.
    Friedrich, Raul Berari, Andrei Tauteanu, Dimitri Fichou, Paul Brie, Mihai Dogariu, Liviu Daniel
    Ştefan, Mihai Gabriel Constantin, Jon Chamberlain, Antonio Campello, Adrian Clark, Thomas A.
    Oliver, Hassan Moustahfid, Adrian Popescu, Jérôme Deshayes-Chossart, Overview of the
    ImageCLEF 2021: Multimedia Retrieval in Medical, Nature, Internet and Social Media
    Applications, in Experimental IR Meets Multilinguality, Multimodality, and Interaction.
    Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021),
    Bucharest, Romania, Springer Lecture Notes in Computer Science LNCS, September 21-24, 2021.
[3] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland,
    Damian Borth, Li-Jia Li. “YFCC100M: The New Data in Multimedia Research” (2016) URL:
    https://doi.org/10.48550/arXiv.1503.01817
[4] Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001). URL:
    https://doi.org/10.1023/A:1010933404324
[5] Chen, Tianqi, and Carlos Guestrin. "Xgboost: A scalable tree boosting system." In Proceedings of
    the 22nd acm sigkdd international conference on knowledge discovery and data mining, (2016):
    785-794.
[6] Afroz Chakure, Random Forest Regression, Medium Article. URL:
    https://miro.medium.com/max/1400/0*f_qQPFpdofWGLQqc.png
[7] Kirch W, Pearson’s Correlation Coefficient. Encyclopedia of Public Health. Springer, Dordrecht
    (2008). URL: https://doi.org/10.1007/978-1-4020-5614-7_2569
[8] Petro Liashchynskyi, Pavlo Liashchynskyi. “Grid Search, Random Search, Genetic Algorithm: A
    Big Comparison for NAS” (2019) URL: https://doi.org/10.48550/arXiv.1912.06059
[9] Bogdan Ionescu, Henning Müller, Renaud Péteri, Johannes Rückert, Asma Ben Abacha, Alba
    García Seco de Herrera, Christoph M. Friedrich, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-
    Yaghir, Henning Schäfer, Serge Kozlovski, Yashin Dicente Cid, Vassili Kovalev, Liviu-Daniel
    Ștefan, Mihai Gabriel Constantin, Mihai Dogariu, Adrian Popescu, Jérôme Deshayes-Chossart,
    Hugo Schindler, Jon Chamberlain, Antonio Campello, Adrian Clark, Overview of the ImageCLEF
    2022: Multimedia Retrieval in Medical, Social Media and Nature Applications, in Experimental
    IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 13th International
    Conference of the CLEF Association (CLEF 2022), Springer Lecture Notes in Computer Science
    LNCS, Bologna, Italy, September 5-8, 2022.
[10] Adrian Popescu and Jérôme Deshayes-Chossart and Hugo Schindler and Bogdan Ionescu,
     Overview of the ImageCLEF 2022 Aware Task in Experimental IR Meets Multilinguality,
     Multimodality, and Interaction. Proceedings of the 13th International Conference of the CLEF
     Association (CLEF 2022), Lecture Notes in Computer Science LNCS, Springer, Bologna, Italy,
     September 5-8, 2022.