Ensembled Approach for Web Search Result Diversification Using Neural Networks Shreya Sriram1 , Madhuri Mahalingam1 , Sarah Aymen Naseer1 , Shajith Hameed1 , Rahul Rajagopalan1 , Sai Shashaank R1 , Lekshmi Kalinathan1 and Prabavathy Balasundaram1 1 Sri Sivasubramaniya Nadar College of Engineering, Chennai, Tamil Nadu, India Abstract Result diversification provides a broader view of a topic, while maximizing the chances of retrieving relevant information. It avoids the bias in results, thus improving the user experience. This area finds a lot of applications in web searches and recommendation systems. The existing literature on this domain has achieved good accuracy on smaller datasets and using single models. An ensemble approach, using three neural network models, has been proposed to improve the existing predictions using a bigger dataset. Keywords Result diversification, Ensembling methods, Grid Search, Voting Regressor 1. Introduction Diversification of results is one of the most important trends in the areas of web searches, recommendation systems and structured databases. With the development of image resources in searches, retrieving diverse and relevant results for a query has become a challenging task. This is due to the requirement that the retrieved images should satisfy various semantic intents of the queries, based on the visual attributes and features of an image. Context information, such as captions, descriptions, and tags, provides opportunities for image retrieval systems to improve their result diversification. [8] The objective of result diversification is to provide user satisfaction, improve productivity, reduce bias or homogeneity in results and to be able to cater to alternate interpretations of a query. Diversification is relevant in image retrieval since it avoids retrieval of superficial results, it provides multifaceted results for a query. eg: If a user searches for pictures of cars, the system needs to retrieve images of different kinds of cars, taken at different locations and time of the day from various angles. Result diversification also guesses the intent of the query and obtains satisfactory results, for ambiguous and incomplete queries. [9] There are existing methods that use late fusion systems as shown in Table 1. The approach CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy $ shreya19106@cse.ssn.edu.in (S. Sriram); madhuri19057@cse.ssn.edu.in (M. Mahalingam); sarah19100@cse.ssn.edu.in (S. A. Naseer); shajith2010537@ssn.edu.in (S. Hameed); rahul2010222@ssn.edu.in (R. Rajagopalan); saishashaank2010084@ssn.edu.in (S. S. R); lekshmik@ssn.edu.in (L. Kalinathan); prabavathyb@ssn.edu.in (P. Balasundaram) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) Table 1 Existing Work Existing Work Proposed Methodology Performance Domain Independent System Fu- Cross Space Fusion Layers,Attention F1 score: 0.2823 sion [3] Layers are used with Dense Layers Differential Evolution-Based Fusion Normalised Dis- Differential evolution-based fusion for Results Diversification of Web counted Cumulative (DE) Search [6] Gain (NDCG): 0.5476 Fusion-Based Methods for Result Linear Fusion F1 score: 0.3987 Diversification on the web [7] taken in [3] uses Deep Neural Networks architectures as the primary ensembling learner with various network configurations that use dense and attention layers. Cross- Space-Fusion layer has been used here to manipulate the newly created spatial information. When compared with the current state of the art and traditional ensembling approaches, the proposed model showed significant improvements, by a margin of at least 38.58%. Search result diversification of text documents is necessary when a user issues an ambiguous query to the search engine. Such ambiguous queries require a diversified resultant list that includes documents that are relevant to as many different types of subtopics as possible. A group of fusion-based result diversification methods with novel methods of weight assignment for linear combination is proposed in [7]. This aims to improve performance that considers both relevance and diversity. The study [6] uses data fusion for result diversification and it investigates how to use differential evolution to learn weights for the linear combination method. The latest developments in this field, using deep neural networks as the primary ensembling method has shown major improvements over the traditional ensembling methods by increasing the performance of the individual inducers [3, 4, 5]. The existing methods work more efficiently only under certain conditions therefore, it is of vital importance to come up with new computer vision and deep learning methods that can enable achieving diverse and appropriate search results for all kinds of data. In this context ImageCLEF [1], a benchmarking activity on the cross-language annotation and retrieval of images in the Conference and Labs of the Evaluation Forum (CLEF) has proposed the ImageCLEFfusion task [2]. 2. Task and Dataset Description Result diversification fusion task [2] aims to maximise the chances of retrieving relevant answers that correspond to the query. In the context of image retrieval, an inducer is generally responsible for retrieving a set of relevant images for the given query id. Output of the inducer consists of the relevant images, their similarity scores and ranks. However, a single inducer is disadvantageous for application in certain areas due to low precision and lack of performance. Hence, this task involves the use of ensembling to overcome this scenario. Ensembling is a technique which aggregates the predictions of several inducers. The training data is fed into three or more models and their predictions are then combined to obtain a final prediction using a fusion algorithm (ensembling). The ensembled system is expected to yield a better performance compared to the highest performing individual inducer. The data for this task is obtained from the Retrieving Diverse Social Images Task dataset [Ionescu2020]. The outputs of 56 inducers, representing a total of 123 queries (topics) are stored in separate text files. Each entry or row in these files is of the format as given below in the Table 2. Table 2 Attributes of Inducer file Fields Representation query_id unique id of the query inter ignored value photo_id unique photo id rank photo rank sim similarity score run_name name of inducer 3. Methodologies Used Different networks namely, Multilayer Perceptron, Ridge Regressor using Grid Search and Keras Regressor using Sequential model were studied to learn the patterns of the outputs of the inducers. 3.1. Multilayer Perceptron Regressor Neural networks are mathematical structures that are formed with a neuron as the fundamental element. Artificial neurons are arranged in layers and coupled to build neural networks. Multi Layer Perceptron (MLP) network is the composition of neurons as shown in Figure 1. It is a feed forward neural network made up of successive layers that communicate and exchange information via synaptic connections represented by an adaptive weight. The structure of a multilayer network includes multiple layers of perceptrons. The input layer has number of perceptions same as the number of data attributes, an output layer with one perceptron in the case of regression, and all other layers are considered to be hidden. The information flows unidirectionally, from input layer to output layer, through the hidden layers. The hidden layers are the computation engine of the MLP. The weight adjustment training is done via backpropagation. In this method, an error is calculated when the network output is compared to the expected output. The error is then propagated back through the network, one layer at a time, and the weights are modified based on their contribution to the error. Backpropagation is a method of repeatedly adjusting the weights in order to minimize the difference between the actual and desired output. In the regression scenario, activation function will not be applied for the output of the dense layer. Hence, this output will serve as the predicted one. Figure 1: Multi layer Perceptron model 3.2. Ridge Regressor using Grid Search method The Ridge regression model is a linear regression model upon which Grid Search is applied to find the hyperparameters. Grid search is a parameter tuning method in which a model is built and evaluated for each set of algorithm parameters specified in a grid. The method calculates the performance of all combinations of the specified hyperparameters and their values and gives the best option. Hyperparameters are the values that are manually set before training. If these are set appropriately then the performance of the model can be improved. To minimize the overfitting with the ordinary Grid search, stratified cross-validation is applied where samples are divided into K-folds at random. An iterative approach is used to divide the training data into k parts. In each iteration, one division is kept for testing, and the remaining k-1 partitions are used to train the model. In the next iteration, the next partition is the test data and the remaining k-1 is the train data. The model performance is recorded and average of results is provided. The advantage of this method is that it gives less biased results compared to test - train split. The GridSearchCV model from Scikit Learn1 is used to get the parameters. Grid Search is easy to implement and is reliable. 3.3. Keras Regressor using Sequential Model Regression is implemented using the KerasRegressor 2 class in Keras, which is applied over a Sequential model . A sequential model is a linear stack of layers, where each of the layers is a 1 https://scikit-learn.org/stable/ 2 https://keras.io/ neural network layer with exactly one input vector of n-dimensions and an output vector of n-dimensions. These vectors of n-dimensions are also called tensors or n-dimension matrices. The sequential model consists of an input, multiple hidden and output dense layers as shown in Figure 2. A dense layer is a regular deeply connected neural network layer that on an input returns or outputs the activated sum of the dot product of the inputs with the kernels or weights and the bias. 𝑜𝑢𝑡𝑝𝑢𝑡 = 𝑎𝑐𝑡𝑖𝑣𝑎𝑡𝑖𝑜𝑛(𝑑𝑜𝑡(𝑖𝑛𝑝𝑢𝑡, 𝑤𝑒𝑖𝑔ℎ𝑡) + 𝑏𝑖𝑎𝑠), The dense layer comprises of neurons or input nodes that are activated based on an activation function. An activation function decides if a neuron or node should be activated or not. The activation function is responsible for transforming the summed weighted input from the node into the activation of the node or output for that input at each layer. These functions are used to enhance the performance of deep learning models. The simplest activation function is referred to as the linear activation, where no transform is applied at all. However, nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data. The widely used non-linear activation functions include ReLu, softmax and sigmoid. The bias used in the calculation of the output is a constant value or vector that is added to the weighted sum. It helps in shifting the result of the activation function towards the positive or negative side , in other words it’s used to offset the result obtained. This addition of bias introduces flexibility and better generalization to the neural model. Figure 2: Sequential Neural Model 3.4. Voting Regressor Given the similarity scores of the m inducers corresponding to n queries, it is essential to apprehend the evaluation of similarity scores by the inducers. In order to maximize the similarity scores, voting regressor has been adapted as shown in Figure 3. A Voting Regressor is an ensemble meta-estimator which fits several base regressors on the whole dataset. Figure 3: Voting Regressor It consists of three different predictors namely, MLP Regressor (M1 ), Ridge Regressor using Grid Search (M2 ) and Keras Regressor using Sequential model (M3 ) for the evaluation of the results. Performance of theses models are provided in terms of Mean Absolute Errors (MAE). The regressors are ranked based on their MAE with the help of Rank Assigner. These ranks and the models are provided as input for the Voting Regressor[10][11] which will consider the weight of the occurrences of predicted values before averaging. Further, the performance of the voting regressor can again be measured with MAE. The predictions of the voting regressor will be considered to be improved, if the MAE of it is lesser than the other predictors. 4. Implementation The given dataset was split into 90% for training and 10% for validation data. The similarity score column is extracted from the dataset and is normalized to ensure similar data distribution, to achieve faster convergence. This serves as our input and output. Three predictor models M1 , M2 and M3 were built to study how similarity scores are assigned. The model M1 was implemented using sklearn.neural_network, which consists of an input layer of 5 neurons, 2 hidden layers of 6 & 5 neurons each and an output layer of 1 neuron. The random state is set to 5 to avoid random split of data at each iteration. A constant learning rate of 0.01 is initialized to control the step size in updating the weights. The model M2 was created to solve the regression using the hyperparameters tuned by Grid Search. A search space with all possible hyperparameters is defined by Grid Search. Cross validation of the data is performed 3 times by splitting the data into 10 folds to obtain multiple iterations of training and testing on the data. The best model is chosen and the performance of the cross validated is evaluated by setting the scoring parameter of GridSearchCV as neg_mean_absolute_error. The model M3 was created using an input layer of 5 neurons, 2 dense layers of output dimensions 5 and 1, stacked over each other. The optimizer function is set as Stochastic Gradient Descent (SGD) with a learning rate of 0.0008. The data is sampled into batches of size 5, such that every set of 5 inputs is used to predict the next input’s similarity score, for every predictor. The above predictors are trained with 90% of the training set and validated with the remaining training set. The predicted values are compared with the actual values of the inputs and the mean absolute deviation is calculated as error. Ranks have been assigned for the models M1 , M2 and M3 based on the error values. The Voting Regressor is constructed with the ranks obtained and the 3 predictor models. The regressor is trained and tested on the entire training and validation data respectively. In the training phase, the output of the voting regressor for every batch is the prediction for the next input. This is compared with the actual value and error is calculated as the difference between these values. Further, it is tested on the test data provided and the original similarity score of the data is replaced with the improved score predicted by the model. 5. Results and Analysis Validation dataset is used to test the models and voting regressor is used to predict the similarity scores. These predicted and the actual similarity score values were plotted for each one of them as shown in Figures 4,5,6,7. It is seen from figures 4 & 5, that the model M2 predicts better when compared to the model M1 . Further, it is also seen from Figure 7, that the Voting Regressor predicted the similarity score better when compared to the model M2 , as it utilized the weighted occurrences of predicted values before averaging. Table 3 shows the MAE and rank values for the base and voting regressor models. The voting regressor model has been tested with the CLEF test data and metrics - F1 measure and cluster recall are used to compare and analyze the performance of the results thus obtained. Cluster recall is a metric that assesses how many different clusters from the cluster labels are Figure 4: Actual and Predicted Values of Similarity Score using MLP Regressor Figure 5: Actual and Predicted Values of Similarity Score using Ridge Model using Grid Search Figure 6: Actual and Predicted Values of Similarity Score using Keras Regressor using Sequential model represented. A cluster recall of 0.4384 has been observed among the top 20 results. F1 measure is the harmonic mean of cluster recall and precision, where precision measures the number of relevant images among the top 20 results. An F1 measure of 0.5604 has been obtained for the top 20 results. The voting regressor is used to predict the updated similarity score values for the testing data which contains 175,591 entries. 10 different variations of the voting regressor were built by varying the parameters, iteration size. Table 4 illustrates the F1 scores and CR scores evaluated Figure 7: Actual and Predicted Values of Similarity Score using Voting Regressor Table 3 Results of the base and voting regressor models Models MAE Rank M1 (Keras Regressor using Sequen- 0.626 1 tial Model) M2 (MLP Regressor) 0.041 3 M3 (Ridge Regressor using Grid 0.032 2 Search) Voting Regressor 0.030 - Table 4 F1 score and Cluster Recall rates of 10 best runs Run F1@20 CR@20 No score score 1 0.4316 0.3167 2 0.5095 0.4053 3 0.5398 0.4276 4 0.4929 0.3906 5 0.5563 0.4332 6 0.4963 0.3743 7 0.5533 0.4341 8 0.5604 0.4373 9 0.5547 0.4384 10 0.5568 0.4362 for 10 best file submissions. 6. Conclusion In order to improve the predictions of the results of the inducers, the proposed ensemble model was implemented using three neural networks as base regressors. The model was trained on data from 56 different inducers, containing 167,139 training values and tested on data from 55 inducers, containing 175,591 testing values. The base regressors obtained MAE values of 0.626, 0.041 and 0.032 each. The ensemble method obtained an improved MAE score of 0.030. Among the ten best submissions, the best F1 score and CR score are 0.5604 and 0.4384 respectively. References [1] Bogdan Ionescu, Henning Müller, Renaud Péteri, Johannes Rückert, Asma Ben Abacha, Alba García Seco de Herrera, Christoph M. Friedrich, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Serge Kozlovski, Yashin Dicente Cid, Vassili Kovalev, Liviu-Daniel S, tefan, Mihai Gabriel Constantin, Mihai Dogariu, Adrian Popescu, Jérôme Deshayes-Chossart, Hugo Schindler, Jon Chamberlain, Antonio Campello, Adrian Clark, Overview of the ImageCLEF 2022: Multimedia Retrieval in Medical, Social Media and Nature Applications, in Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 13th International Conference of the CLEF Association (CLEF 2022), Springer Lecture Notes in Computer Science LNCS, Bologna, Italy, September 5-8, 2022. [2] Liviu-Daniel S, tefan, Mihai Gabriel Constantin, Mihai Dogariu and Bogdan Ionescu. Overview of the ImageCLEFfusion 2022 Task: Ensembling Methods for Media Interesting- ness Prediction and Result Diversification, in CLEF2022 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), Bologna, Italy, September 5-8, 2022. [3] Constantin, Mihai Gabriel, Liviu-Daniel Ştefan, and Bogdan Ionescu. "DeepFusion: Deep ensembles for domain independent system fusion." International Conference on Multimedia Modeling. Springer, Cham, 2021. [4] Constantin, M.G., S¸tefan, L.D., Ionescu, B., Duong, N.Q., Demarty, C.H., Sj¨oberg, M.: Visual interestingness prediction: A benchmark framework and literature review. Interna- tional Journal of Computer Vision pp. 1–25 (2021). [5] S¸tefan, L.D., Constantin, M.G., Ionescu, B.: System fusion with deep ensembles. In: Pro- ceedings of the 2020 International Conference on Multimedia Retrieval (ICMR 2020). pp. 256–260. Association for Computing Machinery (ACM) (2020). [6] Xu, Chunlin, Chunlan Huang, and Shengli Wu. "Differential evolution-based fusion for results diversification of web search." International Conference on Web-Age Information Management. Springer, Cham, (2016) [7] Wu, Shengli, et al. "Fusion-based methods for result diversification in web search." Infor- mation Fusion 45 (2019): 16-26. [8] Zheng, K., Wang, H., Qi, Z. et al. A survey of query result diversification. Knowl Inf Syst 51, 1–36 (2017). [9] Drosou, Marina, and Evaggelia Pitoura. "Search result diversification." ACM SIGMOD Record 39.1 (2010): 41-47. [10] Maxwell, D., Azzopardi, L. Moshfeghi, Y. The impact of result diversification on search behavior and performance. Inf Retrieval J 22, 422–446 (2019). [11] Sagi, Omer, and Lior Rokach. "Ensemble learning: A survey." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.4 (2018): e1249.