Movie Recommendation System*

Weronika Wołowczyk1,∗,†, Ewa Szymik1,†

1 Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

Abstract
The goal of the project is to deliver personalized movie suggestions based on user preferences by analyzing and processing a dataset of movies. The project's primary stages are data cleaning (some unnecessary dataset columns were removed), exploratory data analysis (several visualizations of dataset characteristics were presented), and building a recommendation system based on soft set theory. The core of the project is a recommendation system that makes movie suggestions based on user input. Users are asked to state their preferences concerning actors, genres, and keywords. A soft set-based classification method is then applied to score and rank the films according to these preferences. The system calculates a total score for each movie based on its attributes, ultimately providing the top five propositions. Methods are also introduced for recommending the five movies most similar to a given title and for predicting movie ratings from their features using the k-nearest neighbours (KNN) algorithm. The first method searches for the most similar movies based on their attributes; the second predicts a movie's rating by analyzing the votes of the k films with the most similar features. Overall, the project demonstrates the application of algorithmic techniques and machine learning methods: soft sets to provide personalized suggestions, and the k-nearest neighbours algorithm to analyse data and predict data attributes.

Keywords
movie recommendation, soft set, data analysis, personalized recommendations, data preprocessing, KNN algorithm, vote prediction

1. Introduction

The increase in the number of films produced each year makes it difficult for viewers to choose a movie that best suits their taste.
With the abundance of streaming services, users have a rich library of content available to them, and choosing what to watch can become increasingly complicated. This is why recommendation systems are needed: to provide personalized content and thus improve the user experience by recommending only the movies that might spark interest and match each viewer's individual preferences. Collaborative filtering, content-based recommendation, and hybrid methods are the typical techniques employed by existing recommendation systems. Collaborative filtering depends on the preferences of other like-minded users, whereas content-based filtering recommends items to the user based on item descriptions. Hybrid methods combine both approaches to leverage their strengths and mitigate their weaknesses.

*IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author
† These authors contributed equally.
ww308053@student.polsl.pl (W. Wołowczyk); es308045@student.polsl.pl (E. Szymik)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

This paper explores the use of soft set theory as an approach to movie recommendation systems. Soft sets, introduced by Molodtsov in 1999 [1], are mathematical models used for reasoning under conditions of uncertainty and vagueness. Unlike traditional sets, where an element either belongs to the set or does not, soft sets allow for partial membership, with elements having degrees of belonging. This degree is typically represented by a value between 0 and 1, indicating how strongly an element is associated with the set. Soft sets are particularly useful in fields such as decision-making and artificial intelligence, where uncertainty and vagueness are common.
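As a minimal illustration of the concept, a soft set can be sketched as a mapping from parameters to subsets of a universe. The film identifiers, genre parameters, and function names below are invented for this example; they are not taken from the paper's implementation:

```python
# A soft set (F, E) over a universe U, sketched as a plain dictionary.
# All data here is invented example data.
U = {"f1", "f2", "f3", "f4"}              # universal set of films
E = {"adventure", "drama", "comedy"}      # set of parameters (genres)

# F : E -> P(U): each parameter maps to the subset of films that have it
F = {
    "adventure": {"f1", "f2", "f3"},
    "drama": {"f2", "f4"},
    "comedy": {"f3"},
}

def membership(film, parameter):
    """Binary membership: 1 if the film belongs to F(parameter), else 0."""
    return 1 if film in F.get(parameter, set()) else 0

print(membership("f2", "drama"))    # f2 is a drama film -> 1
print(membership("f1", "comedy"))   # f1 is not a comedy -> 0
```

A degree of belonging between 0 and 1 can replace the binary value when partial membership is needed; the methodology below achieves a similar effect by weighting the binary memberships.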
In the context of movie selection, they provide a flexible classification method that can accommodate the varied nature of user preferences.

The KNN (k-nearest neighbours) algorithm is a simple, non-parametric method for classification tasks in machine learning. It operates on the principle of proximity: it finds the k objects in feature space that are closest to the element currently being tested; because of this feature similarity, they are called neighbours. Neighbours are drawn from the set of objects used to train the algorithm, and the resulting class is the one represented by the largest number of neighbours. Most often, the distance between elements is calculated using the Euclidean or Manhattan metric. The KNN classifier is used, firstly, to recommend the 5 movies most similar to the one provided by a user, and secondly, to predict movie ratings based on the vote_average attribute of similar movies. The steps of the algorithm are normalizing the data, splitting the data into a training set and a test set, fitting the model, and evaluating the accuracy of the predictions to determine the effectiveness of the model.

Overall, the main contribution of this project is a personalized recommendation system that leverages user-defined preferences for genres, actors, and keywords. By applying soft set theory, the system calculates a total score for each movie based on its alignment with user preferences and displays the five films with the highest score, resulting in customized movie suggestions. The second contribution is the recommendation of the 5 most similar movies: users input a title, and the system finds movies that are similar feature-wise. The third contribution is the prediction of movie ratings from their features: the system finds similar movies in the training set and, based on their vote_average attribute, predicts the ratings of movies from the test set. The latter two functionalities use the k-nearest neighbours algorithm.

2.
Methodology

Soft set methodology offers a flexible approach for handling uncertainty and making decisions based on multiple parameters. Its simplicity and adaptability make it a powerful tool for various applications, including recommendation systems.

2.1. Soft Set

A soft set (F, E) over a universal set U is a pair where F is a mapping given by F : E → P(U). Here, E is a set of parameters, and P(U) denotes the power set of U. For each parameter e ∈ E, F(e) is a subset of U.

2.2. Mathematical Model

• Step 1: Define the Universal Set U
  Let U represent a universal set containing elements that need to be analyzed and categorized. In a movie recommendation system, U is a set of films:
  U = {f1, f2, f3, ..., fn}
• Step 2: Define the Set of Parameters E
  Parameters E define the attributes relevant to the elements in U. These parameters could be movie genres:
  E = {action, drama, comedy, adventure, ...}
• Step 3: Define the Mapping F
  The mapping F associates each parameter e ∈ E with a subset of U. For instance, if the parameter is "adventure", F(adventure) might include the films classified as adventure films:
  F(adventure) = {f1, f2, f3}
  F(drama) = {f2, f4}

2.3. Constructing the Soft Set

• Step 4: Construct the Soft Set (F, E)
  The soft set is constructed by pairing each parameter with its corresponding subset:
  (F, E) = {(adventure, {f1, f2, f3}), (drama, {f2, f4}), ...}

2.4. Decision-Making Using Soft Sets

• Step 5: Represent the Soft Set in a Binary Table
  The soft set can be represented in a binary table for easier analysis. Each row corresponds to an element of U, and each column corresponds to a parameter in E. An entry is 1 if the element is associated with the parameter, and 0 otherwise.

  U     adventure   drama   comedy
  f1        1         0        0
  f2        1         1        0
  f3        1         0        1
  ...      ...       ...      ...

• Step 6: Calculate Selection Values
  Assign a weight to each parameter to reflect its importance. Multiply the binary values by these weights and sum them up for each element of U.
This gives a selection value indicating the relevance of each element with respect to the given parameters.
• Step 7: Determine the Best Choice
  The elements with the highest selection values are considered the best choices for the given parameters. This can be used for recommendations.

2.5. Computational Example

• Class U: U = {f1, f2, f3, ..., fn}, where the fi are films.
• Set of parameters E defining movie genres:
  E = {action, drama, crime, adventure, science-fiction, thriller, fantasy, western, animation, ...}
• Set of considered parameters A:
  A = {adventure, fantasy, animation} = {e1, e2, e3}
• There are 6 films in the class U: U = {f1, f2, f3, f4, f5, f6}
• Assumed mapping F:
  F(e1) = {f1, f2, f3, f6}
  F(e2) = {f1, f4, f6}
  F(e3) = {f1, f3, f6}
• Soft set (F, A):
  (F, A) = {(adventure, {f1, f2, f3, f6}), (fantasy, {f1, f4, f6}), (animation, {f1, f3, f6})}
• ci — selection value of object fi ∈ U, computed as ci = Σj dij
• dij = wj × fij — entry of the weighted table, where wj ∈ (0, 1] is the weight of parameter ej and fij is the binary membership value

  U     adventure (w1 = 0.8)   fantasy (w2 = 0.3)   animation (w3 = 0.9)   Selection value
  f1             1                     1                      1             c1 = 2.0
  f2             1                     0                      0             c2 = 0.8
  f3             1                     0                      1             c3 = 1.7
  f4             0                     1                      0             c4 = 0.3
  f5             0                     0                      0             c5 = 0
  f6             1                     1                      1             c6 = 2.0

Table 1: Binary table with weights assigned to parameters: 0.8 for adventure, 0.3 for fantasy, 0.9 for animation.

• The table shows that the films best matching the selection parameters are f1 and f6.
• The same calculations are performed for actors and keywords.

Visualization of the recommendation system:

Figure 1: Input of example preferences and matching results

2.6. K-nearest neighbours

The KNN algorithm is a simple classifier that finds the k elements in a given dataset that are most similar to the test element. It follows these steps:

1. Data Collection: Gathering the training data used to build the model. Each data point is represented by a set of features and its corresponding class to be predicted.
In this project, the data points are movies from the database, and the class to be predicted is the vote_average value of the movie.
2. Determining the Value of Parameter K: The parameter K specifies how many nearest neighbours are considered during the classification of a new data point. Choosing an appropriate value of K can significantly impact the effectiveness of the model. In the movie recommendation system, k takes values from 2 to 9.
3. Calculating Distances: For a new data point whose class is to be predicted, distances to all points in the training set are calculated. This determines the similarity between points. In the project, the Manhattan metric is used:
   d(x, y) = Σi |xi − yi|
4. Selecting the K Nearest Neighbours: The next step is to select the K training points closest to the point currently being tested.
5. Classifying the Point: After selecting the K nearest neighbours, the point is classified by majority vote: the class of the new data point is the dominant class among its K nearest neighbours.
6. Determining the Accuracy: The final step is to assess the performance of the KNN model. This is done by splitting the data into a training and a testing set, and then comparing the predicted classes with the actual classes in the testing set.
7. Feature Normalization: To avoid the dominance of features with larger values, min-max feature normalization is applied before using KNN:
   xnorm = (x − xmin) / (xmax − xmin)

3. Experiments

This chapter focuses on the experiments conducted to develop a machine learning model for movie recommendations. By testing various algorithms, we aim to enhance the accuracy and effectiveness of our recommendation system. Our goal is to better understand the key factors that contribute to successful movie recommendations, ultimately improving the user experience.

3.1.
Database description

The dataset utilized in this study was sourced from Kaggle.com, a widely recognized open-access platform renowned for its vast collection of publicly available datasets. The dataset is titled "TMDb Movies Dataset" [4]. It contains 10856 records in total, described by 21 columns.

Figure 2: Database

3.2. Evaluation Metric

The model is evaluated using the accuracy metric. Accuracy is the most popular metric; it shows how often a classification made by an ML model is correct overall:

   Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP (True Positives) are instances accurately identified as positive, TN (True Negatives) are instances accurately identified as negative, FN (False Negatives) are positive cases incorrectly identified as negative, and FP (False Positives) are negative cases incorrectly identified as positive.

3.3. Model analysis

Figure 3: Comparing accuracy for different k using standard normalization
Figure 4: Comparing accuracy for different k using min-max normalization

The figures above depict the accuracy of our KNN model across different k values using the standard and min-max normalization techniques. Our target variable, the rounded vote average, poses a challenge due to its unpredictability. Higher k values generally lead to improved model performance, indicating more stable predictions as more neighbours are considered. Additionally, standard normalization slightly outperforms min-max normalization when applied to features such as runtime and release year. In summary, our experiments highlight the effectiveness of higher k values and standard normalization in enhancing the predictive performance of our movie recommendation system. These findings emphasize the importance of careful normalization and k value selection in predictive modeling tasks.

4. Conclusion

This paper presents the design of a personalized movie recommendation system using soft set theory and the k-nearest neighbours (KNN) algorithm.
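For concreteness, the KNN rating-prediction pipeline summarized above can be sketched as follows. This is a minimal illustration on invented data (the feature values, ratings, and function names are ours, not the authors' exact implementation), combining min-max normalization, the Manhattan metric, majority voting, and the accuracy computation:

```python
from collections import Counter

# Invented example data: each movie is ([runtime_minutes, release_year], rounded vote_average).
train = [
    ([120, 2010], 7), ([90, 1999], 6), ([150, 2015], 8),
    ([100, 2005], 6), ([130, 2020], 7), ([85, 1995], 5),
]
test = [([110, 2012], 7), ([95, 2000], 6)]

def min_max_normalize(rows):
    """Scale every feature column to [0, 1] so large-valued features don't dominate."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h != l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def manhattan(a, b):
    """Manhattan metric: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the classes of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: manhattan(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Normalize training and test features on a common scale.
all_x = min_max_normalize([f for f, _ in train] + [f for f, _ in test])
train_x, test_x = all_x[:len(train)], all_x[len(train):]
train_y = [y for _, y in train]

predictions = [knn_predict(train_x, train_y, x, k=3) for x in test_x]
actual = [y for _, y in test]
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
```

On this toy data the two test movies are classified by the votes of their three nearest neighbours; on the real dataset the same procedure yields the accuracies reported in Section 3.3.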
The main goal is to build a system that recommends the top five movies for a given user according to their preferences. Additional functionalities include suggesting movies similar to a given title and predicting movie ratings based on data features. Soft set theory provides a powerful tool for dealing with the uncertainty and vagueness associated with users' preferences. Representing user preferences as soft sets allows calculating a total score for each movie and making personalized recommendations that align with users' individual preferences. This shows the flexibility and efficiency of soft sets in decision-making processes. The use of the k-nearest neighbours algorithm further extends the project. The KNN classifier identifies movies similar to a user-specified title and predicts movie ratings based on the ratings of the records closest in feature space. The effectiveness of these predictions was evaluated: the accuracy varies with k, increasing as k increases and reaching its highest value of 44% when k equals 9 (higher values of k were not checked). Experimental results validate the system's effectiveness in generating personalized movie recommendations, as the recommended movies do in fact align with the provided preferences. The accuracy of the KNN classifier reached only 44%, both because higher values of k were not tested and because it is hard to predict movie ratings from their features alone. Soft set theory and KNN have proven to be a potent combination for creating a recommendation system that can process diverse user inputs and provide personalized movie suggestions.

References

[1] D. Molodtsov, "Soft set theory — First results," Computers & Mathematics with Applications, vol. 37, no. 4-5, pp. 19-31, 1999.
[2] F. Ricci, L. Rokach, and B. Shapira, "Introduction to Recommender Systems Handbook," in Recommender Systems Handbook, Springer, Boston, MA, 2011, pp. 1-35.
[3] P. Lops, M. De Gemmis, and G. Semeraro, "Content-based Recommender Systems: State of the Art and Trends," in Recommender Systems Handbook, Springer, Boston, MA, 2011, pp. 73-105.
[4] TMDb Movies Dataset, Kaggle, available at: https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset/data