<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Movie Recommendation System</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Weronika</forename><surname>Wołowczyk</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ewa</forename><surname>Szymik</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Information Society</orgName>
								<orgName type="institution">University Studies</orgName>
								<address>
									<addrLine>2024, May 17</addrLine>
									<settlement>Kaunas</settlement>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Movie Recommendation System</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4D313CEDA2AAD054496CEFE7B4B82BB9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>movie recommendation</term>
					<term>softset</term>
					<term>data analysis</term>
					<term>personalized recommendations</term>
					<term>data preprocessing</term>
					<term>KNN algorithm</term>
					<term>vote predictions</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The goal of the project is to deliver personalized movie suggestions based on user preferences by analyzing and processing a dataset of movies. The project's primary stages are data cleaning (some unnecessary dataset columns were removed), exploratory data analysis (several visualizations of dataset characteristics were presented), and the creation of a recommendation system based on soft set theory.</p><p>The core of the project is a recommendation system that makes movie suggestions based on user input. Users are asked to state their preferences concerning actors, genres, and keywords. A soft set-based classification method is then applied to score and rank the films according to these preferences. The system calculates a total score for each movie based on its attributes, ultimately returning the top five propositions.</p><p>The project also introduces methods for recommending the five movies most similar to a given title and for predicting movie ratings from their features using the k-nearest neighbours (KNN) algorithm. In the first method, the algorithm searches for the most similar movies based on their attributes; in the second, it predicts a movie's rating by analyzing the votes of the k films with the most similar features.</p><p>Overall, the project demonstrates the application of algorithmic and machine learning techniques: soft sets to provide personalized suggestions, and the k-nearest neighbours algorithm to analyse data and predict data attributes.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The increase in the number of films produced each year makes it difficult for viewers to choose a movie that best suits their taste. With so many streaming services, users have a rich library of content available to them, and picking what to watch can become more and more complicated. That is why recommendation systems are needed: to provide personalized content and, as a result, to improve the user experience by recommending only the movies that might spark interest and match each viewer's individual preferences.</p><p>Collaborative filtering, content-based recommendation, and hybrid methods are the typical techniques employed by many existing recommendation systems. Collaborative filtering depends on the preferences of other like-minded users, whereas content-based filtering recommends items to the user based on descriptions of items. Hybrid methods combine both approaches to leverage their strengths and mitigate their weaknesses. This paper explores the use of soft set theory as an approach to movie recommendation systems. Soft sets, introduced by Molodtsov in 1999, are mathematical models used for reasoning under conditions of uncertainty and vagueness. Unlike traditional sets, where an element either belongs to the set or does not, soft sets allow for partial membership, with elements having degrees of belonging. This degree is typically represented by a value between 0 and 1, indicating how strongly an element is associated with the set. Soft sets are particularly useful in fields such as decision-making and artificial intelligence, where uncertainty and vagueness are common.
In the context of movie selection, they provide a flexible classification method that can accommodate the varied nature of user preferences.</p><p>The KNN (k-nearest neighbours) algorithm is a simple, non-parametric method for classification tasks in machine learning. It operates on the principle of proximity: it consists of finding the 𝑘 objects in feature space closest to the element currently being tested. Because of this feature similarity, they are called neighbours. Neighbours are drawn from the set of objects used to train the algorithm. The resulting class is the one that occurs most often among the neighbours. Most often, the distance between elements is calculated using the Euclidean or Manhattan metric.</p><p>The KNN classifier is used, firstly, to recommend the 5 movies most similar to the one provided by a user and, secondly, to predict movie ratings based on the vote_average attribute of similar movies. The steps of the algorithm are normalizing the data, splitting the data into a training set and a test set, fitting the model, and evaluating the accuracy of the predictions to determine the effectiveness of the model.</p><p>Overall, the main point of this project is the development of a personalized recommendation system that leverages user-defined preferences for genres, actors, and keywords. By applying soft set theory, the system calculates a total score for each movie based on its alignment with user preferences and displays the five films with the highest scores, resulting in customized movie suggestions. The second point is the recommendation of the 5 most similar movies: users are asked to input a title, and the system then finds feature-wise similar movies. The third point is the prediction of movie ratings based on their features: the system finds similar movies in the training set and, based on their vote_average attribute, predicts the ratings of movies from the test set. Both of the latter use the k-nearest neighbours algorithm.</p></div>
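As a rough illustration of the scoring idea described above, the following sketch assigns one point per matched preference and returns the five highest-scoring titles. The attribute fields (genres, actors, keywords) and the uniform one-point-per-match scoring are simplifying assumptions for illustration, not the paper's exact weighting scheme:

```python
# Sketch of soft set-style preference scoring (illustrative attribute names).

def score_movie(movie, preferences):
    """Total score: one point for each preferred genre, actor, or keyword
    that appears among the movie's attributes (uniform weights assumed)."""
    return sum(
        1
        for field in ("genres", "actors", "keywords")
        for wanted in preferences.get(field, set())
        if wanted in movie.get(field, set())
    )

def top_five(movies, preferences):
    """Rank movies by total score (ties keep input order) and return
    the titles of the five best matches."""
    ranked = sorted(movies, key=lambda m: score_movie(m, preferences), reverse=True)
    return [m["title"] for m in ranked[:5]]
```

Because Python's sort is stable, equally scored films keep their original dataset order in the ranking.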
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>Soft set methodology offers a flexible approach for handling uncertainties and making decisions based on multiple parameters. Its simplicity and adaptability make it a powerful tool for various applications, including recommendation systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Soft Set</head><p>A soft set (𝐹, 𝐸) over a universal set 𝑈 is a pair where 𝐹 is a mapping given by 𝐹 : 𝐸 → 𝑃 (𝑈 ). Here, 𝐸 is a set of parameters, and 𝑃 (𝑈 ) denotes the power set of 𝑈 . For each parameter 𝑒 ∈ 𝐸, 𝐹 (𝑒) is a subset of 𝑈 .</p></div>
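Under the definition above, a soft set can be sketched as a mapping from parameters to subsets of the universe; the film and genre names below are illustrative, not taken from the dataset:

```python
# A soft set (F, E) over universe U, modelled as a dict from parameters
# (elements of E) to subsets of U. Film and genre names are illustrative.
U = {"f1", "f2", "f3", "f4"}
F = {
    "adventure": {"f1", "f2", "f3"},  # F(adventure), a subset of U
    "drama": {"f2", "f4"},            # F(drama), a subset of U
}

def members(F, e):
    """Return F(e): the subset of U associated with parameter e
    (empty set for a parameter outside E)."""
    return F.get(e, set())
```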
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Mathematical Model</head><p>• Step 1: Define the Universal Set 𝑈 . Let 𝑈 represent a universal set containing the elements that need to be analyzed and categorized. In a movie recommendation system, 𝑈 is a set of films:</p><formula xml:id="formula_0">𝑈 = {𝑓 1 , 𝑓 2 , 𝑓 3 , . . . , 𝑓 𝑛 }</formula><p>The mapping 𝐹 associates each parameter 𝑒 ∈ 𝐸 with a subset of 𝑈 . For instance, if the parameter is "adventure", 𝐹 (adventure) might include the films classified as adventure films:</p><formula xml:id="formula_1">𝐹 (adventure) = {𝑓 1 , 𝑓 2 , 𝑓 3 } 𝐹 (drama) = {𝑓 2 , 𝑓 4 }</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Constructing the Soft Set</head><p>• Step 4: Construct the Soft Set (𝐹, 𝐸)</p><p>The soft set is constructed by pairing each parameter with its corresponding subset:</p><formula xml:id="formula_2">(𝐹, 𝐸) = {(adventure, {𝑓 1 , 𝑓 2 , 𝑓 3 }), (drama, {𝑓 2 , 𝑓 4 }), . . .}</formula><p>The elements with the highest selection values are considered the best choices based on the parameters. This can be used for recommendations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Decision-Making Using SoP Sets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Computational Example</head><p>• There are 6 films in the class 𝑈 :</p><formula xml:id="formula_3">𝑈 = {𝑓 1 , 𝑓 2 , 𝑓 3 , 𝑓 4 , 𝑓 5 , 𝑓 6 }</formula><p>• Set of considered parameters 𝐸 and mapping 𝐹 :</p><formula xml:id="formula_4">𝐸 = {𝑒 1 , 𝑒 2 , 𝑒 3 } = {adventure, fantasy, animation} 𝐹 (𝑒 1 ) = {𝑓 1 , 𝑓 2 , 𝑓 3 , 𝑓 6 } 𝐹 (𝑒 2 ) = {𝑓 1 , 𝑓 4 , 𝑓 6 } 𝐹 (𝑒 3 ) = {𝑓 1 , 𝑓 3 , 𝑓 6 } (𝐹, 𝐸) = {(adventure, {𝑓 1 , 𝑓 2 , 𝑓 3 , 𝑓 6 }), (fantasy, {𝑓 1 , 𝑓 4 , 𝑓 6 }), (animation, {𝑓 1 , 𝑓 3 , 𝑓 6 })}</formula><p>• 𝑐 𝑖 : selection value of object 𝑓 𝑖 ∈ 𝑈</p><p>• 𝑑 𝑖𝑗 = 𝑤 𝑗 × 𝑓 𝑖𝑗 : entry of the weighted table, with weights 𝑤 𝑗 ∈ (0, 1]: 𝑤 1 = 0.8 (adventure), 𝑤 2 = 0.3 (fantasy), 𝑤 3 = 0.9 (animation)</p><formula xml:id="formula_5">U adventure fantasy animation Selection Value 𝑓 1 1 1 1 𝑐 1 = 2 𝑓 2 1 0 0 𝑐 2 = 0.8 𝑓 3 1 0 1 𝑐 3 = 1.7 𝑓 4 0 1 0 𝑐 4 = 0.3 𝑓 5 0 0 0 𝑐 5 = 0 𝑓 6 1 1 1 𝑐 6 = 2</formula></div>
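The selection-value computation of this example can be reproduced with a short script; this is a sketch of the weighted-table calculation, with rounding used only to absorb floating-point noise:

```python
# Weighted-table calculation for the worked example: c_i is the sum of the
# weights of all parameters whose subset contains film f_i.
U = ["f1", "f2", "f3", "f4", "f5", "f6"]
F = {
    "adventure": {"f1", "f2", "f3", "f6"},
    "fantasy": {"f1", "f4", "f6"},
    "animation": {"f1", "f3", "f6"},
}
weights = {"adventure": 0.8, "fantasy": 0.3, "animation": 0.9}

def selection_values(U, F, weights):
    """c_i = sum over parameters e of w_e * [f_i in F(e)]."""
    return {
        f: round(sum(w for e, w in weights.items() if f in F[e]), 10)
        for f in U
    }

c = selection_values(U, F, weights)  # f1 and f6 score highest (2.0)
```

The values reproduce the table: 𝑐 1 = 𝑐 6 = 2, 𝑐 2 = 0.8, 𝑐 3 = 1.7, 𝑐 4 = 0.3, 𝑐 5 = 0, so 𝑓 1 and 𝑓 6 are the best choices.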
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Binary table with weights assigned to parameters, for adventure = 0.8; for fantasy = 0.3; for animation = 0.9</p><p>• In the table, it is evident that the films most corresponding to the selection parameters are 𝑓 1 and 𝑓 6 .</p><p>• The same calculations are performed for actors and keywords.</p><p>Visualization of the recommendation system: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.6.">K-nearest neighbours</head><p>The KNN algorithm is a simple classifier that consists of finding the 𝑘 elements in a given dataset that are most similar to the test element. It follows these steps:</p><p>1. Data Collection: Gathering the training data used to build the model. Each data point is represented by a set of features and its corresponding class to be predicted. In this project, the data points are movies from the database, and the class to be predicted is the vote_average value of the movie. Before distances are computed, features are min-max normalized:</p><formula>𝑥 norm = (𝑥 − 𝑥 min ) / (𝑥 max − 𝑥 min )</formula></div>
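The KNN steps described here (min-max normalization, Manhattan distance, majority vote) can be sketched in plain Python. This is a minimal illustration under the stated assumptions, not the project's actual implementation, and the feature vectors in the usage example are hypothetical:

```python
from collections import Counter

def min_max_normalize(column):
    """Scale a list of numbers into [0, 1]: (x - x_min) / (x_max - x_min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def manhattan(a, b):
    """Manhattan distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_X)), key=lambda i: manhattan(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

For example, `knn_predict([[0, 0], [0, 1], [5, 5], [6, 5]], ["low", "low", "high", "high"], [0.2, 0.2], 3)` returns "low", since two of the three nearest neighbours carry that class.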
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>This chapter focuses on the experiments conducted to develop a machine learning model for movie recommendations. By testing various algorithms, we aim to enhance the accuracy and effectiveness of our recommendation system. Our goal is to better understand the key factors that contribute to successful movie recommendations, ultimately improving the user experience.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Database description</head><p>The dataset utilized in this study was sourced from Kaggle.com, a widely recognized open-access platform renowned for its vast collection of publicly available datasets. The dataset is titled "TMDb Movies Dataset". There are 10856 records in total, described by 21 columns. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Evaluation Metric</head><p>The model will be evaluated using the accuracy metric. Accuracy is the most popular metric; it shows how often a classification made by an ML model is correct overall:</p><formula>Accuracy = (𝑇𝑃 + 𝑇𝑁) / (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁)</formula><p>Here 𝑇𝑃 (True Positives) are instances accurately identified as positive, 𝑇𝑁 (True Negatives) are instances accurately identified as negative, 𝐹𝑁 (False Negatives) are positive cases incorrectly identified as negative, and 𝐹𝑃 (False Positives) are negative cases incorrectly identified as positive.</p><p>Higher 𝑘 values generally lead to improved model performance, indicating more stable predictions as more neighbours are considered. Additionally, standard normalization slightly outperforms min-max normalization when applied to features such as runtime and release year.</p></div>
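The metric can be expressed directly from the four counts. The counts in the comment below are hypothetical, chosen only to illustrate an accuracy of 44%; they are not the experiment's actual confusion-matrix values:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 44 correct predictions out of 100,
# i.e. accuracy(40, 4, 30, 26) evaluates to 0.44.
```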
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Model analysis</head><p>In summary, our experiments highlight the effectiveness of higher 𝑘 values and standard normalization in enhancing the predictive performance of our movie recommendation system. These findings emphasize the importance of careful normalization and 𝑘 value selection in predictive modeling tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>This paper presents the design of a personalized movie recommendation system using soft set theory and the k-nearest neighbours (KNN) algorithm. The main goal is to build a system that recommends the top five movies for a given user according to their preferences. Additional functionalities include suggesting movies similar to a given title and predicting movie ratings based on data features.</p><p>Soft set theory provides a powerful tool for dealing with the uncertainty and vagueness associated with users' preferences. Representing user preferences as soft sets allows calculating a total score for each movie and making personalized recommendations that align with users' individual preferences. This shows the flexibility and efficiency of soft sets in decision-making processes.</p><p>The use of the k-nearest neighbours algorithm further expands the project. The KNN classifier identifies movies similar to a user-specified title and predicts movie ratings based on the ratings of the records closest in feature space. The effectiveness of those predictions was checked. The accuracy oscillates for different values of 𝑘, increasing as 𝑘 increases and reaching its highest value of 44% when 𝑘 equals 9 (the accuracy for higher values of 𝑘 was not checked).</p><p>Experimental results validate the system's effectiveness in generating personalized movie recommendations, as the recommended movies do, in fact, align with the provided preferences. The accuracy of the KNN classifier reached only 44% because higher values of 𝑘 were not tested and it is hard to predict movie ratings based solely on their features. Soft set theory and KNN have proven to be a potent combination for creating a recommendation system that can process diverse user inputs and provide personalized movie suggestions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>• Step 2 :</head><label>2</label><figDesc>𝑓 3 , . . . , 𝑓 𝑛 } Define the Set of Parameters 𝐸 . Parameters 𝐸 define the attributes relevant to the elements in 𝑈 . These parameters could be movie genres: 𝐸 = {action, drama, comedy, adventure, . . .} • Step 3: Define the Mapping 𝐹</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Input of example preferences and matching results</figDesc><graphic coords="5,91.00,517.30,407.80,108.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>2 . 4 .</head><label>24</label><figDesc>2. Determining the Value of Parameter K: The parameter K specifies how many nearest neighbors will be considered during the classification of a new data point. Choosing an appropriate value for K can significantly impact the effectiveness of the model. In the movie recommendation system, 𝑘 takes values from 2 to 9. 3. Calculating Distances: For a new data point whose class is to be predicted, distances to all points in the training set are calculated. This determines the similarity between points. In the project, the Manhattan metric is used. 4. Selecting K Nearest Neighbors: The next step is to select the K training points that have the closest distances to the point currently tested. 5. Classifying the Point: After selecting the K nearest neighbors, the point is classified by majority vote, where the class of the new data point is determined by the dominant class among the K nearest neighbors. 6. Determining the Accuracy: The final step is to assess the performance of the KNN model. This is done by splitting the data into a training and a testing set, and then comparing the predicted classes with the actual classes in the testing set. 7. To avoid the dominance of features with larger values, feature normalization is applied before using KNN.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Database</figDesc><graphic coords="7,155.35,171.50,251.55,319.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Comparing accuracy for different 𝑘 using Standard normalization</figDesc><graphic coords="8,196.80,190.40,198.25,97.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Comparing accuracy for different 𝑘 using Min-max normalization</figDesc><graphic coords="8,194.95,344.80,201.35,99.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Calculate Selection Values: Assign weights to each parameter to reflect their importance. Multiply the binary values by these weights and sum them up for each element in 𝑈 . This gives a selection value indicating the relevance of each element based on the given parameters. • Step 7: Determine the Best Choice</figDesc><table><row><cell>U</cell><cell>adventure</cell><cell>drama</cell><cell>comedy</cell></row><row><cell>𝑓 1</cell><cell>1</cell><cell>0</cell><cell>0</cell></row><row><cell>𝑓 2</cell><cell>1</cell><cell>1</cell><cell>0</cell></row><row><cell>𝑓 3</cell><cell>1</cell><cell>0</cell><cell>1</cell></row><row><cell>...</cell><cell>...</cell><cell>...</cell><cell>...</cell></row></table><note>• Step 5: Represent the Soft Set in a Binary Table. The soft set can be represented in a binary table for easier analysis. Each row corresponds to an element in 𝑈 , and each column corresponds to a parameter in 𝐸. An entry is 1 if the element is associated with the parameter, and 0 otherwise. • Step 6:</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>𝑈 = {𝑓 1 , 𝑓 2 , 𝑓 3 , . . . , 𝑓 𝑛 } • Set of parameters 𝐸 defining movie genres: 𝐸 = {action, drama, crime, adventure, science-fiction, thriller, fantasy, western, animation, . . .}</figDesc><table><row><cell>• Set of considered parameters 𝐴:</cell></row><row><cell>𝐴 = {adventure, fantasy, animation}</cell></row><row><cell>• There are 6 films in the class 𝑈 :</cell></row></table><note>• Class 𝑈 :where 𝑓 𝑖 are films.</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Soft set theory -First results</title>
		<author>
			<persName><forename type="first">D</forename><surname>Molodtsov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Mathematics with Applications</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">4-5</biblScope>
			<biblScope unit="page" from="19" to="31" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Introduction to Recommender Systems Handbook</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ricci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rokach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shapira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Recommender Systems Handbook</title>
				<meeting><address><addrLine>Boston, MA</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1" to="35" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Content-based Recommender Systems: State of the Art and Trends</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lops</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">De</forename><surname>Gemmis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Recommender Systems Handbook</title>
				<meeting><address><addrLine>Boston, MA</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="73" to="105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><surname>Tmdb</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset/data" />
		<title level="m">The Movie Database</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
