Movie Recommendation System*

Weronika Wołowczyk1,∗,†, Ewa Szymik1,†

1 Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

Abstract
The goal of the project is to deliver personalized movie suggestions based on user preferences by analyzing and processing a dataset of movies. The project's primary stages are data cleaning (some unnecessary dataset columns were removed), exploratory data analysis (several visualizations of dataset characteristics were presented), and building a recommendation system based on soft set theory. The core of the project is a recommendation system that makes movie suggestions based on user input. Users are asked to state their preferences concerning actors, genres, and keywords. A soft set-based classification method is then applied to score and rank the films according to these preferences. The system calculates a total score for each movie based on its attributes, ultimately providing the top five propositions. Methods are also introduced for recommending the five movies most similar to a given title and for predicting movie ratings from their features using the k-nearest neighbours (KNN) algorithm. The first method searches for the most similar movies based on their attributes; the second predicts a movie's rating by analyzing the votes of the k films with the most similar features. Overall, the project demonstrates the application of algorithmic techniques and machine learning methods: soft sets to provide personalized suggestions, and the k-nearest neighbours algorithm to analyse data and predict data attributes.

Keywords
movie recommendation, soft set, data analysis, personalized recommendations, data preprocessing, KNN algorithm, vote prediction

1. Introduction

The increase in the number of films produced each year makes it difficult for viewers to choose a movie that best suits their taste.
With the abundance of streaming services, users have a rich library of content available to them, and choosing what to watch can become increasingly complicated. This is why recommendation systems are needed: to provide personalized content and thus improve the user experience by recommending only the movies that might spark interest and match each viewer's individual preferences. Collaborative filtering, content-based recommendation, and hybrid methods are the typical techniques employed by existing recommendation systems. Collaborative filtering depends on the preferences of other like-minded users, whereas content-based filtering recommends items to the user based on item descriptions. Hybrid methods combine both approaches to leverage their strengths and mitigate their weaknesses.

*IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author
† These authors contributed equally.
ww308053@student.polsl.pl (W. Wołowczyk); es308045@student.polsl.pl (E. Szymik)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

This paper explores the use of soft set theory as an approach to movie recommendation systems. Soft sets, introduced by Molodtsov in 1999 [1], are mathematical models used for reasoning under conditions of uncertainty and vagueness. Unlike traditional sets, where an element either belongs to the set or does not, soft sets allow for partial membership, with elements having degrees of belonging. This degree is typically represented by a value between 0 and 1, indicating how strongly an element is associated with the set. Soft sets are particularly useful in fields such as decision-making and artificial intelligence, where uncertainty and vagueness are common.
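As a minimal illustration of the concept, a soft set can be sketched as a mapping from parameters to subsets of a universe. The film identifiers, genre parameters, and function names below are invented for this example; they are not taken from the paper's implementation:

```python
# A soft set (F, E) over a universe U, sketched as a plain dictionary.
# All data here is invented example data.
U = {"f1", "f2", "f3", "f4"}              # universal set of films
E = {"adventure", "drama", "comedy"}      # set of parameters (genres)

# F : E -> P(U): each parameter maps to the subset of films that have it
F = {
    "adventure": {"f1", "f2", "f3"},
    "drama": {"f2", "f4"},
    "comedy": {"f3"},
}

def membership(film, parameter):
    """Binary membership: 1 if the film belongs to F(parameter), else 0."""
    return 1 if film in F.get(parameter, set()) else 0

print(membership("f2", "drama"))    # f2 is a drama film -> 1
print(membership("f1", "comedy"))   # f1 is not a comedy -> 0
```

A degree of belonging between 0 and 1 can replace the binary value when partial membership is needed; the methodology below achieves a similar effect by weighting the binary memberships.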
In the context of movie selection, they provide a flexible classification method that can accommodate the varied nature of user preferences.

The KNN (k-nearest neighbours) algorithm is a simple, non-parametric method for classification tasks in machine learning. It operates on the principle of proximity: it finds the k objects in feature space that are closest to the element currently being tested; because of this feature similarity, they are called neighbours. Neighbours are drawn from the set of objects used to train the algorithm, and the resulting class is the one represented by the largest number of neighbours. Most often, the distance between elements is calculated using the Euclidean or Manhattan metric. The KNN classifier is used, firstly, to recommend the 5 movies most similar to the one provided by a user, and secondly, to predict movie ratings based on the vote_average attribute of similar movies. The steps of the algorithm are normalizing the data, splitting the data into a training set and a test set, fitting the model, and evaluating the accuracy of the predictions to determine the effectiveness of the model.

Overall, the main contribution of this project is a personalized recommendation system that leverages user-defined preferences for genres, actors, and keywords. By applying soft set theory, the system calculates a total score for each movie based on its alignment with user preferences and displays the five films with the highest score, resulting in customized movie suggestions. The second contribution is the recommendation of the 5 most similar movies: users input a title, and the system finds movies that are similar feature-wise. The third contribution is the prediction of movie ratings from their features: the system finds similar movies in the training set and, based on their vote_average attribute, predicts the ratings of movies from the test set. The latter two functionalities use the k-nearest neighbours algorithm.

2.
Methodology

Soft set methodology offers a flexible approach for handling uncertainty and making decisions based on multiple parameters. Its simplicity and adaptability make it a powerful tool for various applications, including recommendation systems.

2.1. Soft Set

A soft set (F, E) over a universal set U is a pair where F is a mapping given by F : E → P(U). Here, E is a set of parameters, and P(U) denotes the power set of U. For each parameter e ∈ E, F(e) is a subset of U.

2.2. Mathematical Model

• Step 1: Define the Universal Set U
  Let U represent a universal set containing elements that need to be analyzed and categorized. In a movie recommendation system, U is a set of films:
  U = {f1, f2, f3, ..., fn}
• Step 2: Define the Set of Parameters E
  Parameters E define the attributes relevant to the elements in U. These parameters could be movie genres:
  E = {action, drama, comedy, adventure, ...}
• Step 3: Define the Mapping F
  The mapping F associates each parameter e ∈ E with a subset of U. For instance, if the parameter is "adventure", F(adventure) might include the films classified as adventure films:
  F(adventure) = {f1, f2, f3}
  F(drama) = {f2, f4}

2.3. Constructing the Soft Set

• Step 4: Construct the Soft Set (F, E)
  The soft set is constructed by pairing each parameter with its corresponding subset:
  (F, E) = {(adventure, {f1, f2, f3}), (drama, {f2, f4}), ...}

2.4. Decision-Making Using Soft Sets

• Step 5: Represent the Soft Set in a Binary Table
  The soft set can be represented in a binary table for easier analysis. Each row corresponds to an element of U, and each column corresponds to a parameter in E. An entry is 1 if the element is associated with the parameter, and 0 otherwise.

  U     adventure   drama   comedy
  f1        1         0        0
  f2        1         1        0
  f3        1         0        1
  ...      ...       ...      ...

• Step 6: Calculate Selection Values
  Assign a weight to each parameter to reflect its importance. Multiply the binary values by these weights and sum them up for each element of U.
This gives a selection value indicating the relevance of each element with respect to the given parameters.
• Step 7: Determine the Best Choice
  The elements with the highest selection values are considered the best choices for the given parameters. This can be used for recommendations.

2.5. Computational Example

• Class U: U = {f1, f2, f3, ..., fn}, where the fi are films.
• Set of parameters E defining movie genres:
  E = {action, drama, crime, adventure, science-fiction, thriller, fantasy, western, animation, ...}
• Set of considered parameters A:
  A = {adventure, fantasy, animation} = {e1, e2, e3}
• There are 6 films in the class U: U = {f1, f2, f3, f4, f5, f6}
• Assumed mapping F:
  F(e1) = {f1, f2, f3, f6}
  F(e2) = {f1, f4, f6}
  F(e3) = {f1, f3, f6}
• Soft set (F, A):
  (F, A) = {(adventure, {f1, f2, f3, f6}), (fantasy, {f1, f4, f6}), (animation, {f1, f3, f6})}
• ci — selection value of object fi ∈ U, computed as ci = Σj dij
• dij = wj × fij — entry of the weighted table, where wj ∈ (0, 1] is the weight of parameter ej and fij is the binary membership value

  U     adventure (w1 = 0.8)   fantasy (w2 = 0.3)   animation (w3 = 0.9)   Selection value
  f1             1                     1                      1             c1 = 2.0
  f2             1                     0                      0             c2 = 0.8
  f3             1                     0                      1             c3 = 1.7
  f4             0                     1                      0             c4 = 0.3
  f5             0                     0                      0             c5 = 0
  f6             1                     1                      1             c6 = 2.0

Table 1: Binary table with weights assigned to parameters: 0.8 for adventure, 0.3 for fantasy, 0.9 for animation.

• The table shows that the films best matching the selection parameters are f1 and f6.
• The same calculations are performed for actors and keywords.

Visualization of the recommendation system:

Figure 1: Input of example preferences and matching results

2.6. K-nearest neighbours

The KNN algorithm is a simple classifier that finds the k elements in a given dataset that are most similar to the test element. It follows these steps:

1. Data Collection: Gathering the training data used to build the model. Each data point is represented by a set of features and its corresponding class to be predicted.
In this project, the data points are movies from the database, and the class to be predicted is the vote_average value of the movie.
2. Determining the Value of Parameter K: The parameter K specifies how many nearest neighbours are considered during the classification of a new data point. Choosing an appropriate value of K can significantly impact the effectiveness of the model. In the movie recommendation system, k takes values from 2 to 9.
3. Calculating Distances: For a new data point whose class is to be predicted, distances to all points in the training set are calculated. This determines the similarity between points. In the project, the Manhattan metric is used:
   d(x, y) = Σi |xi − yi|
4. Selecting the K Nearest Neighbours: The next step is to select the K training points closest to the point currently being tested.
5. Classifying the Point: After selecting the K nearest neighbours, the point is classified by majority vote: the class of the new data point is the dominant class among its K nearest neighbours.
6. Determining the Accuracy: The final step is to assess the performance of the KNN model. This is done by splitting the data into a training and a testing set, and then comparing the predicted classes with the actual classes in the testing set.
7. Feature Normalization: To avoid the dominance of features with larger values, min-max feature normalization is applied before using KNN:
   xnorm = (x − xmin) / (xmax − xmin)

3. Experiments

This chapter focuses on the experiments conducted to develop a machine learning model for movie recommendations. By testing various algorithms, we aim to enhance the accuracy and effectiveness of our recommendation system. Our goal is to better understand the key factors that contribute to successful movie recommendations, ultimately improving the user experience.

3.1.
Database description

The dataset utilized in this study was sourced from Kaggle.com, a widely recognized open-access platform renowned for its vast collection of publicly available datasets. The dataset is titled "TMDb Movies Dataset" [4]. It contains 10856 records in total, described by 21 columns.

Figure 2: Database

3.2. Evaluation Metric

The model is evaluated using the accuracy metric. Accuracy is the most popular metric; it shows how often a classification made by an ML model is correct overall:

   Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP (True Positives) are instances accurately identified as positive, TN (True Negatives) are instances accurately identified as negative, FN (False Negatives) are positive cases incorrectly identified as negative, and FP (False Positives) are negative cases incorrectly identified as positive.

3.3. Model analysis

Figure 3: Comparing accuracy for different k using standard normalization
Figure 4: Comparing accuracy for different k using min-max normalization

The figures above depict the accuracy of our KNN model across different k values using the standard and min-max normalization techniques. Our target variable, the rounded vote average, poses a challenge due to its unpredictability. Higher k values generally lead to improved model performance, indicating more stable predictions as more neighbours are considered. Additionally, standard normalization slightly outperforms min-max normalization when applied to features such as runtime and release year. In summary, our experiments highlight the effectiveness of higher k values and standard normalization in enhancing the predictive performance of our movie recommendation system. These findings emphasize the importance of careful normalization and k value selection in predictive modeling tasks.

4. Conclusion

This paper presents the design of a personalized movie recommendation system using soft set theory and the k-nearest neighbours (KNN) algorithm.
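For concreteness, the KNN rating-prediction pipeline summarized above can be sketched as follows. This is a minimal illustration on invented data (the feature values, ratings, and function names are ours, not the authors' exact implementation), combining min-max normalization, the Manhattan metric, majority voting, and the accuracy computation:

```python
from collections import Counter

# Invented example data: each movie is ([runtime_minutes, release_year], rounded vote_average).
train = [
    ([120, 2010], 7), ([90, 1999], 6), ([150, 2015], 8),
    ([100, 2005], 6), ([130, 2020], 7), ([85, 1995], 5),
]
test = [([110, 2012], 7), ([95, 2000], 6)]

def min_max_normalize(rows):
    """Scale every feature column to [0, 1] so large-valued features don't dominate."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h != l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def manhattan(a, b):
    """Manhattan metric: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the classes of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: manhattan(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Normalize training and test features on a common scale.
all_x = min_max_normalize([f for f, _ in train] + [f for f, _ in test])
train_x, test_x = all_x[:len(train)], all_x[len(train):]
train_y = [y for _, y in train]

predictions = [knn_predict(train_x, train_y, x, k=3) for x in test_x]
actual = [y for _, y in test]
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
```

On this toy data the two test movies are classified by the votes of their three nearest neighbours; on the real dataset the same procedure yields the accuracies reported in Section 3.3.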
The main goal is to build a system that recommends the top five movies for a given user according to their preferences. Additional functionalities include suggesting movies similar to a given title and predicting movie ratings based on data features. Soft set theory provides a powerful tool for dealing with the uncertainty and vagueness associated with users' preferences. Representing user preferences as soft sets allows calculating a total score for each movie and making personalized recommendations that align with users' individual preferences. This shows the flexibility and efficiency of soft sets in decision-making processes. The use of the k-nearest neighbours algorithm further extends the project. The KNN classifier identifies movies similar to a user-specified title and predicts movie ratings based on the ratings of the records closest in feature space. The effectiveness of these predictions was evaluated: the accuracy varies with k, increasing as k increases and reaching its highest value of 44% when k equals 9 (higher values of k were not checked). Experimental results validate the system's effectiveness in generating personalized movie recommendations, as the recommended movies do in fact align with the provided preferences. The accuracy of the KNN classifier reached only 44%, both because higher values of k were not tested and because it is hard to predict movie ratings from their features alone. Soft set theory and KNN have proven to be a potent combination for creating a recommendation system that can process diverse user inputs and provide personalized movie suggestions.

References

[1] D. Molodtsov, "Soft set theory — First results," Computers & Mathematics with Applications, vol. 37, no. 4-5, pp. 19-31, 1999.
[2] F. Ricci, L. Rokach, and B. Shapira, "Introduction to Recommender Systems Handbook," in Recommender Systems Handbook, Springer, Boston, MA, 2011, pp. 1-35.
[3] P. Lops, M. De Gemmis, and G. Semeraro, "Content-based Recommender Systems: State of the Art and Trends," in Recommender Systems Handbook, Springer, Boston, MA, 2011, pp. 73-105.
[4] TMDb Movies Dataset, Kaggle, available at: https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset/data