=Paper=
{{Paper
|id=Vol-3885/paper42
|storemode=property
|title=Movie Recommendation System
|pdfUrl=https://ceur-ws.org/Vol-3885/paper42.pdf
|volume=Vol-3885
|authors=Weronika Wołowczyk,Ewa Szymik
|dblpUrl=https://dblp.org/rec/conf/ivus/WolowczykS24
}}
==Movie Recommendation System==
Weronika Wołowczyk1,∗,†, Ewa Szymik1,†
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND
Abstract
The goal of the project is to deliver personalized movie suggestions based on user preferences by
analyzing and processing a dataset of movies. The project’s primary stages are data cleaning (as some
unnecessary dataset columns were removed), exploratory data analysis (several visualizations of dataset
characteristics were presented), and creating a recommendation system based on a soft-set theory.
The core of the project is a recommendation system that makes movie suggestions based on user
input. Users are asked to state their preferences concerning actors, genres, and keywords. Then, a soft
set-based classification method is applied to score and rank the films depending on these preferences.
The system calculates a total score for each movie based on its attributes, ultimately providing the top
five propositions.
Methods are also introduced for recommending the 5 movies most similar to a given title and for
predicting movie ratings based on their features using the k-nearest neighbours (KNN) algorithm. In the first
method, the algorithm searches for the most similar movies based on their attributes, and in the second it
predicts a movie’s rating by analyzing the votes of the k films with the most similar features.
Overall, the project presents the application of algorithmic techniques and machine learning methods:
soft sets, to provide personalized suggestions, and the k-nearest neighbours algorithm, to analyse data
and predict data attributes.
Keywords
movie recommendation, softset, data analysis, personalized recommendations, data preprocessing, KNN
algorithm, vote predictions
1. Introduction
The increasing number of films produced each year makes it difficult for viewers to
choose a movie that best suits their taste. With the abundance of streaming services, users
have a rich library of content available to them, and picking what to watch can
become more and more complicated. That is why recommendation systems are needed: to
provide personalized content and, as a result, to improve the user experience by recommending only
the movies that might spark an interest and match the viewers’ individual preferences.
Collaborative filtering, content-based recommendation and hybrid methods are the typical
techniques that are employed by many existing recommendation systems. Collaborative filtering
depends on preferences of other like-minded users whereas content-based filtering recommends items
to the user based on descriptions of items. Hybrid methods combine both approaches to leverage
their strengths and mitigate their weaknesses.
*IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗Corresponding author
†These authors contributed equally.
ww308053@student.polsl.pl (W. Wołowczyk); es308045@student.polsl.pl (E. Szymik)
©️ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
This paper explores the use of soft set theory as an approach to movie recommendation
systems. Soft sets, introduced by Molodtsov in 1999, are mathematical models used for reasoning
under conditions of uncertainty and vagueness. Unlike traditional sets, where an element
either belongs to the set or does not, soft sets allow for partial membership, with elements
having degrees of belonging. This degree is typically represented by a value between 0 and 1,
indicating how strongly an element is associated with the set. Soft sets are particularly useful in
fields such as decision-making and artificial intelligence, where uncertainty and vagueness are
common. In the context of movie selection, they provide a flexible classification method that
can accommodate the varied nature of user preferences.
The KNN (k-nearest neighbors) algorithm is a simple, non-parametric method for classification
tasks in machine learning. It operates on the principle of proximity: it finds the 𝑘
objects in feature space closest to the element currently being tested. Because they
are selected by feature similarity, these objects are called neighbours. Neighbours are drawn from
the set of objects used to train the algorithm, and the resulting class is the one most
frequent among them. Most often, the distance between elements is calculated using the Euclidean or
Manhattan metric.
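As an illustration, the two metrics mentioned above can be computed for a pair of feature vectors. This is a minimal sketch with made-up vectors, not code from the project:

```python
import math

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```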
The KNN classifier is used, firstly, to recommend the 5 movies most similar to the one provided by
a user and, secondly, to predict movie ratings based on the vote_average attribute of similar movies. The steps of
the algorithm are normalizing the data, splitting the data into a training set and a test set, fitting the
model, and evaluating the accuracy of the predictions to determine the effectiveness of the model.
Overall, the main point of this project is the development of a personalized recommendation
system that leverages user-defined preferences for genres, actors, and keywords. By applying
soft set theory, the system calculates a total score for each movie based on its alignment with
user preferences and displays the five films with the highest scores, resulting in customized
movie suggestions. The second point is the recommendation of the 5 most similar movies: users are
asked to input a title, and the system then finds movies that are similar feature-wise. The prediction of
movie ratings based on their features is the third point: the system finds similar movies in the
training set and, based on their vote_average attribute, predicts the ratings of movies from the test set. Both
of the latter points use the k-nearest neighbours algorithm.
2. Methodology
Soft set methodology offers a flexible approach for handling uncertainties and making decisions based
on multiple parameters. Its simplicity and adaptability make it a powerful tool for various
applications, including recommendation systems.
2.1. Soft Set
A soft set (𝐹, 𝐸) over a universal set 𝑈 is a pair where 𝐹 is a mapping given by 𝐹 : 𝐸 → 𝑃 (𝑈 ).
Here, 𝐸 is a set of parameters, and 𝑃 (𝑈 ) denotes the power set of 𝑈 . For each parameter 𝑒 ∈ 𝐸,
𝐹 (𝑒) is a subset of 𝑈 .
2.2. Mathematical Model
• Step 1: Define the Universal Set 𝑈
Let 𝑈 represent a universal set containing elements that need to be analyzed and catego-
rized. In a movie recommendation system, 𝑈 is a set of films:
𝑈 = {𝑓1, 𝑓2, 𝑓3, . . . , 𝑓𝑛}
• Step 2: Define the Set of Parameters 𝐸
Parameters 𝐸 define the attributes relevant to the elements in 𝑈 . These parameters could be
movie genres:
𝐸 = {action, drama, comedy, adventure, . . .}
• Step 3: Define the Mapping 𝐹
The mapping 𝐹 associates each parameter 𝑒 ∈ 𝐸 with a subset of 𝑈 . For instance, if the
parameter is "adventure," 𝐹 (adventure) might include films classified as adventure films:
𝐹 (adventure) = {𝑓1, 𝑓2, 𝑓3}
𝐹 (drama) = {𝑓2, 𝑓4}
2.3. Constructing the Soft Set
• Step 4: Construct the Soft Set (𝐹, 𝐸)
The soft set is constructed by pairing each parameter with its corresponding subset:
(𝐹, 𝐸) = {(adventure, {𝑓1, 𝑓2, 𝑓3}), (drama, {𝑓2, 𝑓4}), . . .}
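A soft set of this kind maps each parameter to a subset of 𝑈, which can be represented directly as a dictionary. The following is an illustrative sketch using the placeholder film identifiers from the example above:

```python
# Universe of films and the soft set (F, E) as a parameter -> subset mapping,
# using the placeholder identifiers from the example above
U = {"f1", "f2", "f3", "f4"}
F = {
    "adventure": {"f1", "f2", "f3"},
    "drama": {"f2", "f4"},
}

# Each F(e) must be a subset of U, as the definition requires
assert all(subset <= U for subset in F.values())

# Membership query: which parameters does f2 satisfy?
print(sorted(e for e, films in F.items() if "f2" in films))  # ['adventure', 'drama']
```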
2.4. Decision-Making Using Soft Sets
• Step 5: Represent Soft Set in a Binary Table
The soft set can be represented in a binary table for easier analysis. Each row corresponds to an
element in 𝑈 , and each column corresponds to a parameter in 𝐸. An entry is 1 if the
element is associated with the parameter, and 0 otherwise.
U    adventure  drama  comedy
𝑓1   1          0      0
𝑓2   1          1      0
𝑓3   1          0      1
...  ...        ...    ...
• Step 6: Calculate Selection Values
Assign weights to each parameter to reflect their importance. Multiply the binary values by
these weights and sum them up for each element in 𝑈 . This gives a selection value
indicating the relevance of each element based on the given parameters.
• Step 7: Determine the Best Choice
The elements with the highest selection values are considered the best choices based on the
parameters. This can be used for recommendations.
2.5. Computational Example
• Class 𝑈 :
𝑈 = {𝑓1, 𝑓2, 𝑓3, . . . , 𝑓𝑛}
where 𝑓𝑖 are films.
• Set of parameters 𝐸 defining movie genres:
𝐸 = {action, drama, crime, adventure, science-fiction, thriller, fantasy, western, animation, . . .}
• Set of considered parameters 𝐴:
𝐴 = {adventure, fantasy, animation}
• There are 6 films in the class 𝑈 :
𝑈 = {𝑓1, 𝑓2, 𝑓3, 𝑓4, 𝑓5, 𝑓6}
𝐴 = {𝑒1, 𝑒2, 𝑒3}, where 𝑒1 = adventure, 𝑒2 = fantasy, 𝑒3 = animation
• Assumption: 𝐹 :
𝐹 (𝑒1) = {𝑓1, 𝑓2, 𝑓3, 𝑓6}
𝐹 (𝑒2) = {𝑓1, 𝑓4, 𝑓6}
𝐹 (𝑒3) = {𝑓1, 𝑓3, 𝑓6}
• Soft set (𝐹, 𝐴):
(𝐹, 𝐴) = {(adventure, {𝑓1, 𝑓2, 𝑓3, 𝑓6}), (fantasy, {𝑓1, 𝑓4, 𝑓6}), (animation, {𝑓1, 𝑓3, 𝑓6})}
• 𝑐𝑖- selected value of object 𝑓𝑖 ∈ 𝑈
• 𝑑𝑖𝑗 = 𝑤𝑗 × 𝑓𝑖𝑗 - input data of the weighted table, 𝑤𝑗 ∈ (0, 1]
U adventure, 𝑤1 = 0.8 fantasy, 𝑤2 = 0.3 animation, 𝑤3 = 0.9 Selection Value
𝑓1 1 1 1 𝑐1 = 2
𝑓2 1 0 0 𝑐2 = 0.8
𝑓3 1 0 1 𝑐3 = 1.7
𝑓4 0 1 0 𝑐4 = 0.3
𝑓5 0 0 0 𝑐5 = 0
𝑓6 1 1 1 𝑐6 = 2
Table 1
Binary table with weights assigned to parameters, for adventure = 0.8; for fantasy = 0.3; for animation =
0.9
• In the table, it is evident that the films most corresponding to the selection parameters are
𝑓1 and 𝑓6.
• The same calculations are performed for actors and keywords.
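The weighted-table computation in the example above can be reproduced in a few lines. This sketch assumes the mapping 𝐹 and the weights from the example:

```python
# Soft set from the example: F(e1) = adventure, F(e2) = fantasy, F(e3) = animation
F = {
    "adventure": {"f1", "f2", "f3", "f6"},
    "fantasy":   {"f1", "f4", "f6"},
    "animation": {"f1", "f3", "f6"},
}
weights = {"adventure": 0.8, "fantasy": 0.3, "animation": 0.9}
films = [f"f{i}" for i in range(1, 7)]

# Selection value c_i = sum of w_j over parameters whose subset contains f_i
selection = {
    f: round(sum((w for e, w in weights.items() if f in F[e]), 0.0), 2)
    for f in films
}
print(selection)  # {'f1': 2.0, 'f2': 0.8, 'f3': 1.7, 'f4': 0.3, 'f5': 0.0, 'f6': 2.0}
```

The highest selection values belong to 𝑓1 and 𝑓6, matching the conclusion drawn from Table 1.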
Visualization of the recommendation system:
Figure 1: Input of example preferences and matching results
2.6. K-nearest neighbours
The KNN algorithm is a simple classifier that consists of finding the 𝑘 elements in a given
dataset that are most similar to the test element. It follows these steps:
1. Data Collection: Gathering training data, which will be used to build the
model. Each data point is represented by a set of features and its corresponding
class to be predicted. In this project, data points are movies from the database,
class to be predicted is the vote_average value of the movie.
2. Determining the Value of Parameter K: The parameter K specifies how
many nearest neighbors will be considered during the classification of a new data
point. Choosing an appropriate value for K can significantly impact the
effectiveness of the model. In the movie recommendation system, 𝑘 takes values
from 2 to 9.
3. Calculating Distances: For a new data point whose class is to be predicted,
distances to all points in the training set are calculated. This determines the
similarity between points. In the project, the Manhattan metric is used:
𝑑(𝑝, 𝑞) = ∑𝑖 |𝑝𝑖 − 𝑞𝑖|
4. Selecting K Nearest Neighbors: The next step is to select K training points
that have the closest distances to the point currently tested.
5. Classifying the Point: After selecting the K nearest neighbors, the point is
classified. The method for this is a majority vote, where the class of the new data
point is determined by the dominant class among the K nearest neighbors.
6. Determining the Accuracy: The final step is to assess the performance of the
KNN model. This is done by splitting the data into a training and a testing set,
and then comparing the predicted classes with the actual classes in the testing set.
7. To avoid the dominance of features with larger values, feature normalization is
applied before using KNN:
𝑥norm = (𝑥 − 𝑥min) / (𝑥max − 𝑥min)
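The steps above can be sketched in plain Python. This is a minimal illustration with made-up feature vectors (the feature names and values are hypothetical, not the project's actual data), using min-max normalization, the Manhattan metric, and majority voting as described:

```python
from collections import Counter

def min_max_normalize(rows):
    # Scale each feature column to [0, 1]: (x - x_min) / (x_max - x_min)
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, mins, maxs)]
            for row in rows]

def knn_predict(train_x, train_y, query, k):
    # Manhattan distance from the query to every training point
    dists = [sum(abs(a - b) for a, b in zip(row, query)) for row in train_x]
    # Indices of the k nearest neighbours
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among their classes
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy data: [runtime, release year] features and a rounded vote_average class
rows = [[100, 1990], [110, 1992], [150, 2020], [160, 2021], [155, 2019]]
norm = min_max_normalize(rows)          # normalize all rows together
train_x, query = norm[:4], norm[4]      # last row is the point to classify
train_y = [6, 6, 8, 8]
print(knn_predict(train_x, train_y, query, k=3))  # 8
```

Note that the query is normalized together with the training rows so that both use the same column minima and maxima.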
3. Experiments
This chapter focuses on the experiments conducted to develop a machine learning
model for movie recommendations. By testing various algorithms, we aim to enhance
the accuracy and effectiveness of our recommendation system. Our goal is to better
understand the key factors that contribute to successful movie recommendations,
ultimately improving the user experience.
3.1. Database description
The dataset utilized in this study was sourced from Kaggle.com, a widely recognized
open-access platform renowned for its vast collection of publicly available datasets.
The dataset is titled "TMDb Movies Dataset". It contains 10856 records in total,
spread across 21 columns.
Figure 2: Database
3.2. Evaluation Metric
The model will be evaluated using the accuracy metric. Accuracy is the most popular metric
and shows how often a classification of an ML model is correct overall:
Accuracy = (𝑇𝑃 + 𝑇𝑁) / (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁)
where 𝑇𝑃 (True Positives) represents instances that were accurately identified as
positive, 𝑇𝑁 (True Negatives) represents instances that were accurately identified as
negative, 𝐹𝑁 (False Negatives) are instances where positive cases were incorrectly
identified as negative, and 𝐹𝑃 (False Positives) are instances where negative cases
were incorrectly identified as positive.
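For a multi-class target such as the rounded vote average, accuracy reduces to the fraction of predictions that match the actual class. A quick sketch with illustrative counts (not the paper's results):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the actual class
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical rounded vote_average classes: 3 of 5 predictions are correct
y_true = [6, 7, 7, 8, 6]
y_pred = [6, 7, 8, 8, 7]
print(accuracy(y_true, y_pred))  # 0.6
```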
3.3. Model analysis
Figure 3: Comparing accuracy for different 𝑘 using Standard normalization
Figure 4: Comparing accuracy for different 𝑘 using Min-max normalization
The figures above depict the accuracy of our KNN model across different 𝑘 values
using standard and min-max normalization techniques. Our target variable, the rounded vote
average, poses a challenge due to its unpredictability.
Higher 𝑘 values generally lead to improved model performance, indicating more stable
predictions as more neighbors are considered. Additionally, standard normalization
slightly outperforms min-max normalization, when applied to features like runtime
and release year.
In summary, our experiments highlight the effectiveness of higher 𝑘 values and
standard normalization in enhancing the predictive performance of our movie recom-
mendation system. These findings emphasize the importance of careful normalization and
𝑘 value selection in predictive modeling tasks.
4. Conclusion
This paper presents the design of a personalized movie recommendation system using
soft set theory and the k-nearest neighbors (KNN) algorithm. The main goal is to
build a system that recommends top five movies for a given user according to their
preferences. Additional functionalities include suggesting similar movies to a given
title, and predicting movie ratings based on data features.
Soft set theory provides a powerful tool for dealing with the uncertainty and vagueness
associated with users' preferences. Representing user preferences as soft sets allows
calculating a total score for each movie and making personalized recommendations
that align with users' individual preferences. This shows the flexibility and efficiency
of soft sets in decision-making processes.
The use of the k-nearest neighbours algorithm further expands the project. The KNN
classifier identifies movies similar to a user-specified title and predicts movie ratings
based on the ratings of the records closest in the feature space. The effectiveness of these
predictions was evaluated. The accuracy varies for different values of 𝑘, increasing as
𝑘 increases and reaching its highest value of 44% when 𝑘 equals 9 (the accuracy for higher
values of 𝑘 was not checked).
Experimental results validate the system’s effectiveness in generating personalized
movie recommendations as recommended movies are, in fact, aligning with provided
preferences. The accuracy of KNN classifier reached only 44% because higher values of
𝑘 were not tested and it is hard to predict movie ratings based solely on their features.
Soft set theory and KNN have proven to be a potent combination for creating a recommendation
system that can process diverse user inputs and provide personalized movie
suggestions.
References
[1] D. Molodtsov, “Soft set theory — First results,” Computers & Mathematics with
Applications, vol. 37, no. 4-5, pp. 19-31, 1999.
[2] F. Ricci, L. Rokach, and B. Shapira, “Introduction to Recommender Systems Hand-
book,” in Recommender Systems Handbook, Springer, Boston, MA, 2011, pp. 1-35.
[3] P. Lops, M. De Gemmis, and G. Semeraro, “Content-based Recommender Systems:
State of the Art and Trends,” in Recommender Systems Handbook, Springer, Boston,
MA, 2011, pp. 73-105.
[4] TMDb, The Movie Database, available at: https://www.kaggle.com/datasets/
juzershakir/tmdb-movies-dataset/data