Enhancing Movie Recommenders by means of KNN-based Algorithms Kamil Rojek1 , Rafał Ochorok1 and Maciej Wiencis1 1 Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland Abstract The project concerns a system recommending films using the KNN algorithm. The program in order to find movies, is based on the history of viewed items. Movie Recommender initially chooses the best ones from the history of films, in order to finally give the proposals best suited to the user’s preferences. The model works with data from IMDB [? ] data set downloaded from datasets.imdbws.com. Keywords Movies, Prediction model, Classification algorithms, Personalized movie prediction, Python 1. Introduction 2. Assumptions for algorithms Currently, the most dynamically developing IT tools are Each of the algorithms should be prepared to meet the methods of artificial intelligence [1, 2]. Algorithms sup- following criteria: porting decision-making or supporting inference based 1. Prepared according to the mathematical descrip- on fuzzy sets [3, 4] find a number of applications, among tion of the algorithm; others, in the detection of anomalies on roads [5] or in the control of intelligent home management systems [6, 7]. 2. Optimized for the performance on our data set; At this point, one cannot fail to mention a wide class of 3. Returns an array containing information about heuristic algorithms based on the observation of animal top k (number of predicted movies) movies, bas- behavior [8, 9, 10], which are widely used. The energy ing on top 3 movies from our watch-list. reduction applications [11, 12, 13, 14, 15] are very im- portant. The most common applications relate to the 3. Program description use of [16] neural networks in a wide variety of applica- tions that affect almost every area of life [17, 18, 19, 20]. The task of our project is to create a system recommend- Very interesting applications concern the care of the el- ing films using the KNN algorithm. The program in order derly [21, 22, 23, 24]. Often, neural networks are used to find movies, is based on the history of viewed items. in various types of detection tasks for certain features Movie Recommender initially chooses the best ones from [25, 26, 27, 28]. The use of neural networks also plays a the history of films. Then each video goes through the very important role in machine learning [29, 30]. algorithm so that the program finally gives the proposals Due to digitization of our modern world prediction best suited to the user’s preferences. (Including watch models are extremely crucial in these days. That’s be- history). In this particular cases, KNN uses three metrics: cause they can optimize some of the user’s processes, that Taxi cab, Cosine distances and Euclidean. would facilitate comfort of using a given app. Movies are extremely complex thanks to a lot of variables into which they can be divided. People struggle with choosing 4. KNN history a movie to watch, because they not only might not be aware of their preferences, but also they may not have The origins of KNN can be traced to research conducted enough time to check and compare all data[31]. Whole for the U.S armed forces. Evelyn Fix (1904-1965) was problem could be solved by a program doing all the neces- a mathematician and statistician who taught at Berke- sary calculations for you, basing on user’s watch history ley. Joseph Lawson Hodges Jr. (1922-2000) was a Berke- and reviews. ley statistician who worked with the 20 United States Air Forces (USAF) from 1944. Combining their brilliant minds, in 1951 they wrote a technical analysis report for SYSYEM 2022: 8th Scholar’s Yearly Symposium of Technology, Engi- the USAF. He introduced a discriminant analysis, non- neering and Mathematics, Brunek, July 23, 2022 " kamiroj@polsl.pl (K. Rojek); rafaoch606@polsl.pl (R. Ochorok); parametric classification method. However, the newspa- maciwie234@polsl.pl (M. Wiencis) per was never officially published - most likely due to © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). confidentiality in the aftermath of World War II. CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 49 Kamil Rojek et al. CEUR Workshop Proceedings 49–54 5. Euclidean metrics history The next step is to create a person’s profile to keep a history of the videos watched along with the user’s Euclidean distance is the distance in Euclidean space; rating. Before starting the algorithm, the program selects both concepts are named after ancient Greek mathemati- three top movies according to the user’s rating. Then, cian Euclid, whose Elements became a standard textbook based on this data, it performs calculations to find the in geometry for many centuries.Concepts of length and best matching items in our database. distance are widespread across cultures, can be dated to The metric used in the KNN algorithm is the sum of the the earliest surviving "protoliterate" bureaucratic doc- cosine, taxicab and euclidean distances,between the val- uments from Sumer in the fourth millennium BC (far ues of the film elements we compare, i.e. genres, writers, before Euclid),and have been hypothesized to develop in directors. We use previously created numerical values. children earlier than the related concepts of speed and Formulas used to determine distances between successive time.But the notion of a distance, as a number defined parameters looks like this: from two points, does not actually appear in Euclid’s Ele- • Cosine distance ments. Instead, Euclid approaches this concept implicitly, 𝑢·𝑣 through the congruence of line segments, through the 1− , (1) ‖𝑢‖2 ‖𝑣‖2 comparison of lengths of line segments, and through the concept of proportionality. • Taxi cab distance |𝑢 − 𝑣| (2) 6. Taxi Cab metrics history • Euclidean distance ⎯ ⎸ 𝑛 Taxicab Geometry is a non-Euclidean Geometry that mea- ⎸∑︁ ⎷ (𝑢 − 𝑣)2 (3) sures distance on horizontal and vertical lines. According 𝑖=1 to Taxicab Geometry - History, the taxicab metric was where u i v are the arrays to be compared first introduced by Hermann Minkowski (1864-1909) over 100 years ago; however, it did not get its name until 1952. Taxicab is unique in that it is only one axiom away from The next step is to use the KNN algorithm, which will being a Euclidean metric. In Euclidean Geometry the use the previously described metric, in order to find the minimum distance between two points is the shortest k-nearest neighbors of a given movie. In the algorithm it- line segment between those two points. However, in Taxi- self, we predict finding k neighbors. As the algorithm can cab Geometry there can be multiple minimal distances or receive a maximum of 3 user top videos, it will therefore ‘shortest paths’ made up of line segments perpendicular return a top 3k of the proposed positions. For example, or parallel to the x-axis. Taxicab Geometry - History if we add movies to the history: suggests that modern research on taxicab did not occur • Coffee & Kareem, rating: 8 until as recent as the 1980s. The measurement of distance • Das Cabinet des Dr. Caligari, rating: 9 using vertical and horizontal lines rather than diagonal • The Kid, rating: 8.5 lines has sparked questions about its applications and Program would output: encouraged more research and exploration of this simple Recommended Movies basing on: Coffee & Kareem yet unique metric • The F word 7. Description of the program’s • It’s All Gone Pete Tong operation Recommended Movies basing on: Das Cabinet des Dr. Caligari Initially, we started the project by preparing the data in such a way that we could then carry out calculations • Psycho on them. For this purpose, we downloaded the IMDB • 6 donne per 1’assassino database, which consisted of four files containing data Recommended Movies basing on: The Kid on: • The Circus • Data on the movie itself. • Modern Times • Ratings for individual videos. • The cast of the movie. • Personal data of people participating in the film. 8. Algorithms In order to optimize the algorithm, we do not use the full In this section we will present pseudocodes of the most names of the cast at this stage of operation. important algorithms used by us. 50 Kamil Rojek et al. CEUR Workshop Proceedings 49–54 Data: Input: Id of the first movie 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1, Id of Data: Input: Id of the first movie 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1, Id of the second movie 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 the second movie 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 Result: The lack of data Result: The lack of data genresA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 genresA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 genresB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 genresB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 genreDistance = the euclidean distance between genreDistance = the cosine distance between two two values. values. directorA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 directorA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 directorB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 directorB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 directorDistance = the euclidean distance between directorDistance = the cosine distance between two values. two values. writerA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 writerA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 writerB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 writerB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 writerDistance = the euclidean distance between writerDistance = the cosine distance between two two values. values. return genreDistance + directorDistance + return genreDistance + directorDistance + writerDistance writerDistance Algorithm 1: Cosine distance metric pseudocode Algorithm 3: Euclidean metric pseudocode Data: Input: The name of the movie 𝑛𝑎𝑚𝑒, Amount 𝑘, User Name 𝑢𝑠𝑒𝑟 Result: Featured Videos Data: Input: Id of the first movie 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1, Id of the second movie 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 newMovie = movie name Result: The lack of data distances=[] neighbors = [] genresA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 genresB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 for 𝑚𝑜𝑣𝑖𝑒 in 𝑚𝑜𝑣𝑖𝑒𝑠 do genreDistance = the taxi cab distance between if movie not in history then two. Add distances to the distances array using the ’Similarities’ metric between the directorA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 given movie and the rest of the movies directorB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 in the database. directorDistance = the taxi cab distance between end two values. end distances.sort() writerA = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑1 for 𝑥 in 𝑘 do writerB = genre of 𝑚𝑜𝑣𝑖𝑒𝐼𝑑2 Add to 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠 calculated distances. writerDistance = the taxi cab distance between end two values. for 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 in 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠 do return genreDistance + directorDistance + View featured video data 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 writerDistance end Algorithm 2: Taxicab metric pseudocode Algorithm 4: An algorithm that returns Recom- mended Videos based on user preferences. 51 Kamil Rojek et al. CEUR Workshop Proceedings 49–54 Data: Input: movie’s name, 6. Genres_bin - Converted column ’Genres’ to a k - number of films searched numerical form. Result: Prediction: k - movies’ name 7. Writers_bin - Converted column ’Writers’ to a movie = movie information database row numerical form. neighbors = KNN algorithm using the taxi metric, 8. Directors_bin - Converted column ’Directors’ given k - amount of movies to be found to a numerical form. for 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 in 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠 do Based on the data base above, we have created several neighbors = KNN algorithm using the Cosine rankings that show the popularity ratio of the data that distance metric was used in the KNN algorithm: end for 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 in 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠 do neighbors = KNN algorithm using the Euclidean metric end 𝑟𝑒𝑐𝑜𝑚𝑚𝑒𝑛𝑑𝑒𝑑𝑀 𝑜𝑣𝑖𝑒𝑠 = [] for 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 in 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠 do 𝑎𝑣𝑔𝑅𝑎𝑡𝑖𝑛𝑔 = average rating of the movie (additional information from knn) 𝑟𝑒𝑐𝑜𝑚𝑚𝑒𝑛𝑑𝑒𝑑𝑀 𝑜𝑣𝑖𝑒𝑠 += [𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟, 𝑎𝑣𝑔𝑅𝑎𝑡𝑖𝑛𝑔] end return 𝑟𝑒𝑐𝑜𝑚𝑚𝑒𝑛𝑑𝑒𝑑𝑀 𝑜𝑣𝑖𝑒𝑠 Algorithm 5: An algorithm containing various met- rics to find the best matches for the user. Figure 1: Top genres 9. Data base 9.1. Used Database The following database was used for demonstration pur- poses in a non-commercial, scientific manner - IMDB [32] data set downloaded from datasets.imdbws.com. Tables used: • name.basics.tsv • title.basics.tsv • title.crew.tsv • title.ratings.tsv The database, after our simplifications and prior prepara- tion, contains a collection of 9,827 films with information: Figure 2: Top writers 9.2. Description of the columns The set consists of 6000 rows and 7 columns. A detailed description is provided below: 1. OriginalTitle - Original title of a movie. 2. Genres - List of movie genres. 3. AverageRating - Average rating of a movie. 4. Writers - A list of writers of a given movie. 5. Directors - A list of directors of a given movie. 52 Kamil Rojek et al. CEUR Workshop Proceedings 49–54 [7] F. Bonanno, G. Capizzi, A. Gagliano, C. Napoli, Op- timal management of various renewable energy sources by a new forecasting method, in: SPEEDAM 2012 - 21st International Symposium on Power Elec- tronics, Electrical Drives, Automation and Motion, 2012, pp. 934–940. doi:10.1109/SPEEDAM.2012. 6264603. [8] Y. Zhang, S. Cheng, Y. Shi, D.-w. Gong, X. Zhao, Cost-sensitive feature selection using two-archive multi-objective artificial bee colony algorithm, Ex- pert Systems with Applications 137 (2019) 46–58. [9] M. Ren, Y. Song, W. Chu, An improved locally weighted pls based on particle swarm optimization for industrial soft sensor modeling, Sensors 19 Figure 3: Top directors (2019) 4099. [10] D. Połap, M. Woźniak, C. Napoli, E. Tramontana, R. Damaševičius, Is the colony of ants able to rec- ognize graphic objects?, Communications in Com- 10. Conclusion and future work puter and Information Science 538 (2015) 376–387. doi:10.1007/978-3-319-24770-0_33. In order to improve the operation of the algorithm and [11] M. Woźniak, A. Sikora, A. Zielonka, K. Kaur, M. S. to make the use of it more enjoyable, you can use a more Hossain, M. Shorfuzzaman, Heuristic optimization friendly GUI in the future. To make the algorithm work of multipulse rectifier for reduced energy consump- better, it is also possible to use more data (more extensive tion, IEEE Transactions on Industrial Informatics user history) to further refine the metic used. 18 (2021) 5515–5526. [12] F. Bonanno, G. Capizzi, C. Napoli, Some remarks References on the application of rnn and prnn for the charge- discharge simulation of advanced lithium-ions bat- [1] M. A. Sanchez, O. Castillo, J. R. Castro, Generalized tery energy storage, in: SPEEDAM 2012 - 21st Inter- type-2 fuzzy systems for controlling a mobile robot national Symposium on Power Electronics, Electri- and a performance comparison with interval type- cal Drives, Automation and Motion, 2012, pp. 941– 2 and type-1 fuzzy systems, Expert Systems with 945. doi:10.1109/SPEEDAM.2012.6264500. Applications 42 (2015) 5904–5914. [13] G. M. Khanal, G. Cardarilli, A. Chakraborty, S. Ac- [2] Q.-b. Zhang, P. Wang, Z.-h. Chen, An improved ciarito, M. Y. Mulla, L. Di Nunzio, R. Fazzolari, M. Re, particle filter for mobile robot localization based on A zno-rgo composite thin film discrete memristor, particle swarm optimization, Expert Systems with in: 2016 IEEE International Conference on Semicon- Applications 135 (2019) 181–193. ductor Electronics (ICSE), IEEE, 2016, pp. 129–132. [3] Y. Li, W. Dong, Q. Yang, S. Jiang, X. Ni, J. Liu, Auto- [14] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Lau- matic impedance matching method with adaptive dani, G. Sciuto, Optimal thicknesses determination network based fuzzy inference system for wpt, IEEE in a multilayer structure to improve the spp effi- Transactions on Industrial Informatics 16 (2019) ciency for photovoltaic devices by an hybrid fem - 1076–1085. cascade neural network based approach, in: 2014 In- [4] F. Qu, J. Liu, H. Zhu, D. Zang, Wind turbine condi- ternational Symposium on Power Electronics, Elec- tion monitoring based on assembled multidimen- trical Drives, Automation and Motion, SPEEDAM sional membership functions using fuzzy inference 2014, IEEE Computer Society, 2014, pp. 355–362. system, IEEE Transactions on Industrial Informat- doi:10.1109/SPEEDAM.2014.6872103. ics 16 (2019) 4028–4037. [15] S. Acciarito, A. Cristini, L. Di Nunzio, G. M. Khanal, [5] M. Woźniak, A. Zielonka, A. Sikora, Driving sup- G. Susi, An a vlsi driving circuit for memristor- port by type-2 fuzzy logic control model, Expert based stdp, in: 2016 12th Conference on Ph. D. Re- Systems with Applications 207 (2022) 117798. search in Microelectronics and Electronics (PRIME), [6] M. Woźniak, A. Zielonka, A. Sikora, M. J. Piran, IEEE, 2016, pp. 1–4. A. Alamri, 6g-enabled iot home environment con- [16] V. S. Dhaka, S. V. Meena, G. Rani, D. Sinwar, M. F. trol using fuzzy rules, IEEE Internet of Things Ijaz, M. Woźniak, A survey of deep convolutional Journal 8 (2020) 5442–5452. neural networks applied for prediction of plant leaf diseases, Sensors 21 (2021) 4749. 53 Kamil Rojek et al. CEUR Workshop Proceedings 49–54 [17] G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, neural network-based finger-vein recognition using G. Susi, A spiking neural network-based long-term nir image sensors, Sensors 17 (2017) 1297. prediction system for biogas production, Neural [28] G. Capizzi, G. Lo Sciuto, M. Woźniak, R. Damaše- Networks 129 (2020) 271 – 279. vičius, A clustering based system for automated oil [18] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First spill detection by satellite remote sensing, Lecture studies to apply the theory of mind theory to green Notes in Computer Science (including subseries and smart mobility by using gaussian area cluster- Lecture Notes in Artificial Intelligence and Lecture ing, volume 3118, CEUR-WS, 2021, pp. 71–76. Notes in Bioinformatics) 9693 (2016) 613 – 623. [19] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, [29] A. T. Özdemir, B. Barshan, Detecting falls with Wavelet recurrent neural network with semi- wearable sensors using machine learning tech- parametric input data preprocessing for micro-wind niques, Sensors 14 (2014) 10691–10708. power forecasting in integrated generation systems, [30] K. G. Liakos, P. Busato, D. Moshou, S. Pearson, in: 5th International Conference on Clean Electri- D. Bochtis, Machine learning in agriculture: A cal Power: Renewable Energy Resources Impact, review, Sensors 18 (2018) 2674. ICCEP 2015, 2015, p. 602 – 609. [31] G. Iannizzotto, L. Lo Bello, A. Nucita, G. M. Grasso, [20] G. Borowik, M. Woźniak, A. Fornaia, R. Giunta, A vision and speech enabled, customizable, virtual C. Napoli, G. Pappalardo, E. Tramontana, A soft- assistant for smart environments, in: 2018 11th ware architecture assisting workflow executions International Conference on Human System Inter- on cloud resources, International Journal of Elec- action (HSI), IEEE, 2018, pp. 50–56. tronics and Telecommunications 61 (2015) 17–23. [32] I. IMDb.com, Imdb datasets, 1990-2022. URL: https: doi:10.1515/eletel-2015-0002. //www.imdb.com/interfaces/. [21] M. Wieczorek, J. Siłka, M. Woźniak, S. Garg, M. M. Hassan, Lightweight convolutional neural net- work model for human face detection in risk situa- tions, IEEE Transactions on Industrial Informatics 18 (2021) 4820–4829. [22] S. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud- oriented architecture for the remote assessment and follow-up of hospitalized patients, in: CEUR Workshop Proceedings, volume 2694, CEUR-WS, 2020, pp. 29–35. [23] N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, CEUR-WS, 2021, pp. 51–63. [24] R. Brociek, G. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 89–94. [25] O. Dehzangi, M. Taherisadr, R. ChangalVala, Imu- based gait recognition using convolutional neural networks and multi-sensor fusion, Sensors 17 (2017) 2735. [26] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. Nowicki, J. Starczewski, A mul- tiscale image compressor with rbfnn and discrete wavelet decomposition, in: Proceedings of the In- ternational Joint Conference on Neural Networks, volume 2015-September, Institute of Electrical and Electronics Engineers Inc., 2015. doi:10.1109/ IJCNN.2015.7280461. [27] H. G. Hong, M. B. Lee, K. R. Park, Convolutional 54