Suggesting a Specific Factor-driven Career Choice using KNN and Soft Set Algorithms

Joanna Bodora¹, Jadwiga Cader¹ and Nikola Gębka¹

¹ Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

SYSYEM 2022: 8th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Brunek, July 23, 2022
bodorajoanna@polsl.pl (J. Bodora); jadwcad575@polsl.pl (J. Cader); nikogeb061@polsl.pl (N. Gębka)

Abstract
Choosing the perfect career path is not an easy task, especially in the IT sector. Data science and the jobs connected with this field have lately been growing more and more popular. To reduce the time spent on finding the perfect position in data science, the authors present a solution that selects the best job based on factors introduced by the user. The final job title is the result of combining a soft set algorithm with the analyzed accuracies of k-nearest neighbours classifiers run with different k parameters and on various collections.

Keywords
Soft set, k-nearest neighbours, Classification

1. Introduction

Nowadays, IT systems [1, 2] very often use artificial intelligence methods, which allow them not only to acquire and process data [3], but also to infer and support the decision-making process based on it. One important branch of artificial intelligence systems is fuzzy sets [4, 5, 6], which are used in numerous applications, among others in the detection of pavement damage [7] or in smart home management [8, 9]. The second important direction of applications is optimization algorithms [10, 11, 12, 13], used in processes whose aim is to minimize or maximize an objective function [14, 15, 2]. An interesting application of a heuristic algorithm concerns the reduction of energy consumption [16, 17, 18]. An important part of optimization algorithms are algorithms modeled on the behavior of animals cooperating in large groups [19, 20]. These algorithms, imitating the behavior of a community, e.g. of ants or bees, allow the goal to be achieved quickly and effectively. The third direction of the development of artificial intelligence is the family of methods based on artificial neural networks [21, 22]. They are widely used in medicine, in the care of the elderly [23, 24, 25], in detection [26, 27], as well as in machine learning [28, 29, 30, 31].

We created a program that allows the user to choose a career path based on specific factors. The program makes it possible to select the optimal result using the k-nearest neighbours algorithm together with soft sets. We create a soft set table holding the accuracy of the various distance calculation methods used in the KNN algorithm: its columns are the specific factors on which we focus, its rows are the successive KNN variants, and its cells contain the accuracy obtained with a specific KNN variant.

The program is written in Python, has no graphical interface and is executed in the IDE. The data takes the form of a database in a .csv file, while the user enters the weights and values for each column as a list directly in the program.

2. K-nearest neighbours Algorithm

The k-nearest neighbours algorithm is a ranking algorithm: it evaluates to which group the point considered in the current iteration of the algorithm belongs. The classification works by counting the nearest neighbouring points in each group; the result is returned based on a majority vote.

Data analysis is based on clustering. The program classifies data using different variants of the KNN (k-nearest neighbours) algorithm. KNN consists in finding the k already classified elements (neighbours) closest to the new element and assigning this element to the group to which most of its neighbours belong.
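The voting step just described can be illustrated with a minimal sketch. The paper does not reproduce its source listing at this point, so the snippet below is our reconstruction, assuming precomputed distances and job-title labels; all names are hypothetical.

```python
from collections import Counter


def knn_vote(distances, labels, k):
    """Majority vote among the k nearest labelled records."""
    # Pair each record's distance with its job title and sort ascending.
    ranked = sorted(zip(distances, labels), key=lambda pair: pair[0])
    # Count the job titles among the k closest neighbours.
    votes = Counter(label for _, label in ranked[:k])
    # The title with the most votes wins.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    dists = [0.3, 1.2, 0.7, 0.4, 2.5]
    jobs = ["Data Scientist", "Data Engineer", "Data Scientist",
            "Data Analyst", "Data Engineer"]
    print(knn_vote(dists, jobs, k=3))  # -> Data Scientist
```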
Several metrics can be used to determine similarity; this program uses two of them: Manhattan (taxicab) and Minkowski.

Manhattan metric:

$$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i| \qquad (1) $$

Where:
d – distance,
x – value of a sample,
y – value of a classified element,
n – number of elements in the sample.

Minkowski metric – a modified Euclidean metric:

$$ L_m(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^m \right)^{1/m} \qquad (2) $$

Where:
d – distance,
x – value of a sample,
y – value of a classified element,
n – number of elements in the sample,
m – the order of the metric (a small positive integer).

After calculating the distances, the data is clustered: the distances are first sorted in ascending order, then a vote is taken on the basis of a 1:1 match of the sample attribute to the test set attribute – matching elements are added to the common set. Then the percentage share of the searched elements in relation to the entire data set is calculated:

$$ \mathrm{Accuracy} = \frac{\text{size of the set of matched elements}}{\text{size of the whole data set}} \times 100\% $$

The variable k largely determines the behavior of the classifier: it sets the number of nearest neighbours that decide on the classification of an element. It is a natural number. This parameter is arbitrary, but if we want the classifier to work efficiently, we must make a few assumptions:

• k must be greater than the square root of the number of all classified elements,

$$ k \geq \sqrt{n}, $$

where n is the number of classified elements.

• If the number of groups is even, k must be odd; otherwise, k must be even:

$$ k = \begin{cases} 2a + 1, & 2 \mid c \\ 2a, & \text{otherwise} \end{cases} \qquad (3) $$

where c is the number of groups and a ∈ ℕ.

• k must be greater than the number of groups,

$$ k > c. $$

This algorithm is used for both regression and classification. It is useful when dependencies between objects of the same classes are difficult to interpret.

3. Soft Set

Let U be an initial universe set and E a set of parameters or attributes relative to U. Let P(U) denote the power set of U and let A ⊆ E. The pair (F, A) is called a soft set over U, where F is the mapping given by F : A → P(U). In other words, a soft set (F, A) over U is a parameterized family of subsets of U. For e ∈ A, F(e) can be considered the set of e-elements or e-approximate elements of the soft set (F, A). Thus, (F, A) is defined as:

$$ (F, A) = \{ F(e) \in P(U) : e \in E,\; F(e) = \emptyset \text{ if } e \notin A \} $$

The factors indicated by the user are combined into a weighted sum:

$$ \sum_{i=1}^{n} s_i \cdot w_i \qquad (4) $$

• s_i – element of the sample,
• w_i – weight,
• n – length of the sample.

4. Other methods used

Cross validation – a statistical method involving the division of a statistical sample into subsets; analyses are then conducted on the training set, while the test set is used to confirm the plausibility of the results.

Rule extraction – rejection of variables not useful in the study.

Data normalization is the scaling of data into a range. Min-max normalization uses a linear function to reduce the data to the interval specified by the user (newmin, newmax). At the same time, we should know the range that the data can take; if we do not know it, we can use the largest and the smallest value in the analyzed set:

$$ x' = \frac{x - \min}{\max - \min} \cdot (\mathrm{newmax} - \mathrm{newmin}) + \mathrm{newmin} $$
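To make these definitions concrete, here is a small Python sketch of the two metrics (equations (1) and (2)), the k-selection assumptions (including equation (3)), and min-max normalization. It is our illustration under the stated assumptions, not the authors' listing; the function names are ours.

```python
import math


def manhattan(x, y):
    # Equation (1): sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, m=2):
    # Equation (2): m-th root of summed m-th powers; m = 2 is Euclidean.
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def pick_k(n, c):
    """Smallest k meeting the three assumptions: k >= sqrt(n), k > c,
    and parity opposite to an even group count (equation (3))."""
    k = max(math.ceil(math.sqrt(n)), c + 1)
    wants_odd = (c % 2 == 0)          # even number of groups -> odd k
    if (k % 2 == 1) != wants_odd:
        k += 1
    return k


def min_max(values, new_min=0.0, new_max=1.0):
    # Min-max normalization into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]
```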
π‘₯β€² = π‘šπ‘Žπ‘₯βˆ’π‘šπ‘–π‘› π‘₯βˆ’π‘šπ‘–π‘› Β· π‘›π‘’π‘€π‘šπ‘Žπ‘₯ βˆ’ π‘›π‘’π‘€π‘šπ‘–π‘› + π‘›π‘’π‘€π‘šπ‘–π‘› Otherwise, k must be even. {οΈƒ This algorithm is used for both regression and classi- 2π‘Ž + 1, 𝑐|2 fication. Useful when dependencies between objects of π‘˜= (3) 2π‘Ž, otherwise the same classes are difficult to interpret. c – number of groups, π‘Ž ∈ 𝑁 5. Database β€’ K must be greater than the number of groups The project was created with the use of a database taken π‘˜>𝑐 from the website https://www.kaggle.com. The database deals with salaries in individual professions in work related to the field of data analysis. 3. Soft Set Database link: Let π‘ˆ be the initial infinite set and 𝐸 the set of https://www.kaggle.com/datasets/saurabhshahane/ parameters or attributes relative to π‘ˆ . Let 𝑃 (π‘ˆ ) denote data-science-jobs-salaries the power set π‘ˆ i 𝐴 βŠ† 𝐸. The (𝐹, 𝐴) pair is called 42 Joanna Bodora et al. CEUR Workshop Proceedings 41–48 6. Implementation of KNN algorithm The final program was developed to return best KNN algorithms based on accuracy which we get from ana- lyzing different options. We implemented two types of KNN algorithms, one based on distances between val- Figure 1: Histogram presenting values of annual salary ues of sample and dataset tried to give best job position sorting by distances and summing appearance of various job titles. Second algorithm also calculated distances but the classified data. Also salaries in different jobs posi- firstly it focused on getting a specific category of work tions overlap in ranges, which may make it difficult to and then from this limited collection of data it returned distinguish positions based on the amount of salary. nearest neighbours for job positions. Both of these al- Plots Fig. 4 and Fig. 5 presenting the connection be- gorithms were closely analyzed and results showed that tween the location or the nationality of employee and the classic KNN algorithm without any categorization gives amount of salary shows that the research was conducted best accuracy. mainly on the American market, also the scope of salaries of employees of different nationalities and companies Data: Input π‘ π‘Žπ‘šπ‘π‘™π‘’, π‘‘π‘Žπ‘‘π‘Žπ‘‡ π‘Žπ‘, π‘˜ from other countries rather coincides, i.e. the amount Result: π‘—π‘œπ‘π‘‡ 𝑖𝑑𝑙𝑒 of the salary does not depend on the citizenship of the 𝑑𝑖𝑠𝑑 := []; employee or the country in which he works. Therefore it π‘π‘™π‘Žπ‘ π‘ π‘’π‘  := []; can be concluded that there are certain salary scales that while 𝑖 < 𝑙𝑒𝑛(π‘‘π‘Žπ‘‘π‘Žπ‘‡ π‘Žπ‘) do are offered in IT positions in data analysis regardless of Calculate distance between sample and record location or nationality. in π‘‘π‘Žπ‘‘π‘Žπ‘‡ π‘Žπ‘, save it to 𝑑𝑖𝑠𝑑; Fig. 6 shows the connection between the employee’s end level of experience and his salary. The highest rate was Add 𝑑𝑖𝑠𝑑 as new column to π‘‘π‘Žπ‘‘π‘Žπ‘‡ π‘Žπ‘; offered to the person with the greatest responsibility, Sort π‘‘π‘Žπ‘‘π‘Žπ‘‡ π‘Žπ‘ by column 𝑑𝑖𝑠𝑑; i.e. working in an executive position, for example the for 𝑖 in range(0,k) do position of director, leader or project manager. Then Save number of different job title’s the seniors have the highest stake. The lowest stake is occurrences for π‘˜ first records in π‘‘π‘Žπ‘‘π‘Žπ‘‡ π‘Žπ‘ to accumulated in the junior experience group. There are π‘π‘™π‘Žπ‘ π‘ π‘’π‘ ; also single outliers in each group. end Fig. 
7. Analyzing dataset

Figure 1: Histogram presenting values of annual salary

The histogram in Fig. 1 and the plot in Fig. 3 show the distribution of earnings in the field of data science. They inform us that there are over 160 people earning between 0 and 10000 USD per year. We note that earnings accumulate in the range of approximately 50000 to 200000 USD; the remaining values are sporadic, and we treat them as outliers.

Figure 2: Plot presenting values of annual salary according to the job title

From the chart in Fig. 2 we obtain information about the earnings for a specific position. We also note the number of records that define a given job. Positions such as Data Scientist or Data Engineer have more records than, for example, Data Specialist, which appears only once in the database. Not having the same number of records for different positions will affect the accuracy of the classified data. Salaries in different job positions also overlap in range, which may make it difficult to distinguish positions based on the amount of salary.

Figure 3: Plot presenting values of annual salary
Figure 4: Plot presenting values of annual salary according to the nationality of an employee
Figure 5: Plot presenting values of annual salary according to the company location

The plots in Fig. 4 and Fig. 5, presenting the connection between the location or the nationality of an employee and the amount of salary, show that the research was conducted mainly on the American market; moreover, the salary ranges of employees of different nationalities and of companies from other countries largely coincide, i.e. the amount of the salary does not depend on the citizenship of the employee or the country in which he works. It can therefore be concluded that there are certain salary scales offered in IT positions in data analysis regardless of location or nationality.

Figure 6: Plot presenting values of annual salary according to the employee's experience level

Fig. 6 shows the connection between the employee's level of experience and his salary. The highest rate was offered to the person with the greatest responsibility, i.e. working in an executive position, for example as a director, leader or project manager. Next, seniors have the highest rates, while the lowest rates accumulate in the junior experience group. There are also single outliers in each group.

Figure 7: Plot presenting values of annual salary according to the company size

Fig. 7 checks whether there is any connection between the amount of the salary and the company's size. We may notice the lack of large differences in rates for employees from various companies.

Figure 8: Pie chart presenting the percentage of different types of work
Figure 9: Pie chart presenting the percentage of different forms of employment

The pie charts in Fig. 8 and Fig. 9 were generated to verify the percentages of the various work modes and types of employment. They show that remote or semi-remote work is provided in almost 85 percent of positions, while full-time employment predominates among the types of employment.

Summing up, the available data does not stand out for a specific group of job positions or, for example, for a certain company location, which may result in difficulty classifying it and in lower accuracy. The lack of visible boundaries in the rates with respect to the size of the company shown in Fig. 7, and the small number of records for certain positions (Fig. 2), will be factors that make classification difficult. Also, the predominance of company locations and employee citizenships from the United States means the data reflects the reality of developed countries rather than of the global market.

8. Analyzing KNN performance

The presented KNN algorithms achieved an accuracy between 37% and 89% for classification based on job title. The data was divided into a 30% testing and a 70% training part. The results were analyzed to determine the perfect combination of dataset, k parameter and distance metric used in the KNN algorithm. We focused on two types of distance metrics, Minkowski and Manhattan. The comparison test consists in checking the performance of the KNN algorithm on the normalized dataset, the non-normalized dataset, and the dataset normalized only in the salary column.
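The comparison just described can be sketched as a grid search over metrics and k values on a random 70/30 split. This is our illustrative reconstruction on top of the hypothetical `knn_job_title` helper above, not the authors' code.

```python
def evaluate(data, feature_cols, ks, metrics, test_frac=0.3, seed=0):
    """Accuracy (in %) of every (metric, k) pair on a 70/30 split."""
    test = data.sample(frac=test_frac, random_state=seed)
    train = data.drop(test.index)
    results = {}
    for name, metric in metrics.items():
        for k in ks:
            # Count test records whose predicted title matches the truth.
            hits = sum(
                knn_job_title(row[feature_cols].to_dict(), train,
                              feature_cols, k, metric) == row["job_title"]
                for _, row in test.iterrows()
            )
            results[(name, k)] = 100.0 * hits / len(test)
    return results


# Example call with the helpers sketched earlier:
# evaluate(df, ["salary", "experience"], ks=range(3, 12),
#          metrics={"manhattan": manhattan, "minkowski": minkowski})
```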
Figure 10: Results and plots presenting the impact of the k parameter on the accuracy of KNN classification with categorization, on non-normalized values, using the Minkowski and Manhattan distance metrics

The graphs presented in Fig. 10 show the influence of k on the accuracy of the k-nearest neighbours algorithm when an additional column of job categories is used. We can see that for k equal to 8 there is a sudden decrease in accuracy for both the Minkowski and the taxicab method of distance calculation; then, from k equal to 9, the values decrease. Better accuracy is obtained using the Manhattan distance metric.

Figure 11: Results and plots presenting the impact of the k parameter on the accuracy of KNN classification with categorization, on normalized values, using the Minkowski and Manhattan distance metrics

The impact of k on accuracy shown in Fig. 11 informs us that normalizing all columns with little variation in the data does not allow the algorithm to classify properly. As a result, we get low accuracy of the algorithm's operation. Therefore, in further work, despite re-verifying the operation on normalized values, we gave up using this normalized data due to the very low accuracy.

Figure 12: Results and plots presenting the impact of the k parameter on the accuracy of KNN classification with categorization, on values normalized only in the salary column, using the Minkowski and Manhattan distance metrics

We may notice in Fig. 12 that normalizing only the salary column itself, which initially takes values in the thousands, increases the accuracy of classification when job type categorization is used. Once more, the taxicab metric is the better method of calculating distances.

Figure 13: Results and plots presenting the impact of the k parameter on the accuracy of KNN classification of non-normalized values using the Minkowski and Manhattan distance metrics

The graph and table in Fig. 13 show the accuracies for different k using the classic KNN algorithm without additional categorization. The accuracy values are practically the same, with minimal variation depending on the distance metric used.

Figure 14: Results and plots presenting the impact of the k parameter on the accuracy of KNN classification of normalized values using the Minkowski and Manhattan distance metrics

Working on completely normalized data in each of the columns turns out to be pointless due to the very low accuracy we obtain regardless of the parameter k (Fig. 14). Therefore, in the table created for the soft set algorithm, we do not take into account the accuracies obtained when working on this type of data set.

Figure 15: Results and plots presenting the impact of the k parameter on the accuracy of KNN classification of values normalized only in the salary column using the Minkowski and Manhattan distance metrics

In the graphs in Fig. 15 we may notice that the parameter k affects the determination of the accuracy. In the graphs on the left, which use the Minkowski metric to calculate the distance, we see that the accuracy remains high for the initial 4 values of k and then gradually decreases. On the other hand, when using the Manhattan metric, values decrease from the initial k.

9. Experiments

Figure 16: Table showing the final accuracies for selected algorithms on specific data

Fig. 16 presents the table obtained for the operation of the soft set algorithm. Its rows contain the accuracies for the successive KNN algorithms, using the given parameter k as well as a specific data set. We obtain this soft set table after analyzing which parameters k give the best accuracy.

10. Conclusion

As we can see, the presented solution allows the user to find the perfect job position based on the factors on which he or she focuses. Thanks to the in-depth exploration of the data set, we could distinguish the best combinations of the KNN algorithm in terms of the k parameter, the distance metric and the data set itself. By creating a soft set table of the accuracies of the different KNN solutions, we obtain the best algorithm, one which also gives the utmost importance to the factors we focus on the most.

A. Online Resources

The sources for the solution are available via

• GitHub
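As a complement to the linked sources, the final selection step can be sketched as follows. This is a hypothetical illustration of combining the soft set accuracy table with the user's weights (equation (4)); the variant names, weights and values are invented for the example.

```python
import pandas as pd


def best_algorithm(accuracy_table, weights):
    """Pick the KNN variant with the highest weighted accuracy.

    accuracy_table: rows = KNN variants, columns = factors/data sets,
    cells = accuracies in percent (the soft set table of Fig. 16).
    weights: the user's factor weights, as in equation (4).
    """
    w = pd.Series(weights)
    # Weighted sum per row: sum_i s_i * w_i over the factor columns.
    scores = accuracy_table[w.index].mul(w).sum(axis=1)
    return scores.idxmax(), scores


if __name__ == "__main__":
    table = pd.DataFrame(
        {"salary": [72.0, 65.0], "experience": [81.0, 88.0]},
        index=["manhattan_k5", "minkowski_k5"],  # hypothetical variants
    )
    best, _ = best_algorithm(table, {"salary": 0.7, "experience": 0.3})
    print(best)  # variant with the highest weighted accuracy
```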
References

[1] M. A. Sanchez, O. Castillo, J. R. Castro, Generalized type-2 fuzzy systems for controlling a mobile robot and a performance comparison with interval type-2 and type-1 fuzzy systems, Expert Systems with Applications 42 (2015) 5904–5914.
[2] Q.-b. Zhang, P. Wang, Z.-h. Chen, An improved particle filter for mobile robot localization based on particle swarm optimization, Expert Systems with Applications 135 (2019) 181–193.
[3] W. Dong, M. Woźniak, et al., Denoising aggregation of graph neural networks by using principal component analysis, IEEE Transactions on Industrial Informatics (2022).
[4] Y. Li, W. Dong, Q. Yang, S. Jiang, X. Ni, J. Liu, Automatic impedance matching method with adaptive network based fuzzy inference system for WPT, IEEE Transactions on Industrial Informatics 16 (2019) 1076–1085.
[5] F. Qu, J. Liu, H. Zhu, D. Zang, Wind turbine condition monitoring based on assembled multidimensional membership functions using fuzzy inference system, IEEE Transactions on Industrial Informatics 16 (2019) 4028–4037.
[6] A. Carpenzano, R. Caponetto, L. Lo Bello, O. Mirabella, Fuzzy traffic smoothing: An approach for real-time communication over ethernet networks, in: 4th IEEE International Workshop on Factory Communication Systems, IEEE, 2002, pp. 241–248.
[7] M. Woźniak, A. Zielonka, A. Sikora, Driving support by type-2 fuzzy logic control model, Expert Systems with Applications 207 (2022) 117798.
[8] M. Woźniak, A. Zielonka, A. Sikora, M. J. Piran, A. Alamri, 6G-enabled IoT home environment control using fuzzy rules, IEEE Internet of Things Journal 8 (2020) 5442–5452.
[9] C. Napoli, G. Pappalardo, E. Tramontana, Improving files availability for BitTorrent using a diffusion model, in: Proceedings of the Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, WETICE, IEEE Computer Society, 2014, pp. 191–196. doi:10.1109/WETICE.2014.65.
[10] T. Qiu, B. Li, X. Zhou, H. Song, I. Lee, J. Lloret, A novel shortcut addition algorithm with particle swarm for multisink internet of things, IEEE Transactions on Industrial Informatics 16 (2019) 3566–3577.
[11] D. Yu, C. P. Chen, Smooth transition in communication for swarm control with formation change, IEEE Transactions on Industrial Informatics 16 (2020) 6962–6971.
[12] G. Capizzi, G. Lo Sciuto, C. Napoli, R. Shikler, M. Woźniak, Optimizing the organic solar cell manufacturing process by means of AFM measurements and neural networks, Energies 11 (2018).
[13] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, An advanced neural network based solution to enforce dispatch continuity in smart grids, Applied Soft Computing Journal 62 (2018) 768–775.
[14] J. Yi, J. Bai, W. Zhou, H. He, L. Yao, Operating parameters optimization for the aluminum electrolysis process using an improved quantum-behaved particle swarm algorithm, IEEE Transactions on Industrial Informatics 14 (2017) 3405–3415.
[15] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, in: Proceedings - 2013 7th International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS 2013, 2013, pp. 529–534. doi:10.1109/CISIS.2013.96.
[16] F. Bonanno, G. Capizzi, C. Napoli, Some remarks on the application of RNN and PRNN for the charge-discharge simulation of advanced lithium-ions battery energy storage, in: SPEEDAM 2012 - 21st International Symposium on Power Electronics, Electrical Drives, Automation and Motion, 2012, pp. 941–945. doi:10.1109/SPEEDAM.2012.6264500.
[17] M. Woźniak, A. Sikora, A. Zielonka, K. Kaur, M. S. Hossain, M. Shorfuzzaman, Heuristic optimization of multipulse rectifier for reduced energy consumption, IEEE Transactions on Industrial Informatics 18 (2021) 5515–5526.
[18] F. Bonanno, G. Capizzi, A. Gagliano, C. Napoli, Optimal management of various renewable energy sources by a new forecasting method, in: SPEEDAM 2012 - 21st International Symposium on Power Electronics, Electrical Drives, Automation and Motion, 2012, pp. 934–940. doi:10.1109/SPEEDAM.2012.6264603.
[19] M. Ren, Y. Song, W. Chu, An improved locally weighted PLS based on particle swarm optimization for industrial soft sensor modeling, Sensors 19 (2019) 4099.
[20] Y. Zhang, S. Cheng, Y. Shi, D.-w. Gong, X. Zhao, Cost-sensitive feature selection using two-archive multi-objective artificial bee colony algorithm, Expert Systems with Applications 137 (2019) 46–58.
[21] V. S. Dhaka, S. V. Meena, G. Rani, D. Sinwar, M. F. Ijaz, M. Woźniak, A survey of deep convolutional neural networks applied for prediction of plant leaf diseases, Sensors 21 (2021) 4749.
[22] C. Napoli, F. Bonanno, G. Capizzi, An hybrid neuro-wavelet approach for long-term prediction of solar wind, Proceedings of the International Astronomical Union 6 (2010) 153–155.
[23] M. Woźniak, M. Wieczorek, J. Siłka, D. Połap, Body pose prediction based on motion sensor data and recurrent neural network, IEEE Transactions on Industrial Informatics 17 (2020) 2101–2111.
[24] S. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, in: CEUR Workshop Proceedings, volume 2694, CEUR-WS, 2020, pp. 29–35.
[25] N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, CEUR-WS, 2021, pp. 51–63.
[26] O. Dehzangi, M. Taherisadr, R. ChangalVala, IMU-based gait recognition using convolutional neural networks and multi-sensor fusion, Sensors 17 (2017) 2735.
[27] H. G. Hong, M. B. Lee, K. R. Park, Convolutional neural network-based finger-vein recognition using NIR image sensors, Sensors 17 (2017) 1297.
[28] A. T. Özdemir, B. Barshan, Detecting falls with wearable sensors using machine learning techniques, Sensors 14 (2014) 10691–10708.
[29] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic RGB inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 66–74.
[30] R. Brociek, G. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of COVID-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 89–94.
[31] K. G. Liakos, P. Busato, D. Moshou, S. Pearson, D. Bochtis, Machine learning in agriculture: A review, Sensors 18 (2018) 2674.