An empirical comparison of machine learning clustering methods in the study of Internet addiction among students majoring in Computer Sciences

Oksana Klochko[0000-0002-6505-9455]
Vinnytsia Mykhailo Kotsiubynskyi State Pedagogical University, 32, Ostrozhskogo Str., Vinnytsia, 21100, Ukraine
klochkoob@gmail.com

Vasyl Fedorets[0000-0001-9936-3458]
Institute of Higher Education of the NAES of Ukraine, 9, Bastionna Str., Kyiv, 01014, Ukraine
bruney333@yahoo.com

Abstract. One current line of research in machine learning is the analysis of how well different methods suit a specific problem. We study this issue using methods for solving the clustering problem as an example. A considerable number of learning algorithms can be used for clustering, but not every method is suitable for a given task. The article describes a procedure for the empirical comparison of clustering methods using the free WEKA machine learning software. The comparison is based on the results of a survey conducted among students majoring in Computer Sciences and aimed at detecting signs of Internet addiction (IA), a behavioural disorder that occurs due to Internet misuse. The empirical comparison of the Expectation Maximization, Farthest First and K-Means clustering algorithms in WEKA revealed the peculiarities of applying these methods to feature clustering. The authors developed models for clustering data instances in order to detect signs of Internet addiction among students majoring in Computer Sciences. The study concludes that these methods may be applicable to the development of models that detect groups of respondents with signs of IA-related disorders.
Keywords: Empirical Comparison, Machine Learning, Clustering, Internet addiction (IA), IA detection, Internet disorders, Expectation Maximization, Farthest First, K-Means.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

One research direction in machine learning is the empirical analysis of methods for solving a specific problem. Let us study this issue using methods for solving the clustering problem as an example. Clustering methods are statistical methods of data analysis that group a given selection of data samples into clusters (classes, taxons) depending on the values of their attributes; each of these groups has certain characteristics. The main idea is to apply several clustering methods, compare them empirically, and determine which of them group the data best for a specific problem.

In machine learning, clustering is treated as an unsupervised learning problem. Currently, there is a considerable number of machine learning algorithms that can be used for clustering, for instance, Expectation Maximization, K-Means, Hierarchical Clustering etc. But not all of them are suitable for solving a specific problem. Data clustering algorithms differ in the cluster model type, the algorithm model type, the nesting hierarchy of clusters, the way they are implemented depending on the data set etc. As a result, each algorithm also places certain requirements on the parameters of the data set.

Popular software products used in machine learning include TensorFlow, WEKA, MATLAB, MXNet, Torch, PyTorch, Microsoft Azure Machine Learning Studio and others. In this article, we use the free WEKA (Waikato Environment for Knowledge Analysis) machine learning software [19], which gives direct access to a library of implemented algorithms written in Java.
An analysis of contemporary studies and publications shows that the selection of the machine learning method best suited to processing a concrete data set is a popular topic in scientific circles. A considerable number of these studies are dedicated to the application of machine learning methods in the fields of healthcare and life safety.

In their article "A Performance Comparison of Machine Learning Classification Approaches for Robust Activity of Daily Living Recognition", Rida Ghafoor Hussain, Mustansar Ali Ghazanfar, Muhammad Awais Azam, Usman Naeem and Shafiq Ur Rehman studied the application of machine learning classification methods to find ways of ensuring independent daily living for people who have Alzheimer’s disease [7]. The idea of the study is to analyze the data registered by different equipment in order to determine the changes in a person’s behavior that are relevant for daily life and social interaction. The paper compares the efficiency of five machine learning classification techniques used for the recognition of a person’s activity (and his/her psychological status). The experimental findings show that, compared to traditional methodologies, these approaches give better results in determining the activity of the person and his/her psychological and behavioral peculiarities.

Jonas Krämer, Jonas Schreyögg and Reinhard Busse studied the speed and efficiency of medical aid provision using the databases of the Hospital ER [13]. Applying the Random forest algorithm, the authors developed a model based on the data about the patient’s provisional diagnosis. The use of a supervised machine learning method and model training based on the opinion of a specialized doctor allowed them to achieve high forecasting accuracy (96%) and an area under the receiver operating characteristic curve above 0.99.

Abdulhamit Subasi, Jasmin Kevric and M.
Abdullah Canbaz developed a hybrid model for detecting epileptic seizures that uses the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) to determine the optimal parameters of the Support Vector Machine (SVM) algorithm [17]. The hybrid algorithm that they suggested demonstrates a data set classification accuracy of up to 99.38%.

A considerable number of papers have appeared that are dedicated to diagnosing Internet addiction (IA) and studying the mechanisms of this disorder among various social groups. The appearance and use of the Internet has many benefits. At the same time, however, disorders related to pathological use of the Internet are becoming a social as well as a psychological problem. Currently, we face an important psychological, sociocultural and educational issue: the detection and prevention of certain pathologies and steady premorbid conditions (the state before the disease) caused by inadequate Internet use.

Cases of IA were first mentioned in 1995 and attracted considerable attention. Related issues became the research subject of many scientists, including Lyudmyla Yuryeva and Tatyana Bolbot [10], Marharyta Derhach [4] and others. Internet Addiction Disorder (IAD) is also called Pathological Internet Use (PIU). The term “Internet Addiction” was first suggested by Ivan K. Goldberg in 1995. He describes net addiction as a specific pathology characterized by a wide spectrum of behavioral and impulse control disorders (lack of control, absence of voluntary regulation) [1]. In 1996 Goldberg made the first attempt to determine groups of behavioural and psychological signs and symptoms of IA [18], namely: tolerance; abstinence syndrome; difficulties in voluntary regulation of Internet behaviour; an increase of time and financial investments in things related to Internet or computer use; a shift of a person’s interests towards Internet-related activities; extensive Internet use that leads to maladjustment. In 1998 Kimberly S.
Young defined IAD as an impulsive-compulsive disorder, which has specific signs or addictions [20; 21]: cyber-sexual addiction, cyber-relationship addiction, net compulsions, information overload and computer addiction. IAD is not officially included in ICD-11 for Mortality and Morbidity Statistics (Version: 04/2019). However, section 6C51 Gaming disorder describes “Gaming disorder” as a “pattern of persistent or recurrent gaming behaviour (‘digital gaming’ or ‘video-gaming’), which may be online (i.e., over the Internet)” [8].

Even though the problem of IA is becoming more and more relevant, there are not enough scientific papers dedicated to the study of this issue with the help of machine learning methods. Let us look at some of them. Using the Support Vector Machine algorithm, including C-SVM and ν-SVM, and applying Student’s t-test to the data set of a survey conducted among 2,397 Chinese students, Zonglin Di, Xiaoliang Gong, Jingyu Shi, Hosameldin O. A. Ahmed and Asoke K. Nandi demonstrated the utility of machine learning methods for detecting and forecasting the risk of IA [5]. Wen-Huai Hsieh, Dong-Her Shih, Po-Yuan Shih and Shih-Bin Lin suggested using EMBAR, a protected system of web services based on ensemble classification methods and case-based reasoning, to study the IA of users and prevent the development of this disorder at the initial stages [6]. Hong-Ming Ji, Liang-Yu Chen and Tzu-Chien Hsiao are currently continuing their research, which aims to create an IA detector that would work in real time [9]. The authors suggest studying this issue using an adapted system of continuous real-coded variables (XCSR), which determines the level of Internet addiction (high-risk or low-risk) on the basis of information about the Internet users obtained with the Chen Internet addiction scale (CIAS) or respiratory instantaneous frequency (IF) [9].
Thus, based on the problem statement above and taking into consideration the insufficient amount of research on the application of machine learning methods to IA diagnosis, we define the aim of our research: to conduct an empirical comparison of clustering methods within the WEKA machine learning system in the course of studying the IA disorder among students majoring in Computer Sciences.

2 Selection of methods and diagnostics

Data on the spread and severity of IA among students majoring in Computer Sciences were obtained from an online survey that used a questionnaire drafted with the help of Google Forms. 263 students majoring in Computer Sciences and coming from different oblasts of Ukraine participated in the experimental study. The data set is presented in the ARFF format and consists of 8 attributes (Fig. 1). The data set contains the fields described in Table 1.

@relation answer_IA
@attribute age numeric
@attribute sex {female,male}
@attribute 3 {no,undefined,yes}
@attribute 4 {no,undefined,yes}
@attribute 5 {no,undefined,yes}
@attribute 6 {no,undefined,yes}
@attribute 7 {no,undefined,yes}
@attribute 8 {no,undefined,yes}
@data
18,male,yes,no,no,no,no,yes
28,male,undefined,no,no,no,no,yes
20,female,yes,yes,yes,no,no,no
22,male,yes,no,no,no,no,no
…

Fig. 1. Data set on the state of IA among students majoring in Computer Sciences, presented in the ARFF format

Cluster analysis is one of the tasks of data mining. It is a set of methods for classifying multidimensional observations or objects, based on defining the concept of distance between the objects and then grouping them (into clusters, taxons, classes). The selection of a concrete cluster analysis method depends on the purpose of classification [12]. At the same time, one does not need a priori information about the population distribution.
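To make the data format and the notion of distance between observations more concrete, here is a minimal sketch in Python. It reads only the four rows shown in Fig. 1 and counts the nominal attributes on which two respondents differ. The helper names `parse_arff` and `nominal_mismatches` are hypothetical, the parser is not WEKA's ARFF reader, and the mismatch count is an illustrative assumption rather than the distance measure used in this study.

```python
# Illustrative only: a tiny ARFF reader for the excerpt in Fig. 1.
# It is NOT WEKA's parser and handles just this simple layout.
ARFF_TEXT = """\
@relation answer_IA
@attribute age numeric
@attribute sex {female,male}
@attribute 3 {no,undefined,yes}
@attribute 4 {no,undefined,yes}
@attribute 5 {no,undefined,yes}
@attribute 6 {no,undefined,yes}
@attribute 7 {no,undefined,yes}
@attribute 8 {no,undefined,yes}
@data
18,male,yes,no,no,no,no,yes
28,male,undefined,no,no,no,no,yes
20,female,yes,yes,yes,no,no,no
22,male,yes,no,no,no,no,no
"""


def parse_arff(text):
    """Return (attributes, rows); attributes is a list of (name, kind) pairs.

    kind is the string "numeric" or a tuple of allowed nominal values.
    """
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = tuple(v.strip() for v in kind.strip("{}").split(","))
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


def nominal_mismatches(a, b, attributes):
    """Count the nominal attributes on which two observations differ."""
    return sum(1 for name, kind in attributes
               if kind != "numeric" and a[name] != b[name])


attributes, rows = parse_arff(ARFF_TEXT)
print(len(attributes), len(rows))                          # 8 4
print(nominal_mismatches(rows[0], rows[3], attributes))    # 1
```

In this excerpt the first and fourth respondents differ only in their answer to question 8, so they are closer to each other under this simple measure than the first and third respondents, who differ on four nominal attributes.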
Cluster analysis is based on the following presupposition: objects that have a certain number of similar (different) features group in one segment (cluster). The level of similarity between the objects that belong to one segment (cluster) must be higher than the level of their similarity with the objects that belong to other segments [12].

Table 1. Data structure on the state of IA among students majoring in Computer Sciences.

| Attribute | Contents/Question | Type | Statistics |
|---|---|---|---|
| age | Age of the student | Numeric | Minimum 16; Maximum 59; Mean 19.756; StdDev 6.806 |
| sex | Student’s sex | Nominal | Female 199; Male 63 |
| 3 | Can’t imagine my life without the Internet | Nominal | yes 184; undefined 39; no 39 |
| 4 | When I cannot use the Internet I feel anxiety, irritation | Nominal | yes 81; undefined 134; no 47 |
| 5 | I like “surfing” the Net without a clearly defined purpose | Nominal | yes 121; undefined 112; no 29 |
| 6 | I can abstain from food, sleep, going to classes, if I have a chance to use the Internet for free | Nominal | yes 248; undefined 7; no 7 |
| 7 | I prefer meeting new people over the Internet rather than in real life | Nominal | yes 185; undefined 37; no 40 |
| 8 | I often feel that I have not spent enough time playing computer games over the Internet, I constantly wish to play longer | Nominal | yes 178; undefined 61; no 23 |

Let us look at one of the cluster analysis algorithms [12]. The initial data matrix is

\[
X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}.
\]

Let us move to the matrix of standardized values Z with elements

\[
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\]

where j = 1, 2, …, n is the index number and i = 1, 2, …, m is the observation number;

\[
\bar{x}_j = \frac{1}{m}\sum_{i=1}^{m} x_{ij}; \qquad
s_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_{ij} - \bar{x}_j\right)^2 = \overline{x_j^2} - \left(\bar{x}_j\right)^2 .
\]

There are several ways to define the distance between two observations z_i and z_v:

1. weighted Euclidean distance, which is determined by the formula

\[
\rho(z_i, z_v) = \sqrt{\sum_{l=1}^{n} w_l \left(z_{il} - z_{vl}\right)^2},
\]

where w_l is the “weight” of index l; 0