=Paper= {{Paper |id=Vol-2546/paper03 |storemode=property |title=An empirical comparison of machine learning clustering methods in the study of Internet addiction among students majoring in Computer Sciences |pdfUrl=https://ceur-ws.org/Vol-2546/paper03.pdf |volume=Vol-2546 |authors=Oksana Klochko,Vasyl Fedorets }} ==An empirical comparison of machine learning clustering methods in the study of Internet addiction among students majoring in Computer Sciences== https://ceur-ws.org/Vol-2546/paper03.pdf


 An empirical comparison of machine learning clustering
    methods in the study of Internet addiction among
        students majoring in Computer Sciences

                              Oksana Klochko[0000-0002-6505-9455]

                Vinnytsia Mykhailo Kotsiubynskyi State Pedagogical University,
                     32, Ostrozhskogo Str., Vinnytsia, 21100, Ukraine
                               klochkoob@gmail.com

                               Vasyl Fedorets[0000-0001-9936-3458]

                     Institute of Higher Education of the NAES of Ukraine,
                            9, Bastionna Str., Kyiv, 01014, Ukraine
                                  bruney333@yahoo.com



       Abstract. One current direction of research in machine learning is the
       analysis of how particular methods behave when applied to a specific
       problem. We study this issue using methods for solving the clustering
       problem as an example. A considerable number of learning algorithms can
       be used for clustering, but not every method suits a given task. The
       article describes a procedure for the empirical comparison of clustering
       methods using the free WEKA machine learning software. The comparison
       is based on the results of a survey conducted among students majoring in
       Computer Sciences and aimed at detecting signs of Internet addiction (IA),
       a behavioural disorder that occurs due to Internet misuse. The empirical
       comparison of the Expectation Maximization, Farthest First and K-Means
       clustering algorithms in the WEKA machine learning system had the
       following results. It identified the peculiarities of applying these
       methods to feature clustering. The authors developed clustering models
       of data instances to detect signs of Internet addiction among students
       majoring in Computer Sciences. The study concludes that these methods may
       be applicable to the development of models that detect respondent groups
       with signs of IA-related disorders.

       Keywords: Empirical Comparison, Machine Learning, Clustering, Internet
       addiction (IA), IA detection, Internet disorders, Expectation Maximization,
       Farthest First, K-Means.


1      Introduction

One of the research directions in machine learning is the empirical analysis of methods
of solving a specific problem. Let us study this issue using the example of methods of solving

___________________
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).


the clustering problem. Clustering methods are statistical methods of data analysis that
make it possible to group a given selection of data samples into clusters (classes, taxa)
depending on the values of their attributes; each of these groups has certain
characteristics. The main idea is to apply several clustering methods in an empirical
comparison study and determine which of them give the best data grouping for a
specific problem.
   In machine learning, clustering is treated as an unsupervised learning problem.
Currently, there is a considerable number of machine learning algorithms that
can be used for clustering, for instance, Expectation Maximization, K-Means,
Hierarchical Clustering etc. But not all of them are suitable for solving a specific
problem. Data clustering algorithms differ in the cluster model type, the algorithm
model type, the nesting hierarchy of clusters, the way of implementation depending on
the data set etc. Because of this, there are also certain requirements for the data set
parameters.
   Popular software products used in machine learning include TensorFlow, WEKA,
MATLAB, MXNet, Torch, PyTorch, Microsoft Azure Machine Learning Studio and
others.
   In this article, we use the WEKA (Waikato Environment for Knowledge Analysis)
free machine learning software [19]. The free WEKA machine learning system gives
direct access to the library of implemented algorithms written in Java.
   Analysis of contemporary studies and publications shows that the analysis
and selection of the machine learning method that is best suited to processing
a specific data set is widely discussed in the scientific community. A considerable
number of these studies are dedicated to the application of machine learning methods
in the fields of healthcare and life safety.
   In their article “A Performance Comparison of Machine Learning Classification
Approaches for Robust Activity of Daily Living Recognition”, Rida Ghafoor
Hussain, Mustansar Ali Ghazanfar, Muhammad Awais Azam, Usman Naeem and
Shafiq Ur Rehman studied the application of machine learning classification
methods to support the independent daily living of people who have
Alzheimer’s disease [7]. The idea of the study is to analyze the data registered by
different equipment in order to determine the changes in a person’s behavior that are
relevant for daily life and social interaction. The paper compares the
efficiency of five machine learning classification techniques used for the
recognition of a person’s activity (and his/her psychological status). Experimental
findings show that, compared to traditional methodologies, these approaches give better
results in determining the activity of the person and his/her psychological and
behavioral peculiarities.
   Jonas Krämer, Jonas Schreyögg and Reinhard Busse studied the speed and efficiency
of medical aid provision using hospital ER databases [13]. Applying the
Random Forest algorithm, the authors developed a model based on data about the
patient’s provisional diagnosis. The use of a supervised machine learning method and
model training based on the opinion of a specialized doctor allowed them to achieve
high forecasting accuracy (96%) and an area under the receiver operating
characteristic curve above 0.99.


    Abdulhamit Subasi, Jasmin Kevric and M. Abdullah Canbaz developed a hybrid
model for detecting epileptic seizures that uses the Genetic Algorithm (GA) and Particle
Swarm Optimization (PSO) to determine the optimal parameters of the Support
Vector Machine (SVM) algorithm [17]. The hybrid algorithm that they suggested
achieves data set classification accuracy of up to 99.38%.
    A considerable number of papers have appeared that are dedicated to diagnosing
Internet addiction (IA) and studying the mechanisms of this disorder among various
social groups. The emergence and use of the Internet has brought many benefits. At
the same time, however, disorders related to pathological use of the Internet are
becoming a social as well as a psychological problem. We now face an important
psychological, sociocultural and educational issue: the detection and prevention of
certain pathologies and persistent premorbid conditions (states preceding the disease)
caused by inappropriate Internet use. Cases of IA were first mentioned in 1995 and
attracted considerable attention. Related issues have been studied by many scientists,
including Lyudmyla Yuryeva and Tatyana Bolbot [10], Marharyta Derhach [4] and
others. Internet Addiction Disorder (IAD) is also called Pathological Internet Use
(PIU). The term “Internet Addiction” was first suggested by Ivan K. Goldberg in 1995.
He describes Internet addiction as a specific pathology characterized by a wide
spectrum of behavioural and impulse control disorders (lack of control, absence of
voluntary regulation) [1]. In 1996 Goldberg made the first attempt to determine groups of
behavioural and psychological signs and symptoms of IA [18], namely: tolerance;
abstinence syndrome; difficulties in voluntary regulation of Internet behaviour;
an increase in the time and money invested in things related to Internet or computer use;
a shift of a person’s interests towards Internet-related activities; extensive Internet use
that leads to maladjustment. In 1998 Kimberly S. Young defined IAD as an
impulsive-compulsive disorder that takes the following specific forms [20; 21]:
cyber-sexual addiction, cyber-relationship addiction, net compulsions, information
overload and computer addiction. IAD is not officially included in ICD-11 for Mortality
and Morbidity Statistics (Version: 04/2019); however, section 6C51 “Gaming disorder”
describes this disorder as a “pattern of persistent or recurrent gaming
behaviour (‘digital gaming’ or ‘video-gaming’), which may be online (i.e., over the
Internet)” [8].
    Even though the problem of IA is becoming more and more relevant, few
scientific papers study this issue with the help of machine
learning methods. Let us look at some of them. Using the Support Vector
Machine algorithm (including C-SVM and ν-SVM) and applying Student’s t-test
to the data set of a survey conducted among 2,397 Chinese students,
Zonglin Di, Xiaoliang Gong, Jingyu Shi, Hosameldin O. A. Ahmed and Asoke K.
Nandi demonstrated the utility of machine learning methods for detecting and
forecasting the risk of IA [5]. Wen-Huai Hsieh, Dong-Her Shih, Po-Yuan Shih and
Shih-Bin Lin suggested using the EMBAR protected system of web services, based on
ensemble classification methods and case-based reasoning, to study the IA of
users and prevent the development of this disorder at the initial stages [6]. Hong-Ming
Ji, Liang-Yu Chen and Tzu-Chien Hsiao are continuing their research, which
aims to create an IA detector that works in real time [9]. The authors


suggest studying this issue using an adapted system of continuous real-coded variables
(XCSR), which determines the level of Internet addiction (high-risk and low-risk) on
the basis of the information about the Internet users using the Chen Internet addiction
scale (CIAS) or respiratory instantaneous frequency (IF) [9].
   Thus, based on the problem statement presented above and taking into
consideration the insufficient amount of research on the application of machine learning
methods to diagnosing IA, we define the aim of our research: to conduct
an empirical comparison of clustering methods within the WEKA machine learning
system in the course of studying the IA disorder among students majoring in Computer
Sciences.


2       Selection of methods and diagnostics

Data on the spread and severity of IA among students majoring in Computer
Sciences were obtained from an online survey based on a questionnaire created with
Google Forms. A total of 263 students majoring in Computer Sciences from
different oblasts of Ukraine participated in the experimental study. The data set is
presented in the ARFF format and consists of 8 attributes (Fig. 1). The data set contains
the fields described in Table 1.

    @relation answer_IA

    @attribute age numeric
    @attribute sex {female,male}
    @attribute 3 {no,undefined,yes}
    @attribute 4 {no,undefined,yes}
    @attribute 5 {no,undefined,yes}
    @attribute 6 {no,undefined,yes}
    @attribute 7 {no,undefined,yes}
    @attribute 8 {no,undefined,yes}

    @data
    18,male,yes,no,no,no,no,yes
    28,male,undefined,no,no,no,no,yes
    20,female,yes,yes,yes,no,no,no
    22,male,yes,no,no,no,no,no
    …
Fig. 1. Data set on the state of IA among students majoring in Computer Sciences, presented in
                                        the ARFF format
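
As an illustration of the structure in Fig. 1, the following minimal Python sketch reads the header and the four data rows shown there. It is not part of the WEKA workflow used in the study; the function name parse_arff and the embedded text constant are assumptions made for this example, and the parser handles only simple dense ARFF files without quoted values.

```python
# Illustrative reader for the ARFF fragment of Fig. 1 (not WEKA's own loader).
ARFF_TEXT = """\
@relation answer_IA

@attribute age numeric
@attribute sex {female,male}
@attribute 3 {no,undefined,yes}
@attribute 4 {no,undefined,yes}
@attribute 5 {no,undefined,yes}
@attribute 6 {no,undefined,yes}
@attribute 7 {no,undefined,yes}
@attribute 8 {no,undefined,yes}

@data
18,male,yes,no,no,no,no,yes
28,male,undefined,no,no,no,no,yes
20,female,yes,yes,yes,no,no,no
22,male,yes,no,no,no,no,no
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple dense ARFF text."""
    relation = None
    attributes = []  # (name, kind); kind is "numeric" or a list of labels
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [label.strip() for label in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

Each row becomes a dictionary keyed by attribute name, with the numeric age converted to a float and the nominal answers kept as strings.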

Cluster analysis is one of the tasks of data mining. It is a set of
methods for classifying multidimensional observations or objects, based on defining
the concept of distance between the objects and their subsequent grouping (into
clusters, taxa, classes). The selection of a specific cluster analysis method depends


on the purpose of classification [12]. At the same time, one does not need a priori
information about the population distribution. This approach is based on the following
assumptions: objects that have a certain number of similar (different) features are
grouped in one segment (cluster), and the level of similarity between objects that
belong to one segment (cluster) must be higher than the level of their similarity with
objects that belong to other segments [12].
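
The similarity condition can be checked numerically. The sketch below is an illustration only: it takes the answers to questions 3 to 8 of the four respondents listed in Fig. 1, uses the number of differing answers as a simple distance (an assumption, not the measure used in [12]), and compares a hand-picked, hypothetical partition into two groups.

```python
from itertools import combinations

# Answers to questions 3-8 for the four respondents listed in Fig. 1.
ANSWERS = {
    "r1": ["yes", "no", "no", "no", "no", "yes"],
    "r2": ["undefined", "no", "no", "no", "no", "yes"],
    "r3": ["yes", "yes", "yes", "no", "no", "no"],
    "r4": ["yes", "no", "no", "no", "no", "no"],
}

# Hypothetical partition chosen by hand; it is not a result from the paper.
CLUSTERS = [["r1", "r2", "r4"], ["r3"]]


def mismatch_distance(a, b):
    """Number of questions on which two respondents answered differently."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mean_within(clusters):
    """Average distance over all pairs that share a cluster."""
    dists = [
        mismatch_distance(ANSWERS[a], ANSWERS[b])
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    ]
    return sum(dists) / len(dists)


def mean_between(clusters):
    """Average distance over all pairs taken from different clusters."""
    dists = [
        mismatch_distance(ANSWERS[a], ANSWERS[b])
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(dists) / len(dists)


within = mean_within(CLUSTERS)
between = mean_between(CLUSTERS)
print(within, between)
```

For this partition the average distance inside a group is smaller than the average distance between groups, which is the distance-based form of the requirement that objects within a cluster be more similar to each other than to objects in other clusters.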

     Table 1. Data structure on the state of IA among students majoring in Computer Sciences.

Attribute  Contents/Question                                     Type     Statistics
age        Age of the student                                    Numeric  Minimum 16
                                                                          Maximum 59
                                                                          Mean 19.756
                                                                          StdDev 6.806
sex        Student’s sex                                         Nominal  Female 199
                                                                          Male 63
3          Can’t imagine my life without the Internet            Nominal  yes 184
                                                                          undefined 39
                                                                          no 39
4          When I cannot use the Internet I feel anxiety,        Nominal  yes 81
           irritation                                                     undefined 134
                                                                          no 47
5          I like “surfing” the Net without a clearly defined    Nominal  yes 121
           purpose                                                        undefined 112
                                                                          no 29
6          I can abstain from food, sleep, going to classes,     Nominal  yes 248
           if I have a chance to use the Internet for free                undefined 7
                                                                          no 7
7          I prefer meeting new people over the Internet         Nominal  yes 185
           rather than in real life                                       undefined 37
                                                                          no 40
8          I often feel that I have not spent enough time        Nominal  yes 178
           playing computer games over the Internet; I                    undefined 61
           constantly wish to play longer                                 no 23

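     Let us look at one of the cluster analysis algorithms [12].
     Initial data matrix:

\[
X = \begin{pmatrix}
x_{11} & \cdots & x_{1n} \\
\vdots & \ddots & \vdots \\
x_{m1} & \cdots & x_{mn}
\end{pmatrix}.
\]

Let us move to the matrix Z of standardized values with elements

\[
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\]

where j = 1, 2, …, n is the index number and i = 1, 2, …, m is the observation number;

\[
\bar{x}_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij},
\]

\[
s_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_{ij} - \bar{x}_j)^2 = \overline{x_j^2} - (\bar{x}_j)^2.
\]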
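
The standardization step can be sketched in a few lines of Python. This is a minimal illustration, assuming the variance is normalized by the number of observations m (the reading under which the mean-of-squares identity holds); the function name standardize is hypothetical, and the sample column is the age values of the four respondents listed in Fig. 1.

```python
import math


def standardize(column):
    """Return (z_values, mean, variance) for one column of the data matrix.

    The variance is divided by m, the number of observations (an assumption
    made for this sketch).
    """
    m = len(column)
    mean = sum(column) / m
    variance = sum((x - mean) ** 2 for x in column) / m
    if variance == 0:
        raise ValueError("a constant column cannot be standardized")
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column], mean, variance


# Ages of the four respondents listed in Fig. 1.
ages = [18, 28, 20, 22]
z, mean, variance = standardize(ages)

# Check the identity: variance = mean of squares - square of the mean.
mean_of_squares = sum(x * x for x in ages) / len(ages)
print(mean, variance, mean_of_squares - mean ** 2)
print([round(value, 3) for value in z])
```

After standardization every column has zero mean and unit variance, so attributes measured on different scales contribute comparably to the distances computed afterwards.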

There are several ways to define the distance between two observations z_i and z_v:

1. weighted Euclidean distance, which is determined by the formula

\[
\rho(z_i, z_v) = \sqrt{\sum_{l=1}^{n} w_l (z_{il} - z_{vl})^2};
\]

where wl is the “weight” of index; 0