=Paper= {{Paper |id=Vol-2546/paper03 |storemode=property |title=An empirical comparison of machine learning clustering methods in the study of Internet addiction among students majoring in Computer Sciences |pdfUrl=https://ceur-ws.org/Vol-2546/paper03.pdf |volume=Vol-2546 |authors=Oksana Klochko,Vasyl Fedorets }} ==An empirical comparison of machine learning clustering methods in the study of Internet addiction among students majoring in Computer Sciences== https://ceur-ws.org/Vol-2546/paper03.pdf

An empirical comparison of machine learning clustering
methods in the study of Internet addiction among
students majoring in Computer Sciences

Oksana Klochko[0000-0002-6505-9455]

Vinnytsia Mykhailo Kotsiubynskyi State Pedagogical University,
32, Ostrozhskogo Str., Vinnytsia, 21100, Ukraine
klochkoob@gmail.com

Vasyl Fedorets[0000-0001-9936-3458]

Institute of Higher Education of the NAES of Ukraine,
9, Bastionna Str., Kyiv, 01014, Ukraine
bruney333@yahoo.com

Abstract. One of the relevant current directions of study in machine learning is
the analysis of how applicable particular methods are to a specific problem. We
study this issue using methods for solving the clustering problem as an example.
A considerable number of learning algorithms can currently be used for
clustering; however, not all of them are suitable for a specific task. The article
describes a procedure for the empirical comparison of clustering methods using
the free WEKA machine learning software. The empirical comparison was
based on the results of a survey conducted among students majoring in Computer
Sciences and dedicated to detecting signs of Internet addiction (IA), a
behavioural disorder that occurs due to Internet misuse. The empirical comparison
of the Expectation Maximization, Farthest First and K-Means clustering
algorithms in the WEKA machine learning system had the following results. It
revealed the peculiarities of applying these methods to feature clustering. The
authors developed clustering models of data instances to detect signs of Internet
addiction among students majoring in Computer Sciences. The study concludes
that these methods may be applicable to the development of models that detect
respondent groups with signs of IA-related disorders.

Keywords: Empirical Comparison, Machine Learning, Clustering, Internet
addiction (IA), IA detection, Internet disorders, Expectation Maximization,
Farthest First, K-Means.

1 Introduction

One of the research directions in machine learning is the empirical analysis of methods
for solving a specific problem. Let us study this issue using the example of methods of solving

___________________
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

the clustering problem. Clustering methods are statistical methods of data analysis that
group a given set of data samples into clusters (classes, taxa) depending on the
values of their attributes; each of these groups has certain characteristics. The main
idea is to apply several clustering methods in an empirical comparison study and
determine which of them group the data best when solving a specific problem.
Machine learning treats clustering as an unsupervised learning problem. Currently,
there is a considerable number of machine learning algorithms that can be used for
clustering, for instance, Expectation Maximization, K-Means, Hierarchical
Clustering etc. But not all of them are suitable for solving a specific problem. Data
clustering algorithms differ in the cluster model type, the algorithm model type, the
nesting hierarchy of clusters, the way of implementation depending on the data set
etc. Because of this, there are also certain requirements for the data set parameters.
Popular software products used in machine learning include TensorFlow, WEKA,
MATLAB, MXNet, Torch, PyTorch, Microsoft Azure Machine Learning Studio and
others.
In this article, we use the free machine learning software WEKA (Waikato
Environment for Knowledge Analysis) [19]. WEKA gives direct access to a library
of algorithms implemented in Java.
Analysis of contemporary studies and publications shows that the analysis and
selection of the machine learning method that is best suited to processing a concrete
data set is a popular topic in scientific circles. A considerable number of these
studies are dedicated to the application of machine learning methods in the fields of
healthcare and life safety.
In their article “A Performance Comparison of Machine Learning Classification
Approaches for Robust Activity of Daily Living Recognition”, Rida Ghafoor
Hussain, Mustansar Ali Ghazanfar, Muhammad Awais Azam, Usman Naeem and
Shafiq Ur Rehman studied the application of machine learning classification
methods to find ways to ensure independent daily living of people who have
Alzheimer’s disease [7]. The idea of the study is to analyze the data registered by
different equipment in order to determine the changes in a person’s behavior that are
relevant for the daily life and social interaction. The paper gives a comparison of the
efficiency levels of five machine learning classification techniques used for the
recognition of a person’s activity (and his/her psychological status). Experimental
findings show that compared to traditional methodologies, these approaches give better
results in determining the activity of the person and his/her psychological and
behavioral peculiarities.
Jonas Krämer, Jonas Schreyögg and Reinhard Busse studied the speed and efficiency
of medical aid provision using hospital emergency room databases [13]. Applying the
Random Forest algorithm, the authors developed a model based on data about the
patient’s provisional diagnosis. The use of a supervised machine learning method and
model training based on the opinion of a specialized doctor allowed them to achieve
high forecasting accuracy (96%) as well as a high area under the receiver operating
characteristic curve (>0.99).

Abdulhamit Subasi, Jasmin Kevric and M. Abdullah Canbaz developed a hybrid
model for detecting epileptic seizures that uses the Genetic Algorithm (GA) and Particle
Swarm Optimization (PSO) to determine the optimal parameters of the Support
Vector Machine (SVM) algorithm [17]. The hybrid algorithm that they suggested can
demonstrate data set classification accuracy of up to 99.38%.
A considerable number of papers have appeared that are dedicated to diagnosing
Internet addiction (IA) and studying the mechanisms of this disorder among various
social groups. The appearance and use of the Internet has many benefits. At the same
time, however, disorders related to pathological use of the Internet are becoming a
social as well as a psychological problem. We currently face an important
psychological, sociocultural and educational issue: the detection and prevention of
certain pathologies and stable premorbid conditions (states preceding the disease)
caused by inadequate Internet use. Cases of IA were first mentioned in 1995 and
attracted considerable attention. Related issues became the research subject of many
scientists, including Lyudmyla Yuryeva and Tatyana Bolbot [10], Marharyta Derhach
[4] and others. Internet Addiction Disorder (IAD) is also called Pathological Internet
Use (PIU). The term “Internet Addiction” was first suggested by Ivan K. Goldberg in
1995. He described Internet addiction as a specific pathology characterized by a wide
spectrum of behavioral and impulse control disorders (lack of control, absence of
voluntary regulation) [1]. In 1996 Goldberg made the first attempt to determine groups
of behavioural and psychological signs and symptoms of IA [18], namely: tolerance;
abstinence syndrome; difficulties in voluntary regulation of Internet behaviour;
an increase in the time and money invested in things related to Internet or computer
use; a shift of a person’s interests towards Internet-related activities; extensive
Internet use that leads to maladjustment. In 1998 Kimberly S. Young defined IAD as an
impulsive-compulsive disorder that takes specific forms [20; 21]: cyber-sexual
addiction, cyber-relationship addiction, net compulsions, information overload and
computer addiction. IAD is not officially included in ICD-11 for Mortality and
Morbidity Statistics (Version: 04/2019); however, section 6C51 “Gaming disorder”
describes gaming disorder as a “pattern of persistent or recurrent gaming
behaviour (‘digital gaming’ or ‘video-gaming’), which may be online (i.e., over the
Internet)” [8].
Even though the problem of IA is becoming more and more relevant, few scientific
papers study it with the help of machine learning methods. Let us look at some of
them. Using the Support Vector Machine algorithm, including C-SVM and ν-SVM,
and applying Student’s t-test to the data set of a survey conducted among 2,397
Chinese students, Zonglin Di, Xiaoliang Gong, Jingyu Shi, Hosameldin O. A. Ahmed
and Asoke K. Nandi demonstrated the utility of machine learning methods for
detecting and forecasting the risk of IA [5]. Wen-Huai Hsieh, Dong-Her Shih,
Po-Yuan Shih and Shih-Bin Lin suggested using EMBAR, a protected system of web
services based on ensemble classification methods and case-based reasoning, to study
the IA of users and prevent the development of this disorder at its initial stages [6].
Hong-Ming Ji, Liang-Yu Chen and Tzu-Chien Hsiao are continuing their research,
which aims to create an IA detector that works in real time [9]. The authors
suggest studying this issue using an extended classifier system with continuous
real-coded variables (XCSR), which determines the level of Internet addiction
(high-risk or low-risk) on the basis of information about Internet users obtained with
the Chen Internet Addiction Scale (CIAS) or the respiratory instantaneous frequency
(IF) [9].
Thus, based on the problem statement presented above and taking into consideration
the insufficient amount of research on the application of machine learning methods
to diagnosing IA, we define the aim of our research: to conduct an empirical
comparison of clustering methods within the WEKA machine learning system in the
course of studying the IA disorder among students majoring in Computer Sciences.

2 Selection of methods and diagnostics

Data on the prevalence and severity of IA among students majoring in Computer
Sciences were collected through an online survey based on a questionnaire created
with Google Forms. A total of 263 students majoring in Computer Sciences from
different oblasts of Ukraine participated in the experimental study. The data set is
presented in the ARFF format and consists of 8 attributes (Fig. 1). The data set contains
the fields described in Table 1.

@relation answer_IA

@attribute age numeric
@attribute sex {female,male}
@attribute 3 {no,undefined,yes}
@attribute 4 {no,undefined,yes}
@attribute 5 {no,undefined,yes}
@attribute 6 {no,undefined,yes}
@attribute 7 {no,undefined,yes}
@attribute 8 {no,undefined,yes}

@data
18,male,yes,no,no,no,no,yes
28,male,undefined,no,no,no,no,yes
20,female,yes,yes,yes,no,no,no
22,male,yes,no,no,no,no,no
…
Fig. 1. Data set on the state of IA among students majoring in Computer Sciences, presented in
the ARFF format
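
The layout shown in Fig. 1 can be read with a few lines of code. The following is a minimal Python sketch and not the WEKA loader: the function name parse_arff is hypothetical, only the four data rows visible in Fig. 1 are included, and the parser assumes the simple comma-separated form shown there (no quoted values, sparse rows or missing-value markers).

```python
# Illustrative reader for the simple ARFF layout of Fig. 1 (not WEKA's own loader).
ARFF_TEXT = """@relation answer_IA

@attribute age numeric
@attribute sex {female,male}
@attribute 3 {no,undefined,yes}
@attribute 4 {no,undefined,yes}
@attribute 5 {no,undefined,yes}
@attribute 6 {no,undefined,yes}
@attribute 7 {no,undefined,yes}
@attribute 8 {no,undefined,yes}

@data
18,male,yes,no,no,no,no,yes
28,male,undefined,no,no,no,no,yes
20,female,yes,yes,yes,no,no,no
22,male,yes,no,no,no,no,no
"""


def parse_arff(text):
    """Return (attributes, rows).

    attributes maps each attribute name to "numeric" or to its list of
    nominal labels; rows is a list of dicts keyed by attribute name.
    """
    attributes = {}
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes.items(), values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                attributes[name] = [s.strip() for s in spec.strip("{}").split(",")]
            else:
                attributes[name] = spec.lower()
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
```

Each row becomes a dictionary keyed by attribute name, so the answers to question 3, for example, can be collected with `[r["3"] for r in rows]`.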

Cluster analysis is one of the tasks of data mining. It is a set of methods for
classifying multidimensional observations or objects, based on defining the concept
of distance between the objects and subsequently grouping them (into clusters, taxa,
classes). The selection of a concrete cluster analysis method depends on the purpose
of classification [12]. No a priori information about the population distribution is
needed. This approach is based on the following presupposition: objects that share a
certain number of similar (different) features are grouped in one segment (cluster).
The level of similarity between the objects that belong to one segment (cluster) must
be higher than the level of their similarity with the objects that belong to other
segments [12].
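
The requirement that objects within one cluster be more similar to each other than to objects of other clusters can be checked numerically. The sketch below is only an illustration under stated assumptions: the two-dimensional points and their cluster labels are hypothetical toy data (not taken from the survey), the function name mean_intra_inter is hypothetical, and plain Euclidean distance is used as the dissimilarity measure.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one pair of points shares a label and at least one
    pair does not; otherwise one of the averages is undefined.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

intra_mean, inter_mean = mean_intra_inter(points, labels)
# A grouping consistent with the presupposition has intra_mean < inter_mean.
```

A smaller within-cluster average than between-cluster average indicates that the grouping agrees with the stated presupposition; the reverse suggests that the labels do not reflect the distance structure of the data.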

Table 1. Data structure on the state of IA among students majoring in Computer Sciences.

Attribute  Contents/Question                         Type     Statistics
age        Age of the student                        Numeric  Minimum 16
                                                              Maximum 59
                                                              Mean 19.756
                                                              StdDev 6.806
sex        Student’s sex                             Nominal  Female 199
                                                              Male 63
3          Can’t imagine my life without the         Nominal  yes 184
           Internet                                           undefined 39
                                                              no 39
4          When I cannot use the Internet I feel     Nominal  yes 81
           anxiety, irritation                                undefined 134
                                                              no 47
5          I like “surfing” the Net without a        Nominal  yes 121
           clearly defined purpose                            undefined 112
                                                              no 29
6          I can abstain from food, sleep, going     Nominal  yes 248
           to classes, if I have a chance to use              undefined 7
           the Internet for free                              no 7
7          I prefer meeting new people over the      Nominal  yes 185
           Internet rather than in real life                  undefined 37
                                                              no 40
8          I often feel that I have not spent        Nominal  yes 178
           enough time playing computer games over            undefined 61
           the Internet; I constantly wish to play            no 23
           longer

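Let us look at one of the cluster analysis algorithms [12].
Initial data matrix:

\[ X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}. \]

Let us move to the matrix Z of standardized values with elements

\[ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \]

where j = 1, 2, …, n is the index number and i = 1, 2, …, m is the observation number;

\[ \bar{x}_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}; \]

\[ s_j = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (x_{ij} - \bar{x}_j)^2} = \sqrt{\overline{x_j^2} - (\bar{x}_j)^2}. \]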
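
The standardization step can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the function name standardize is hypothetical, the standard deviation is taken in its population form (division by m, which is what the identity with the mean of squares implies), the first column holds the four ages visible in Fig. 1, and the second column is a hypothetical 0/1 coding added only to give the matrix two columns.

```python
from math import sqrt


def standardize(X):
    """Column-wise z-scores of an m x n matrix given as a list of rows.

    Uses the population standard deviation (division by m). A constant
    column has zero deviation and would raise ZeroDivisionError.
    """
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    stds = [sqrt(sum((row[j] - means[j]) ** 2 for row in X) / m)
            for j in range(n)]
    Z = [[(row[j] - means[j]) / stds[j] for j in range(n)] for row in X]
    return Z, means, stds


# Column 0: ages from the rows shown in Fig. 1; column 1: hypothetical 0/1 coding.
X = [[18.0, 1.0],
     [28.0, 0.0],
     [20.0, 1.0],
     [22.0, 0.0]]

Z, means, stds = standardize(X)

# Cross-check of the second form of s_j: mean of squares minus squared mean.
mean_sq_age = sum(row[0] ** 2 for row in X) / len(X)
alt_std_age = sqrt(mean_sq_age - means[0] ** 2)
```

After standardization every column of Z has zero mean and unit (population) standard deviation, so attributes measured on different scales contribute comparably to a distance computed between observations.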

There are several ways to define the distance between two observations z_i and z_v:

1. weighted Euclidean distance, which is determined by the formula

\[ \rho(z_i, z_v) = \sqrt{\sum_{l=1}^{n} w_l (z_{il} - z_{vl})^2}, \]

where wl is the “weight” of index; 0