<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An empirical comparison of machine learning clustering methods in the study of Internet addiction among students majoring in Computer Sciences</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Higher Education of the NAES of Ukraine</institution>
          ,
          <addr-line>9, Bastionna Str., Kyiv, 01014</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vinnytsia Mykhailo Kotsiubynskyi State Pedagogical University</institution>
          ,
          <addr-line>32, Ostrozhskogo Str., Vinnytsia, 21100</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>One relevant line of study in machine learning is the analysis of how different methods apply to a specific problem. We study this issue using the example of methods for solving the clustering problem. A considerable number of learning algorithms can currently be used for clustering; however, not all of them are suitable for a specific task. The article describes a technique for the empirical comparison of clustering methods using WEKA, a free machine learning software package. The comparison was based on the results of a survey conducted among students majoring in Computer Sciences and dedicated to detecting signs of Internet addiction (IA), a behavioural disorder that occurs due to Internet misuse. The empirical comparison of the Expectation Maximization, Farthest First and K-Means clustering algorithms in the WEKA machine learning system had the following results. It described the peculiarities of applying these methods to feature clustering. The authors developed clustering models of data instances to detect signs of Internet addiction among students majoring in Computer Sciences. The study concludes that these methods may be applicable to the development of models that detect respondent groups with signs of IA related disorders.</p>
      </abstract>
      <kwd-group>
        <kwd>Empirical Comparison</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Clustering</kwd>
        <kwd>Internet addiction (IA)</kwd>
        <kwd>IA detection</kwd>
        <kwd>Internet disorders</kwd>
        <kwd>Expectation Maximization</kwd>
        <kwd>Farthest First</kwd>
        <kwd>K-Means</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>One research direction in machine learning is the empirical analysis of methods for
solving a specific problem. Let us study this issue using the example of methods for solving
the clustering problem. Clustering methods are statistical methods of data analysis that
group a given selection of data samples into clusters (classes, taxa)
depending on the values of their attributes; each of these groups has certain
characteristics. The main idea is to use several clustering methods in order to carry out
an empirical comparison and determine which methods give the best
data grouping for a specific problem.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).</p>
      <p>Machine learning treats clustering as an unsupervised
learning problem. Currently, there is a considerable number of machine learning algorithms that
can be used for clustering, for instance, Expectation Maximization, K-Means,
Hierarchical Clustering etc. However, not all of them are suitable for solving a specific
problem. Data clustering algorithms differ in the cluster model type, the algorithm
model type, the nesting hierarchy of clusters, the way of implementation depending on
the data set etc. Because of this, there are also certain requirements for the data set
parameters.</p>
      <p>Popular software products used in machine learning include TensorFlow, WEKA,
MATLAB, MXNet, Torch, PyTorch, Microsoft Azure Machine Learning Studio and
others.</p>
      <p>
        In this article, we use the WEKA (Waikato Environment for Knowledge Analysis)
free machine learning software [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The WEKA system gives
direct access to its library of algorithms implemented in Java.
      </p>
      <p>Analysis of contemporary studies and publications shows that the analysis
and selection of the machine learning method that is optimal for processing
a concrete data set is a popular topic in scientific circles. A considerable number of these
studies are dedicated to the application of machine learning methods in the fields of
healthcare and life safety.</p>
      <p>
        In their article A Performance Comparison of Machine Learning Classification
Approaches for Robust Activity of Daily Living Recognition scientists Rida Ghafoor
Hussain, Mustansar Ali Ghazanfar, Muhammad Awais Azam, Usman Naeem and
Shafiq Ur Rehman studied the application of the machine learning classification
methods to find ways to ensure independent daily living of people who have
Alzheimer’s disease [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The idea of the study is to analyze the data registered by
different equipment in order to determine the changes in a person’s behavior that are
relevant for the daily life and social interaction. The paper gives a comparison of the
efficiency levels of five machine learning classification techniques used for the
recognition of a person’s activity (and his/her psychological status). Experimental
findings show that compared to traditional methodologies, these approaches give better
results in determining the activity of the person and his/her psychological and
behavioral peculiarities.
      </p>
      <p>
        Jonas Krämer, Jonas Schreyögg and Reinhard Busse studied the speed and efficiency
of medical aid provision using the databases of the Hospital ER [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Applying the
Random forest algorithm, the authors developed a model based on data about the
patient’s provisional diagnosis. The use of a supervised machine learning method and
model training based on the opinion of a specialized doctor allowed them to achieve
high forecasting accuracy (96%) and a high area under the receiver operating characteristic curve
(&gt;0.99).
      </p>
      <p>
        Abdulhamit Subasi, Jasmin Kevric and M. Abdullah Canbaz developed a hybrid
model for detecting epileptic seizures using the Genetic Algorithm (GA) and Particle Swarm
Optimization (PSO) to determine the optimal parameters of the Support
Vector Machine (SVM) algorithm [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The hybrid algorithm that they suggested can
demonstrate data set classification accuracy of up to 99.38%.
      </p>
      <p>
        A considerable number of papers have appeared that are dedicated to diagnosing
Internet addiction (IA) and studying the mechanisms of this disorder among various
social groups. The appearance and use of the Internet has many benefits. At
the same time, however, disorders related to pathological use of the Internet are becoming a social
as well as a psychological problem. We currently face an important psychological,
sociocultural and educational issue: the detection and prevention of certain pathologies
and steady premorbid conditions (the state before the disease) caused by inadequate
Internet use. Cases of IA were first mentioned in 1995 and attracted considerable
attention. Related issues became the research subject of many scientists,
including Lyudmyla Yuryeva and Tatyana Bolbot [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Marharyta Derhach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
others. Internet Addiction Disorder (IAD) is also called Pathological Internet Use
(PIU). The term “Internet Addiction” was first suggested by Ivan K. Goldberg in 1995.
He describes net addiction as a specific pathology characterized by a wide spectrum of
behavioral and impulse control disorders (lack of control, absence of voluntary
regulation) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In 1996 Goldberg made the first attempt to determine groups of
behavioural and psychological signs and symptoms of IA [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], namely: tolerance;
abstinence syndrome; difficulties in voluntary regulation of Internet-behaviour;
increase of time and financial investments in things related to Internet or computer use;
a shift of a person’s interests towards Internet-related activities; extensive Internet use
that leads to maladjustment. In 1998 Kimberly S. Young defined IAD as an
impulsive-compulsive disorder, which has specific signs or addictions [20; 21]: cyber-sexual
addiction, cyber-relationship addiction, net compulsions, information overload and
computer addiction. IAD is not officially included in ICD-11 for Mortality and
Morbidity Statistics (Version: 04/2019); however, section 6C51 describes
“Gaming disorder” as a “pattern of persistent or recurrent gaming
behaviour (‘digital gaming’ or ‘video-gaming’), which may be online (i.e., over the
Internet)” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Even though the problem of IA is becoming more and more relevant, there are not
enough scientific papers dedicated to the study of this issue with the help of machine
learning methods. Let us look at some of them. Using the Support Vector
Machine algorithm, including C-SVM and ν-SVM, and applying Student’s
t-test to the data set of a survey conducted among 2,397 Chinese students,
Zonglin Di, Xiaoliang Gong, Jingyu Shi, Hosameldin O. A. Ahmed and Asoke K.
Nandi demonstrated the utility of machine learning methods for detecting and
forecasting the risk of IA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Wen-Huai Hsieh, Dong-Her Shih, Po-Yuan Shih and
Shih-Bin Lin suggested using the EMBAR protected system of web-services based on
the ensemble classification methods and case-based reasoning to study the IA of the
users and prevent the development of this disorder at the initial stages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Hong-Ming
Ji, Liang-Yu Chen and Tzu-Chien Hsiao are currently continuing their research, which
aims to create an IA detector that would work in a real-time mode [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The authors
suggest studying this issue using an adapted system of continuous real-coded variables
(XCSR), which determines the level of Internet addiction (high-risk and low-risk) on
the basis of the information about the Internet users using the Chen Internet addiction
scale (CIAS) or respiratory instantaneous frequency (IF) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Thus, based on the problem statement above and taking into
consideration the insufficient amount of research on the application of machine learning
methods to IA diagnosis, we define the aim of our research: to conduct
an empirical comparison of clustering methods within the WEKA machine learning
system in the course of studying the IA disorder among students majoring in Computer
Sciences.</p>
      <p>2 Selection of methods and diagnostics</p>
      <p>Data regarding the spread and severity of IA among students majoring in Computer
Sciences were obtained from an online survey, which used a questionnaire created with
Google Forms. 263 students majoring in Computer Sciences and coming
from different oblasts of Ukraine participated in the experimental study. The data set is
presented in the ARFF format and consists of 8 attributes (Fig. 1). The data set contains
the fields described in Table 1.</p>
      <p>
        Cluster analysis is one of the tasks of data mining. It is a set of
methods for classifying multidimensional observations or objects, based on defining
the concept of distance between the objects and their subsequent grouping (into
clusters, taxa, classes). The selection of a concrete cluster analysis method depends
on the purpose of classification [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. At the same time, one does not need a priori
information about the population distribution. This approach is based on the following
presuppositions: objects that have a certain number of similar (different) features group
in one segment (cluster). The level of similarity (difference) between the objects that
belong to one segment (cluster) must be higher than the level of their similarity with
the objects that belong to other segments [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
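      </p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Fields of the data set</p>
        </caption>
        <table>
          <thead>
            <tr><th>Attribute</th><th>Type</th><th>Statistics</th></tr>
          </thead>
          <tbody>
            <tr><td>age</td><td>Numeric</td><td>Minimum 16; Maximum 59; Mean 19.756; StdDev 6.806</td></tr>
            <tr><td>sex</td><td>Nominal</td><td>Female 199; Male 63</td></tr>
            <tr><td>3</td><td>Nominal</td><td>yes 184; undefined 39; no 39</td></tr>
            <tr><td>4</td><td>Nominal</td><td>yes 81; undefined 134; no 47</td></tr>
            <tr><td>5</td><td>Nominal</td><td>yes 121; undefined 112; no 29</td></tr>
            <tr><td>6</td><td>Nominal</td><td>yes 248; undefined 7; no 7</td></tr>
            <tr><td>7</td><td>Nominal</td><td>yes 185; undefined 37; no 40</td></tr>
            <tr><td>8</td><td>Nominal</td><td>yes 178; undefined 61; no 23</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-2">
      <title>Let us look at one of the cluster analysis algorithms [12]. Initial data matrix:</title>
      <p>
        <disp-formula><tex-math><![CDATA[X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}.]]></tex-math></disp-formula>
      </p>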
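      <p>Let us move to the matrix Z of standardized values with elements
        <disp-formula><tex-math>z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},</tex-math></disp-formula>
where j = 1, 2, …, n is the index number, i = 1, 2, …, m is the observation number, and
        <inline-formula><tex-math>\bar{x}_j</tex-math></inline-formula> and <inline-formula><tex-math>s_j</tex-math></inline-formula>
are the mean and the standard deviation of the j-th index.</p>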
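      <p>There are several ways to define the distance between two observations
        <inline-formula><tex-math>z_i</tex-math></inline-formula> and <inline-formula><tex-math>z_v</tex-math></inline-formula>:</p>
      <p>1. Weighted Euclidean distance, which is determined by the formula
        <disp-formula><tex-math>\rho(z_i, z_v) = \sqrt{\sum_{l=1}^{n} w_l (z_{il} - z_{vl})^2},</tex-math></disp-formula>
where <inline-formula><tex-math>w_l</tex-math></inline-formula> is the “weight” of the l-th index,
        <inline-formula><tex-math>0 &lt; w_l \le 1</tex-math></inline-formula>; if
        <inline-formula><tex-math>w_l = 1</tex-math></inline-formula> for all l = 1, 2, …, n, then we get the
usual Euclidean distance
        <disp-formula><tex-math>\rho(z_i, z_v) = \sqrt{\sum_{l=1}^{n} (z_{il} - z_{vl})^2};</tex-math></disp-formula></p>
      <p>2. Hamming distance:
        <disp-formula><tex-math>\rho_{BH}(z_i, z_v) = \sum_{l=1}^{n} |z_{il} - z_{vl}|;</tex-math></disp-formula>
in most cases this way of measuring distance gives the same result as the usual
Euclidean distance, but the influence of isolated large differences
(outliers) decreases;</p>
      <p>3. Chebyshev distance:
        <disp-formula><tex-math>\rho_{BCH}(z_i, z_v) = \max_{1 \le l \le n} |z_{il} - z_{vl}|;</tex-math></disp-formula>
it is best to apply this distance when two objects are to be treated as different
if they differ in any single dimension;</p>
      <p>4. Mahalanobis distance:
        <disp-formula><tex-math>\rho_{BM}(z_i, z_v) = \sqrt{(z_i - z_v)^{T} S^{-1} (z_i - z_v)},</tex-math></disp-formula>
where S is the covariance matrix; this distance measure gives good results when
applied to a concrete data group, but it does not work very well if the covariance matrix
is calculated for the whole data set;</p>
      <p>5. Distance between peaks:
        <disp-formula><tex-math>\rho_{BL}(z_i, z_v) = \sum_{l=1}^{n} \frac{|z_{il} - z_{vl}|}{z_{il} + z_{vl}};</tex-math></disp-formula>
it presupposes independence of the random variables, which indicates the distance in the
orthogonal space.</p>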
      <p>It is best to choose from the above described distance measures after the
consideration of the structure and characteristics of the data sample.</p>
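      <p>The distance measures listed above can be illustrated with a minimal Python sketch. The function names and the two sample observations are assumptions made for illustration and are not taken from the survey data; the Mahalanobis distance is omitted because it additionally requires the inverse of the covariance matrix S.</p>
      <preformat>
```python
import math

def weighted_euclidean(zi, zv, w=None):
    """Weighted Euclidean distance; w=None means all weights equal 1."""
    if w is None:
        w = [1.0] * len(zi)
    return math.sqrt(sum(wl * (a - b) ** 2 for wl, a, b in zip(w, zi, zv)))

def hamming(zi, zv):
    """Sum of absolute coordinate differences (called Hamming distance in the text)."""
    return sum(abs(a - b) for a, b in zip(zi, zv))

def chebyshev(zi, zv):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(zi, zv))

def peak_distance(zi, zv):
    """Sum of |a - b| / (a + b); assumes that no pair sums to zero."""
    return sum(abs(a - b) / (a + b) for a, b in zip(zi, zv))

# Two hypothetical observations with n = 3 indices each
zi = [1.0, 2.0, 3.0]
zv = [4.0, 6.0, 3.0]
distances = {
    "euclidean": weighted_euclidean(zi, zv),
    "hamming": hamming(zi, zv),
    "chebyshev": chebyshev(zi, zv),
    "peaks": peak_distance(zi, zv),
}
print(distances)
```
      </preformat>
      <p>For these two observations the measures differ noticeably (5.0, 7.0, 4.0 and about 1.1 respectively), which is why the choice of measure should follow the structure of the data.</p>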
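      <p>Let us present the obtained measurements in the form of a distance matrix:
        <disp-formula><tex-math><![CDATA[R = \begin{pmatrix} 0 & \rho_{12} & \cdots & \rho_{1m} \\ \rho_{21} & 0 & \cdots & \rho_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{m1} & \rho_{m2} & \cdots & 0 \end{pmatrix}.]]></tex-math></disp-formula>
As the matrix R is symmetric, i.e. <inline-formula><tex-math>\rho_{iv} = \rho_{vi}</tex-math></inline-formula>, we may confine ourselves to the
matrix elements above the main diagonal.</p>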
      <p>Using the distance matrix, we can implement the agglomerative hierarchical procedure
of cluster analysis. The distance between clusters is determined by either the nearest or the
farthest elements. In the first case, the distance between two clusters is the distance between
their closest elements; in the second case, it is the distance between their two
farthest elements. Agglomerative hierarchical procedures work
by sequentially grouping elements, starting from the ones closest to each other
and moving on to those that are farther and farther apart. At the first step of the algorithm, every
observation z_i (i = 1, 2, ..., m) is viewed as a separate cluster. Then, at every subsequent
step, the two closest clusters are merged and
the distance matrix is rebuilt, with its dimension decreased by one. The
algorithm stops when all the observations have been grouped into one cluster.</p>
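      <p>The agglomerative procedure described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: the function names, the sample points and the n_clusters stopping parameter (which lets the merging stop before a single cluster remains) are assumptions, and distances are recomputed at every step instead of updating a stored distance matrix.</p>
      <preformat>
```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points, farthest=False):
    """Distance between two clusters: closest pair, or farthest pair if farthest=True."""
    pair_distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return max(pair_distances) if farthest else min(pair_distances)

def agglomerate(points, n_clusters=1, farthest=False):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every observation starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, farthest)
                if best is None or best[0] > d:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]                          # the number of clusters drops by one
    return [sorted(c) for c in clusters]

# Hypothetical two-dimensional observations
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, n_clusters=3)
print(result)
```
      </preformat>
      <p>With the nearest-element rule the two tight pairs are merged first and the isolated point stays in a cluster of its own; passing farthest=True switches to the farthest-element rule.</p>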
      <p>
        Let us look at the algorithms we used to cluster the data set on the state
of IA disorder among students majoring in Computer Sciences:
1. EM (Expectation Maximization)
Determines, for every object, a probability distribution that indicates its
membership in each cluster. EM can be used for Maximum Likelihood Estimation
(MLE) or Maximum a Posteriori (MAP) estimation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The algorithm is described in
Fig. 2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: at the E-stage (Expectation) we calculate the expected likelihood; at the
M-stage (Maximization) we calculate the maximum likelihood estimate, which increases the
expected likelihood calculated at the E-stage; its value is used for the E-stage of the
next iteration. The algorithm is repeated until convergence.
      </p>
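      <p>The alternation of the E-stage and the M-stage can be illustrated with a minimal Python sketch for a one-dimensional mixture of Gaussian components. It is not the WEKA implementation: the function names, the fixed number of iterations, the initial means and the sample ages are assumptions chosen for illustration only.</p>
      <preformat>
```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_1d(data, means, iterations=50):
    """EM for a 1-D Gaussian mixture; means holds the initial component means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-stage: membership probability of every object in every cluster
        resp = []
        for x in data:
            p = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(p)
            resp.append([pc / total for pc in p])
        # M-stage: re-estimate the parameters that maximize the expected likelihood
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, 1e-6)  # guard against a collapsing component
    return means, variances, weights, resp

ages = [16, 17, 17, 18, 18, 19, 34, 35, 36, 37]  # illustrative values, not the survey data
means, variances, weights, resp = em_1d(ages, means=[17.0, 30.0])
labels = [r.index(max(r)) for r in resp]  # most probable cluster of every object
print([round(m, 2) for m in means], labels)
```
      </preformat>
      <p>Unlike a hard assignment, each row of resp keeps the full probability distribution over clusters; the labels are obtained only at the end by taking the most probable cluster.</p>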
    </sec>
    <sec id="sec-3">
      <title>2. K-Means algorithm</title>
      <p>
        Aims to partition n observations into k clusters in such a way that each observation
belongs to the cluster with the nearest mean value (centroid). The partition is found
by minimizing the sum of the squared distances between the observations and
their nearest centroids [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (Fig. 3).
      </p>
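      <p>A minimal Python sketch of this idea is given below. It assumes the standard iterative scheme (assign every observation to the nearest centroid, then recompute the centroids) with initial centroids supplied by the caller; the function names and sample points are illustrative assumptions, and WEKA's SimpleKMeans may differ in initialization and other details.</p>
      <preformat>
```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, max_iterations=100):
    """Iterate assignment and update steps starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # assignment step: every observation joins the cluster with the nearest mean
        labels = [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # update step: every centroid becomes the mean of its members
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # the centroids no longer move
        centroids = updated
    # sum of squared distances to the nearest centroid (the minimized quantity)
    sse = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, sse

# Hypothetical two-dimensional observations
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids, sse = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids, round(sse, 3))
```
      </preformat>
      <p>The returned sse value is the quantity the algorithm tries to minimize; comparing it across runs with different initial centroids is a common way to choose among several solutions.</p>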
    </sec>
    <sec id="sec-4">
      <title>3. Farthest First algorithm</title>
      <p>
        This is a modification of the K-Means algorithm, in which the number of initial centroids
is 2 or more. Centroids are determined following the remoteness principle, i.e. each
new centroid is the point farthest from the centroids already selected. The Farthest First algorithm is described in
Fig. 4 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
To cluster data using the WEKA platform, we will use the weka.clusterers.EM,
weka.clusterers.SimpleKMeans and weka.clusterers.FarthestFirst algorithms [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
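      <p>The remoteness principle can be sketched in Python as a farthest-first traversal. The sketch assumes that the first centre is simply the observation with index 0 (an implementation may choose it at random) and that every observation is finally assigned to its nearest centre; the function names and sample points are illustrative.</p>
      <preformat>
```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k, first=0):
    """Pick k centres: start at index first, then repeatedly take the point
    farthest from the centres chosen so far; finally assign points to centres."""
    centres = [first]
    # distance from every point to its nearest chosen centre
    nearest = [euclidean(p, points[first]) for p in points]
    while k > len(centres):
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centres.append(nxt)
        nearest = [min(d, euclidean(p, points[nxt])) for d, p in zip(nearest, points)]
    labels = [min(range(k), key=lambda c: euclidean(p, points[centres[c]])) for p in points]
    return centres, labels

# Hypothetical two-dimensional observations
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.0, 8.0)]
centres, labels = farthest_first(points, k=3)
print(centres, labels)
```
      </preformat>
      <p>Because the centres are chosen in a single pass and are not recomputed afterwards, the procedure is fast, but outlying observations are likely to be selected as centres.</p>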
      <p>We examine clustering algorithms that belong to two classes
of clustering algorithms, i.e. distribution-based (Expectation Maximization) and
centroid-based (K-Means, Farthest First). This selection is motivated by the fact that these
algorithms have long been used to cluster different types of data in many fields and are
considered to be effective.</p>
      <p>Dunn, DB, SD, CDbw and S_Dbw were selected as validity indices for testing [2;
15; 16] (Table 2). In the CDbw index, the distance from a point to a set, which is needed
when selecting cluster elements, can be calculated in different ways. In this study,
we calculate this distance for each cluster element as the sum of its distances to the already existing
“representatives” of the cluster. The element at which the maximum is
reached is selected as the next “representative” of the cluster.</p>
      <p>Validity indices cannot detect the situation in which the data set has no cluster structure.
For K-Means and Farthest First (Table 2), the numbers
of clusters selected as optimal by the majority of
indices can only nominally be regarded as a cluster structure. As the Expectation
Maximization algorithm is based on maximum likelihood
estimation, the indices calculated for this algorithm are more homogeneous.
The best structure is considered to be one with a small number of clusters that are also
compact and well separated. Judging by the results of
evaluating the clustering with the validity indices, the K-Means and
Farthest First algorithms are most likely to give worse clustering results than the
Expectation Maximization algorithm.</p>
      <p>To cluster the data, we select training/testing using the percentage split option. We use
66% of the data set for training (model building) and the remaining
34% for testing. In addition, we set the number of clusters
to 3 in the algorithm settings.</p>
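      <p>The size of the two parts can be checked with a short Python sketch. It assumes that the training size is the integer part of 66% of the 263 records and that the records are shuffled before the split; whether WEKA computes the split in exactly this way is an assumption, and the function name and seed are illustrative.</p>
      <preformat>
```python
import random

def percentage_split(instances, train_percent, seed=1):
    """Shuffle, then put the first train_percent % of the instances into the training set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    train_size = len(shuffled) * train_percent // 100  # integer part (an assumption)
    return shuffled[:train_size], shuffled[train_size:]

instances = list(range(263))  # stand-ins for the 263 survey records
train, test = percentage_split(instances, 66)
print(len(train), len(test))
```
      </preformat>
      <p>Under these assumptions the split yields 173 training and 90 test instances; the 90 test instances agree with the per-algorithm totals reported in Table 6 (for example, 71 + 12 + 7 = 90).</p>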
      <p>We obtained the following results:</p>
      <p>1. Applying the EM clustering algorithm, the
clustering model built on the training data set determined three clusters; their
characteristics are given in Table 3.</p>
      <p>Cluster 0 (63% of respondents): The average age of respondents in this cluster is 17.
The group consists predominantly of women. The characteristic feature of the
representatives of this group is that they are unable to imagine their life without the
Internet. The levels of anxiety and irritation when there is no
possibility to use the Internet vary. Opinions regarding the aimless
use of the Internet also vary. As for other attributes, disorders related to IA may be observed in
an insignificant number of respondents who belong to this cluster. The behavioural
model of the representatives of this cluster demonstrates Internet centration in the
psychic reality of a personality, which is reflected in their activity and
behavior: other life interests and everyday activities lose their
importance. These tendencies are linked to IA.</p>
      <p>Cluster 1 (13% of respondents): For the representatives of this group the average
value of the age attribute is 36 and it varies greatly. This is the oldest age group if
compared with other clusters. This group has the largest share of women.
Representatives of this group, predominantly, cannot imagine their life without the
Internet. Thus, according to the centroid values of the attributes, we may diagnose IA
related Internet centration in the psychic reality of a personality, which is accordingly
reflected in their activity and behavior; other life interests as well as significance of
everyday activities lose their importance. There are predominantly no other signs of IA
related disorders.</p>
      <p>Cluster 2 (24% of respondents): The probabilistic mean of the age attribute in
this group is 19, which is intermediate compared with the other groups.
Male respondents significantly dominate in this group. Opinions differed regarding the inability
to imagine life without the Internet; however, most
respondents believe they have this addiction. Judging by the values of attributes 4, 5, 6
and 7, the vast majority of this group’s representatives declare that they do not have
other signs of IA. However, the feeling of a lack of time spent playing computer
games over the Internet, which was confirmed by the vast majority of respondents, is a
warning signal that may signify the existence of IA related disorders. Thus, the
characteristic feature of this group is that most of its representatives have IA related
disorders such as Internet centration in the psychic reality of a personality and behavioral
impulse control disorders related to online gaming. These people are in the risk group
for developing IA related disorders.</p>
      <p>2. Applying the Farthest First algorithm, the
clustering model built on the training data set also formed three clusters;
their characteristics are given in Table 4.</p>
      <p>Cluster 0: Contains data instances of the youngest age group, whose age centroid
attribute is 16. According to the value of the sex centroid attribute, the group is made
up of mostly female data instances. The representatives of this group cannot imagine
their life without the Internet, i.e. there is obvious Internet centration in the psychic
reality of a personality. Respondents cannot clearly determine whether they feel either
anxiety or irritation if they do not have the possibility to use the Internet. Judging by
other attributes, data instances of this cluster do not have IA related disorders.</p>
      <p>Cluster 1: This cluster contains data instances of an older age group, the age attribute
centroid of which is 22. The value of the sex attribute centroid in this cluster is male. A
characteristic feature of the cluster is undecidedness regarding the vital need to use the
Internet, the prevalence of Internet relations over real-life interactions and the feeling of a lack
of time spent playing computer games over the Internet (attributes 3, 7 and 8 equal
undefined). The centroid value yes of attribute 5 shows an inclination to use the
Internet without a concrete purpose. Overall, this group has
signs of IA, i.e. behavior control disorders related to Internet use.</p>
      <p>Cluster 2: By the value of the age attribute centroid, 20, this cluster contains data
instances of the middle age group compared with the other clusters. The sex attribute
centroid in this cluster is male. The representatives of this cluster cannot imagine their
life without the Internet and feel anxiety and irritation when they do not have the
possibility to use the Internet. They are characterized by undecidedness regarding
the vital need to use the Internet; giving up other life interests and everyday activities
for the sake of free Internet use; and the prevalence of online relations over real-life interactions
(the value of attributes 5, 6 and 7 is undefined). Thus, the representatives of this cluster have
signs of IA, namely the priority significance of the Internet and behavior control disorders
related to Internet use. Compared with the other groups, they are in the risk group for
developing IA related disorders.</p>
      <p>3. Applying the K-Means algorithm, the clustering model built
on the training data set also formed three clusters; their
characteristics are presented in Table 5.</p>
      <p>Cluster 0: Contains data instances of the youngest age group, whose age attribute
centroid is about 18. According to the sex attribute centroid, mostly female data
instances are present in the group. The representatives of this group cannot clearly
determine whether they have a vital need to use the Internet. As for other indices,
respondents state the absence of signs of IA related disorders.</p>
      <p>Cluster 1: This cluster contains data instances of the older age group, whose age
attribute centroid is about 22. The value of the sex attribute centroid in this cluster is
male. Characteristic features of data instances that belong to this cluster include the
vital need to use the Internet, the feeling of a lack of time spent playing online computer
games and the systemic need to play longer. The overall characteristic of this
cluster is the presence of signs of IA, i.e. behavior control issues related to Internet use,
namely, gaming Internet addiction. Compared with the other clusters, its representatives belong to the
risk group that may develop IA related disorders.</p>
      <p>Cluster 2: By the value of age attribute centroid, which is about 21 years, compared
to other clusters, this cluster contains data instances of medium age group. The sex
attribute centroid is female. The representatives of this cluster cannot imagine their life
without the Internet. Judging by centroids of other characteristics, respondents of this
cluster do not have Internet-related disorders. Thus, the representatives of this cluster
have only IA signs associated with the utmost significance of the Internet.</p>
      <p>The cluster distribution of the test data obtained by applying the three
algorithms (Expectation Maximization, Farthest First and K-Means) with the
trained models is presented in Table 6. As can be seen from the table, the
algorithms have determined three data groups. The clusters included
71:12:7, 67:4:19 and 33:15:42 data instances respectively. There is a cluster that has
the largest number of data instances; a group that has the fewest data instances
(exceptions); and a group that includes several times more data instances than the smallest
group.</p>
      <p>Fig. 5, Fig. 6 and Fig. 7 present a graphic representation of clusters by age
characteristic of data instances, which are built using the training data set and received
in the course of implementation of the Expectation Maximization, the Farthest First and
the K-Means algorithm respectively. As we can see, the formed clusters differ from
each other by the age attribute. For instance, Cluster 0, which contains most data
instances, contains instances of respondents of a younger age, if formed through the
application of the Expectation Maximization algorithm (Fig. 5). On the other hand, the
same cluster received through the implementation of the Farthest First algorithm,
contains data instances of various age groups (Fig. 6). Also, a small number of data
instances of various age groups is present in Cluster 2, obtained by
applying the K-Means algorithm (Fig. 7). Cluster 0 and Cluster 2 formed with
the Expectation Maximization algorithm, Cluster 1 and Cluster 2 formed with
the Farthest First algorithm, and Cluster 0 and Cluster
1 formed with the K-Means algorithm contain homogeneous age groups.</p>
      <p>Fig. 6. Plot of cluster distribution applying the Farthest First algorithm depending on the age
group attribute</p>
      <p>
In the course of empirical comparison of Expectation Maximization, Farthest First and
K-Means algorithms using the WEKA machine learning system to study the signs of
IA related disorders among the students majoring in Computer Sciences, the following
conclusions have been made:
1. As a result of empirical comparison of Expectation Maximization, Farthest First and
K-Means algorithms using the WEKA machine learning system, we developed
models of data instances’ clustering to determine the signs of internet addiction
disorders among students majoring in Computer Sciences.
2. The implementation of the Expectation Maximization, the K-Means and the Farthest
First algorithms each resulted in the formation of 3 clusters. The results of clustering
demonstrate that Internet centration in the psychic reality of a personality is a
characteristic feature of the respondents that took part in the survey. This also reflects
accordingly in their activity and behavior, diminishing other life interests and the
significance of everyday activities. In addition, in the course of implementation of
the Expectation Maximization algorithm, a cluster was formed, whose
representatives have behavior control disorders, related to online gaming. These
respondents are in the risk group for developing IA related disorders.
3. Expectation Maximization, Farthest First and K-Means algorithms of data clustering
differ by their algorithm model, however, from the point of characteristic features,
they produce relatively similar clusters, thus implementing optimized clustering. At
the same time, when a data set was grouped into three clusters by implementing these
algorithms, the clusters differed by cluster model, namely, by the number of data
instances in each cluster, their structure and value of attribute centroids.
4. Judging by the evaluation results of clustering validity using the validity indices, we
can state that most likely the K-Means and Farthest First algorithms show worse
clustering results than the Expectation Maximization algorithm.
5. Data mining of the data set on IA among students majoring in Computer Sciences
using clustering methods has shown that the methods studied above may be considered
suitable for developing models that detect respondent groups with signs of IA
related disorders.
Our conclusions may help to determine the signs of IA related disorders among students
majoring in Computer Sciences, to forecast the risk of IA and to develop services
aimed at IA prevention.</p>
    </sec>
  </body>
  <back>
  </back>
</article>