<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krzysztof Gumiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Miarka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article presents and compares selected learning classifier algorithms for the problem of detecting individuals prone to mental disorders within a group of university students. The data used for detection is easy to obtain: it contains no sensitive information and the amount of information is small. The algorithms compared are K Nearest Neighbors (kNN), Naive Bayes, Decision Trees and Gradient Boosting. Three metrics were used for the comparison: accuracy, recall and precision. The comparison was performed on both non-normalized and normalized data. For the non-normalized data, Gradient Boosting achieved the best accuracy (80%), followed by kNN and Naive Bayes (70% each), while Decision Trees performed worst (65%). In terms of recall, kNN and Naive Bayes performed best (100%), followed by Gradient Boosting (85.7%) and Decision Trees (71.4%). In terms of precision, Gradient Boosting performed best (85.7%), followed by Decision Trees (76.9%) and then Naive Bayes and kNN (70% each). For the normalized data, a significant increase in performance was observed for kNN and a decrease for Decision Trees.</p>
      </abstract>
      <kwd-group>
        <kwd>K Nearest Neighbors</kwd>
        <kwd>Naive Bayes</kwd>
        <kwd>Decision Trees</kwd>
        <kwd>Gradient boosting</kwd>
        <kwd>mental health</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, public awareness of mental health problems has been steadily rising. As a
result, more and more people notice that the state of society’s mental health is seriously
concerning. According to the National Institute of Mental Health, in 2021 an estimated
21 million people in the United States of America alone (6.3% of the entire population)
suffered at least one major depressive episode, and according to the World Health
Organization, 1 in every 8 people lives with a mental disorder. According to the same
data from the National Institute of Mental Health, the age group most
susceptible to mental disorders is young people aged 18 to 25, that is, mostly
university students.</p>
      <p>
        It is common knowledge that prevention is better than cure and that quick diagnosis of an
illness greatly reduces the time and effort needed for recovery, while also increasing the
chance of successful treatment. The same principles apply to mental illnesses. While a
diagnosis can be made only by a trained specialist, it is essential that a suffering person can
be recognized, so that they can get a specialist’s help later on. Such recognition can be made by
observing symptoms of an illness but, importantly, some health conditions can also be
predicted by recognizing certain characteristics common to affected patients [
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4, 5</xref>
        ], so
people showing such characteristics can be recognized and placed under a specialist’s supervision as
a preventive measure. That is the approach we explore in this paper.
      </p>
      <p>In this paper, we aim to recognize potential patients with mental health problems among
university students using learning classifier algorithms, and to compare the performance of these
algorithms in terms of accuracy, recall and precision. The student data used for
such recognition is meant to be easy to collect. The classifiers used in the comparison are
the following: K Nearest Neighbors, Naive Bayes, Decision Trees and Gradient
Boosting.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The concept behind learning classifier algorithms is to "teach" the algorithm using
data about individuals whose classification is known. In this paper, the classification is whether or not the student has
any mental disorders. Each algorithm is described in greater detail in
the subsections below.</p>
      <sec id="sec-2-1">
        <title>2.1. K Nearest Neighbors</title>
        <p>K Nearest Neighbors [6, 7, 8], commonly referred to as kNN, is one of the best-known
and simplest classification methods. It is worth noting that it does not require knowledge about
the data distribution.</p>
        <p>The classification performed by kNN starts with calculating the distance between the classified
element and each element used for training the algorithm. The distance can be calculated using
various metrics, such as the Euclidean, Chebyshev or Manhattan distance. In this paper, we
decided to use the Euclidean distance, which can be described with the following equation:
<disp-formula><label>(1)</label><tex-math>\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]</tex-math></disp-formula>
where <italic>x</italic> is the classified element, <italic>y</italic> is an element used for the algorithm’s training and <italic>n</italic> is the number
of characteristics each element has.</p>
        <p>Once the distances are calculated, the next step, voting, begins. The k elements whose distances
to the classified element are the smallest are selected. Because the class of each of these
elements is known, it is possible to count the occurrences of each class among the k nearest elements. The
classified element is assigned to the class that occurs most often.</p>
        <p>kNN might require some tuning to work optimally: the value of k for which the
algorithm performs best may vary with the elements’ characteristics or the size of
the dataset.</p>
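        <p>Algorithm 1: Pseudocode of the kNN algorithm</p>
        <p>Data: classified element <italic>x</italic>, elements used for the algorithm’s training <italic>y</italic><sub>1</sub>, <italic>y</italic><sub>2</sub>, ..., <italic>y</italic><sub><italic>n</italic></sub>, number <italic>n</italic> of elements
used for the algorithm’s training</p>
        <p>Result: prediction of <italic>x</italic>’s class</p>
        <preformat>for i = 1, i &lt;= n, i++ do
    Calculate the distance between x and y_i using Equation (1).
Sort elements y_1, y_2, ..., y_n in ascending order of their distance from x calculated in the previous step.
Check the classes of the first k elements and save the most commonly occurring class as the predicted class.
Return the predicted class.</preformat>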
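        <p>The steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names, the default value of k and the toy data are assumptions made only for illustration.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(x, train_features, train_labels, k=3):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    # Distance from x to every training element, paired with that element's class.
    distances = [
        (euclidean_distance(x, y), label)
        for y, label in zip(train_features, train_labels)
    ]
    # Sort ascending by distance and keep the k closest elements.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Voting: the most frequent class among the k neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, not taken from the dataset used in the paper).
train_features = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = [0, 0, 0, 1, 1]
prediction = knn_predict((1.1, 1.0), train_features, train_labels, k=3)
```
]]></preformat>
        <p>In this sketch, ties between classes are resolved in favor of the class encountered first among the nearest neighbors. This is a design choice of the sketch, not something the algorithm prescribes.</p>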
        <sec id="sec-2-1-1">
          <title>2.2. Naive Bayes</title>
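          <p>Naive Bayes [9, 10, 11] is a probability-based classifier. As its name implies, Naive Bayes utilizes Bayes’
rule, which is described by the following equation:
<disp-formula><label>(2)</label><tex-math>\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]</tex-math></disp-formula>
where <italic>P</italic>(<italic>A</italic>|<italic>B</italic>) is the posterior probability (the probability after the evidence is considered), <italic>P</italic>(<italic>A</italic>) is the
prior probability (the probability before the evidence is considered), <italic>P</italic>(<italic>B</italic>|<italic>A</italic>) is the likelihood (the
probability of the evidence given that the belief is true) and <italic>P</italic>(<italic>B</italic>) is the marginal probability (the
probability of the evidence regardless of the circumstances).</p>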
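          <p>In the classifier’s case, the posterior is the probability that the classified element belongs to a certain class,
while the prior is the probability that a random element from the classifier’s
training set belongs to that class. The prior probability can be calculated in the following way:
<disp-formula><label>(3)</label><tex-math>\[ P(A) = \frac{n_A}{n} \]</tex-math></disp-formula>
where <italic>A</italic> is one of the classes to which an element can belong, <italic>n</italic><sub><italic>A</italic></sub> is the number of elements
belonging to class <italic>A</italic> within the training set and <italic>n</italic> is the number of elements in the training set. The
likelihood can be calculated using various formulas.
One of the most often used, which we have also decided to use in this paper, is the probability
density function of the normal distribution, described by the following equation:
<disp-formula><label>(4)</label><tex-math>\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]</tex-math></disp-formula>
where <italic>x</italic> is the value of a characteristic of the element to be classified, <italic>μ</italic> is the mean of that characteristic within the class and <italic>σ</italic> is its standard deviation within the
class. The marginal probability is the same for every class, so it does not affect which class has the highest posterior probability.</p>
          <p>The likelihood of the whole element is calculated by multiplying the results of the Gaussian
function for every characteristic of the element.</p>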
          <p>The classifier assigns the element to the class for which the posterior probability is the highest.
Naive Bayes is called naive because it assumes that the attributes are conditionally independent
within the class, which is often not the case.</p>
          <p>It is also worth noting that the product of Gaussian probabilities might be a number that is
very close to zero, which may be problematic because of the rounding errors computers make while
working with floating-point numbers. To prevent that, the logarithms of both the prior probability and the Gaussian
probability of each characteristic can be taken. The logarithm of the posterior probability is then, up to a constant shared by all classes, the sum of
these logarithms.</p>
          <p>Algorithm 2: Pseudocode of the naive Bayes algorithm</p>
          <p>Data: classified element <italic>x</italic>, classes <italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>, <italic>A</italic><sub>3</sub>, ..., size <italic>n</italic> of the algorithm’s training set</p>
          <p>Result: prediction of <italic>x</italic>’s class</p>
          <preformat>for each class do
    Calculate the prior probability using Equation (3).
    Calculate the Gaussian probability using Equation (4), with x as the function’s argument.
    Combine the two to calculate the posterior probability according to Equation (2).
Return the class whose posterior probability is the highest.</preformat>
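          <p>A minimal Python sketch of Algorithm 2 is given below, using the logarithmic form discussed above. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function names, the small floor on the standard deviation and the toy data are hypothetical, and the marginal probability is omitted because it is identical for every class.</p>
          <preformat><![CDATA[
```python
import math


def gaussian_log_pdf(x, mean, std):
    """Logarithm of the normal probability density at x."""
    return -math.log(std * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * std ** 2)


def fit_gaussian_nb(features, labels):
    """Estimate the prior and the per-characteristic mean and std of each class."""
    model = {}
    n = len(labels)
    for cls in set(labels):
        rows = [f for f, lab in zip(features, labels) if lab == cls]
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            variance = sum((v - mean) ** 2 for v in column) / len(column)
            # Assumption: a tiny floor avoids division by zero for constant columns.
            stats.append((mean, max(math.sqrt(variance), 1e-9)))
        # Prior probability: share of training elements that belong to the class.
        model[cls] = (len(rows) / n, stats)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior (up to a shared constant)."""
    scores = {}
    for cls, (prior, stats) in model.items():
        # log prior + sum of log Gaussian probabilities, one per characteristic.
        scores[cls] = math.log(prior) + sum(
            gaussian_log_pdf(xi, mean, std) for xi, (mean, std) in zip(x, stats)
        )
    return max(scores, key=scores.get)


# Toy data (hypothetical, not taken from the dataset used in the paper).
features = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 5.1), (3.8, 4.9)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(features, labels)
```
]]></preformat>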
        </sec>
        <sec id="sec-2-1-2">
          <title>2.3. Decision trees</title>
          <p>Decision trees [12] are classification algorithms that work by creating a tree-like structure. The
starting point of such a tree is called the root: it is the first if condition that is checked for
the classified element. Next, depending on whether the condition is fulfilled, the element "travels" further
down the branches, that is, subsequent if conditions are checked, until it reaches one of the leaves, which are the
classes to which the element can be assigned. Every node other than the root has exactly one
incoming path, while the root has none. There are usually two (sometimes more, depending on the
implementation) paths going out of the root and of each branch, and none going out of the leaves,
which is why the leaves are called terminal nodes.</p>
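          <p>For an optimal choice of the if conditions, the Gini coefficient might be used. The Gini coefficient,
also known as the Gini index, is a measure of statistical dispersion. It can be calculated with the
following equation:
<disp-formula><label>(5)</label><tex-math>\[ G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n \sum_{i=1}^{n} x_i} \]</tex-math></disp-formula>
where <italic>x</italic><sub><italic>i</italic></sub>, <italic>i</italic> = 1, 2, ..., <italic>n</italic>, is a set of observations sorted in ascending order and <italic>n</italic> is the number of said observations.</p>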
          <p>A node is considered a leaf when its Gini coefficient is equal to 0 or below a certain threshold.
Building a tree using this coefficient begins by calculating the Gini index for each
characteristic and choosing the characteristic with the lowest index as the root. The process is repeated until every
path ends in a terminal node.</p>
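          <p>As an illustration of the Gini coefficient described above, the sketch below computes it for a list of observations. It assumes the common closed form for sorted, non-negative observations; the function name and the convention of returning 0 for empty or all-zero input are assumptions of this sketch and may differ from the implementation used in the experiments.</p>
          <preformat><![CDATA[
```python
def gini_coefficient(observations):
    """Gini coefficient of non-negative observations (0 means no dispersion)."""
    values = sorted(observations)  # the closed form requires ascending order
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        # Assumption: treat empty or all-zero input as having no dispersion.
        return 0.0
    # Each sorted observation x_i is weighted by (2i - n - 1), with i starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


equal_values = gini_coefficient([1, 1, 1, 1])     # no dispersion
concentrated = gini_coefficient([0, 0, 0, 1])     # everything in one observation
```
]]></preformat>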
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.4. Gradient boosting</title>
        <p>The last of the classifiers used in the comparison is the gradient boosting machine (GBM for
short) [13, 14, 15, 16]. The main idea behind GBMs is to construct new base learners that are maximally
correlated with the negative gradient of the loss function. The choice of loss function is up to the
researcher and should be made with regard to the problem being solved. The loss function
indicates how good the model is at making predictions.</p>
        <p>GBMs combine an ensemble of weak learning models, iteratively learning from each of them in
order to create a strong learning model, hence the name boosting.</p>
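        <p>The creation of the model can be described with the following equation:
<disp-formula><label>(6)</label><tex-math>\[ \hat{f}(x) = \arg\min_{f(x)} \Psi(y, f(x)) \]</tex-math></disp-formula>
where <inline-formula><tex-math>$\hat{f}$</tex-math></inline-formula> is the estimate function (model), <italic>y</italic> is the observed value, <italic>f</italic>(<italic>x</italic>) is the predicted value and
<italic>Ψ</italic> is the loss function.</p>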
        <p>It can be summarized as looking for an estimate for which the loss function is the smallest. The
GBM used for comparison in this paper uses regression trees as the "weak" models on which it
iteratively learns.</p>
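        <p>The iterative idea can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not taken from the experiments: the loss is the squared error (so the negative gradient is simply the residual), the weak models are one-split regression trees (stumps) on a single characteristic, and the function names, learning rate and toy data are hypothetical.</p>
        <preformat><![CDATA[
```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) that minimizes squared error."""
    best = None
    # Candidate thresholds: every distinct value except the largest one,
    # so that both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost regression stumps under squared loss."""
    base = sum(ys) / len(ys)  # initial constant model
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - f(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def gbm_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data (hypothetical): class 0 for small x, class 1 for large x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gbm(xs, ys)
```
]]></preformat>
        <p>For a two-class problem such as the one considered here, the numeric output of this sketch could be turned into a class by thresholding at 0.5. Library implementations typically use a loss function designed for classification instead.</p>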
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>This section discusses the steps of building the models and analyzes the different
algorithms using charts that compare them on different metrics.</p>
      <sec id="sec-3-1">
        <title>3.1. Database description</title>
        <p>The database used in this paper was taken from kaggle.com, a popular website
that hosts many public datasets. The data come from a survey carried out in
July 2020. In total, there are 101 records, which were self-reported by students. Initially, there are 11
characteristics: Timestamp (date of the survey), Gender, Age, Course, Year of study,
CGPA (average grade), Marital status (married or not married), Do you have depression?, Do you
have Anxiety?, Do you have Panic attack?, Did you seek any specialist for a treatment?.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data cleaning</title>
        <p>One of the most important tasks in building a machine learning model is to provide it
with valuable data. This requires an appropriate database, which was
obtained in the previous step. The database then has to be transformed into the format that best fits the
machine learning model. To achieve this, some columns need to be dropped (for example
Timestamp, which is not needed in this case). To avoid redundancy, the table can be transformed by
merging the last four questions into one column, "Problems". The algorithms
discussed in the section above prefer numeric data, so all the data are transformed into
numeric values. Finally, the data provided to the model cannot contain any
null values, so all records with null values are deleted.</p>
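        <p>The cleaning steps described above can be sketched as follows. This is only an illustration: it assumes that each record is a Python dictionary keyed by the characteristic names listed in the database description, that missing values are represented by None and that positive answers are stored as the string "Yes". The conversion of the remaining text columns to numeric values is omitted for brevity.</p>
        <preformat><![CDATA[
```python
# Column names follow the list in the database description; their exact
# spelling in the source file is an assumption of this sketch.
QUESTION_COLUMNS = [
    "Do you have depression?",
    "Do you have Anxiety?",
    "Do you have Panic attack?",
    "Did you seek any specialist for a treatment?",
]


def clean_records(records):
    """Drop Timestamp, remove incomplete rows, merge the questions into Problems."""
    cleaned = []
    for record in records:
        kept = {key: value for key, value in record.items() if key != "Timestamp"}
        # Records with any missing value are deleted.
        if any(value is None for value in kept.values()):
            continue
        row = {k: v for k, v in kept.items() if k not in QUESTION_COLUMNS}
        # "Problems" holds the number of questions answered "Yes".
        row["Problems"] = sum(1 for c in QUESTION_COLUMNS if kept[c] == "Yes")
        cleaned.append(row)
    return cleaned


# Two hypothetical records; the second one has a missing Age and is dropped.
records = [
    {"Timestamp": "2020-07-08", "Age": 18,
     "Do you have depression?": "Yes", "Do you have Anxiety?": "No",
     "Do you have Panic attack?": "Yes",
     "Did you seek any specialist for a treatment?": "No"},
    {"Timestamp": "2020-07-08", "Age": None,
     "Do you have depression?": "No", "Do you have Anxiety?": "No",
     "Do you have Panic attack?": "No",
     "Did you seek any specialist for a treatment?": "No"},
]
cleaned = clean_records(records)
```
]]></preformat>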
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Division description</title>
        <p>Once the data are in the final format, model training can begin, but first some further data
preparation is needed. First of all, the model needs a "target" variable. In our case,
this is the column named "Problems", which contains the sum of problems (if two or more of the
columns merged into "Problems" contained a "yes", the person is considered to have mental health
problems). The model also needs "features", the columns from which the model
calculates its output, that is, the "target". The features are all columns except the "Problems"
column, so in this case there are six of them. One more step in the division of the data is essential for the machine learning process:
apart from training the model, it is also important to be able to assess its
quality. To do this, part of the data must be set aside for testing, while the rest
is used for training. The most common division is 80 percent for training and 20 percent for
testing, which is also used in this paper. After this division, the data are
passed to the models, which are trained using the algorithms described in
Section 2.</p>
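        <p>A minimal sketch of the target definition and the 80/20 division is shown below. The function names, the fixed random seed and the synthetic data are assumptions made for illustration; the paper does not state how the records were shuffled before the division.</p>
        <preformat><![CDATA[
```python
import random


def to_target(problem_count):
    """1 if two or more of the merged questions were answered "yes", else 0."""
    return 1 if problem_count >= 2 else 0


def train_test_split(features, targets, test_fraction=0.2, seed=0):
    """Shuffle the records and set aside a fraction of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Synthetic stand-in for the cleaned data (hypothetical values).
features = [[i] for i in range(100)]
targets = [to_target(i % 4) for i in range(100)]
x_train, x_test, y_train, y_test = train_test_split(features, targets)
```
]]></preformat>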
        <sec id="sec-3-3-1">
          <title>3.4. Analysis of models</title>
          <p>The models are compared using three different metrics: accuracy, recall and precision.
Accuracy is the most popular metric and shows how often the classification made by an ML model is
correct overall. Precision measures how often the model is correct when it predicts the target
class. Recall shows whether the model can find all objects of the target class.</p>
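          <p>
<disp-formula><label>(7)</label><tex-math>\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]</tex-math></disp-formula>
<disp-formula><label>(8)</label><tex-math>\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]</tex-math></disp-formula>
<disp-formula><label>(9)</label><tex-math>\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]</tex-math></disp-formula>
where <italic>TP</italic> (True Positives) are cases correctly predicted as positive, <italic>TN</italic> (True Negatives) are cases
correctly predicted as negative, <italic>FN</italic> (False Negatives) are positive cases incorrectly predicted as
negative, and <italic>FP</italic> (False Positives) are negative cases incorrectly predicted as positive.</p>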
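          <p>The three metrics can be computed directly from the four counts defined above. The following sketch uses hypothetical labels chosen only for illustration; the function names and the convention of returning 0 when a denominator is zero are assumptions of this sketch.</p>
          <preformat><![CDATA[
```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, true negatives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Assumption: report 0.0 when the model predicts no positives at all.
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    # Assumption: report 0.0 when there are no positive cases at all.
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical true labels and predictions (not results from the paper).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```
]]></preformat>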
          <p>
            Data normalization is one way to improve model performance. Looking at the kNN
algorithm, and above all at its distance equation, it is worth noting that a characteristic with very large values can dominate the calculated
distance and thus have a disproportionate impact on model performance, which can lead to errors. To avoid operating on such large numbers,
normalization is used. In our case, the data are transformed to fit within the range [0, 1],
which is a very typical range for this type of problem. To obtain this, the following formula is
used:
<disp-formula><label>(10)</label><tex-math>\[ \mathit{df}[\mathit{col}] = \frac{\mathit{df}[\mathit{col}] - \mathit{min\_value}}{\mathit{max\_value} - \mathit{min\_value}} \]</tex-math></disp-formula>
where <italic>df</italic>[<italic>col</italic>] denotes the values in a specific column of the data frame <italic>df</italic>, <italic>min_value</italic> is the minimum value in the column and
<italic>max_value</italic> is the maximum value in the column.
          </p>
          <p>Below, it is shown how kNN performs before and after data
normalization, for the number of neighbors ranging from 3 to 20.</p>
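          <p>The normalization of a single column to the range [0, 1] can be sketched as follows. The function name and the handling of a constant column (mapped to zeros to avoid division by zero) are assumptions of this sketch, and the example values are hypothetical.</p>
          <preformat><![CDATA[
```python
def min_max_normalize(column):
    """Rescale the values of one column so that they fall within [0, 1]."""
    min_value = min(column)
    max_value = max(column)
    span = max_value - min_value
    if span == 0:
        # Assumption: a constant column carries no information, map it to zeros.
        return [0.0 for _ in column]
    return [(value - min_value) / span for value in column]


# Hypothetical ages; the smallest maps to 0 and the largest to 1.
ages = [18, 21, 24]
normalized_ages = min_max_normalize(ages)
```
]]></preformat>
          <p>After this transformation, every characteristic contributes to the Euclidean distance on the same scale, which is why kNN is the algorithm most affected by normalization.</p>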
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Comparing models</title>
        <p>When comparing different models with each other, it is important to observe how they perform on
different metrics. The four algorithms described in Section 2 are compared in the charts below on three
metrics: accuracy, recall and precision. The first row of Fig. 3 shows how the models perform
(assessed by accuracy, recall and precision) on normalized data, that is, data
transformed using formula (10). The second row of Fig. 3 shows how the models
perform on non-normalized data.</p>
        <p>Fig. 3, first row: (a) Accuracy, (b) Recall, (c) Precision; second row: (d) Accuracy, (e) Recall, (f) Precision.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>It is worth noting how much the model that used the kNN algorithm improved when
provided with normalized data. Accuracy increased by approximately 20 percent and recall by
approximately 10 percent. For the other algorithms, the data should rather not be normalized,
because after normalization these models achieve the same or even worse results.
Overall, Gradient Boosting performs best, which is to be expected given its
complexity compared with the other algorithms. Considering only the "simple" algorithms,
that is, those that we implemented ourselves, kNN would be the best choice because of its
satisfactory performance (better than the other algorithms, considering all metrics). kNN performs
best with four neighbors, which can be seen in the first and second charts. The
experiments could be extended in the future by using neural networks, which may
improve the results, or by providing the models with much larger amounts of data, which would require
large-scale surveys among university students.</p>
      <p>[2] K. Morita, S. Karashima, T. Terao, K. Yoshida, T. Yamashita, T. Yoroidaka, M. Tanabe,
T. Imi, Y. Zaimoku, A. Yoshida, et al., 3d cnn-based deep learning model-based explanatory
prognostication in patients with multiple myeloma using whole-body mri, Journal of
Medical Systems 48 (2024) 1–11.</p>
      <p>[3] D. Połap, M. Woźniak, R. Damaševičius, R. Maskeliūnas, Bio-inspired voice
evaluation mechanism, Applied Soft Computing 80 (2019) 342–357.</p>
      <p>[4] L. A. Shoaib, S. H. Safii, N. Idris, R. Hussin, M. A. H. Sazali, Utilizing decision tree
machine model to map dental students’ preferred learning styles with suitable instructional
strategies, BMC Medical Education 24 (2024) 58.</p>
      <p>[5] D. Połap, G. Srivastava, Neural image reconstruction using a heuristic validation
mechanism, Neural Computing and Applications 33 (2021) 10787–10797.</p>
      <p>[6] G. Guo, H. Wang, D. Bell, Y. Bi, K. Greer, Knn model-based approach in classification,
in: On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE:
OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania,
Sicily, Italy, November 3-7, 2003. Proceedings, Springer, 2003, pp. 986–996.</p>
      <p>[7] K. Prokop, Grey wolf optimizer combined with k-nn algorithm for clustering problem,
IVUS 2022: 27th International Conference on Information Technology (2022).</p>
      <p>[8] A. Jaszcza, Reducing the number of calculations in k-nn by class representatives atb voting,
Scholar’s Yearly Symposium of Technology, Engineering and Mathematics (SYSTEM 2021)
(2021).</p>
      <p>[9] G. I. Webb, E. Keogh, R. Miikkulainen, Naïve bayes, Encyclopedia of machine learning 15
(2010) 713–714.</p>
      <p>[10] A. Salazar, L. Vergara, E. Vidal, A proxy learning curve for the bayes classifier, Pattern
Recognition 136 (2023) 109240.</p>
      <p>[11] K. F. Sotiropoulou, A. P. Vavatsikos, P. N. Botsaris, A hybrid ahp-promethee ii onshore
wind farms multicriteria suitability analysis using knn and svm regression models in
northeastern greece, Renewable Energy 221 (2024) 119795.</p>
      <p>[12] L. Rokach, O. Maimon, Decision trees, Data mining and knowledge discovery handbook
(2005) 165–192.</p>
      <p>[13] A. Natekin, A. Knoll, Gradient boosting machines, a tutorial, Frontiers in neurorobotics 7
(2013) 21.</p>
      <p>[14] Z. Mei, T. Zhao, X. Xie, Hierarchical fuzzy regression tree: A new gradient boosting
approach to design a tsk fuzzy model, Information Sciences 652 (2024) 119740.</p>
      <p>[15] S. Deng, J. Su, Y. Zhu, Y. Yu, C. Xiao, Forecasting carbon price trends based on an
interpretable light gradient boosting machine and bayesian optimization, Expert Systems
with Applications 242 (2024) 122502.</p>
      <p>[16] A. M. Mohammed, E. Onieva, M. Woźniak, Selective ensemble of classifiers trained on
selective samples, Neurocomputing 482 (2022) 197–211.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pleszczyński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zielonka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Application of nature-inspired algorithms to computed tomography with incomplete data</article-title>
          ,
          <source>Symmetry</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>2256</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>