<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title/>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis of selected algorithms for the classification of space objects*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Radosław Jędrzejczyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katarzyna Kłeczek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
<p>Along with the rise of available astronomical data, captured at numerous facilities around the world, a need for faster and more sophisticated data analysis methods emerges. Data captured during observations of large numbers of objects in the sky can reach large volumes very quickly, making it impossible for scientists to analyse by hand. This raises the need for fast and reliable automated methods of data processing, which can be found in computer science research. Leveraging algorithms used in other areas of research is crucial for processing information about celestial bodies. In this work, we apply machine learning methods from the computer science domain to an astronomy problem. We lay out three different machine learning algorithms, along with their inner workings, and show how they can be applied to astronomy problems. We show how these algorithms can be used to speed up the processing of large volumes of data and how they can help scientists classify celestial bodies. We investigate how each algorithm performs and try to find the best-performing one for the problem of classifying different objects based on their characteristics.</p>
      </abstract>
      <kwd-group>
        <kwd>knn</kwd>
        <kwd>naive bayes</kwd>
        <kwd>decision trees</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>† These authors contributed equally.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
<p>In the beginning, we need to transform our data into a convenient form. Non-numerical
data will simply be mapped to numbers by associating a separate number with each
value, while numerical data will be rescaled using min-max normalization.</p>
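Both preprocessing steps can be sketched in a few lines of plain Python (function and variable names are ours, chosen for illustration; the paper's pipeline uses Pandas/Sklearn for the same operations):

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def encode_labels(labels):
    """Map each distinct non-numerical value to a separate integer."""
    mapping = {lab: i for i, lab in enumerate(sorted(set(labels)))}
    return [mapping[lab] for lab in labels], mapping

# e.g. encode the class column and rescale one photometric column
encoded, mapping = encode_labels(["GALAXY", "QSO", "STAR", "GALAXY"])
scaled = min_max_normalize([10.0, 15.0, 20.0])
```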
<p>We will compare the performance of different algorithms on the task of classifying stellar
objects. For the comparison, we have chosen:
• KNN (K-Nearest Neighbors) classification.
• Decision tree model.
• Naive Bayes.</p>
<p>Mathematical Model for K-Nearest Neighbors (K-NN)
If we assume we have a training dataset consisting of n data points:
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)},</p>
      <p>where x_i is the feature vector for the i-th point, and y_i is the class label (for classification) or value (for
regression).</p>
      <p>Then we can calculate a distance metric, typically the Euclidean distance d(x_i, x_j) between
two points x_i and x_j, defined as:
d(x_i, x_j) = sqrt( sum_{k=1}^{m} (x_{ik} - x_{jk})^2 ),
where x_i and x_j are feature vectors of dimension m.</p>
      <p>To classify a new point x_q, we compute the distances between x_q and all points in the training
set, then select the k nearest neighbours and assign a class label based on the majority.</p>
      <p>The parameter k is a crucial hyperparameter in the KNN algorithm. A small k can lead to
overfitting, while a large k can lead to underfitting. The optimal value of k is often selected
using cross-validation methods.</p>
<p>Algorithm 1: KNN Algorithm</p>
      <p>Data: Training data X_train, training classes y_train, test data X_test, algorithm constant k</p>
      <p>Result: Predictions
1 for each x in X_test do
2   d ← distance between x and each point in X_train;
3   idx ← indexes of the k closest neighbours;
4   C ← classes of the closest neighbours;
5   c ← dominating label in C;
6   Add c to prediction list;
7 Create data structure with predictions, by choosing indexes of the test data;
return Predictions</p>
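Algorithm 1 can be sketched in plain Python (the experiments themselves use Sklearn's implementation; this toy version only illustrates the steps, and the names are ours):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors of equal dimension
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(x_train, y_train, x_new, k):
    # line 2: distance between x_new and each training point
    order = sorted(range(len(x_train)), key=lambda i: euclidean(x_train[i], x_new))
    # lines 3-4: classes of the k closest neighbours
    neighbour_classes = [y_train[i] for i in order[:k]]
    # line 5: dominating (majority) label
    return Counter(neighbour_classes).most_common(1)[0][0]
```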
      <sec id="sec-2-1">
        <title>Mathematical Model for Decision Tree</title>
<p>Assume we have a training dataset consisting of n data points:
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)},
where x_i is the feature vector for the i-th point, and y_i is the class label (for classification) or value (for
regression). Then a decision tree is a tree-like model where internal nodes represent a test on a
feature, branches represent outcomes of those tests and leaf nodes represent class labels.</p>
        <p>To build a decision tree, we recursively split the data at each node. The choice of split is
based on a criterion that maximizes the separation of the classes or reduces the prediction error.
Common criteria include:</p>
<p>Gini Index:
Gini(D) = 1 - sum_{c=1}^{C} p_c^2,
where p_c is the proportion of instances of class c in the dataset D.</p>
        <p>Information Gain:
IG(D, A) = H(D) - sum_{v in Values(A)} (|D_v| / |D|) H(D_v),
where H(D) is given by:
H(D) = - sum_{c=1}^{C} p_c log2(p_c),
and D_v is the subset of D where attribute A has value v.</p>
        <p>Mean Squared Error (MSE):
MSE(D) = (1 / |D|) sum_{i in D} (y_i - ȳ)^2,
where ȳ is the mean of the values in the dataset D.</p>
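The three splitting criteria above are straightforward to compute; a minimal illustrative sketch (function names are ours, not from any library):

```python
import math

def gini(labels):
    # Gini(D) = 1 - sum_c p_c^2
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    # H(D) = -sum_c p_c * log2(p_c), used inside information gain
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def mse(values):
    # MSE(D) = (1/|D|) * sum_i (y_i - mean)^2, for regression trees
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

A pure split (one class) gives Gini and entropy of 0, while a 50/50 split maximizes both, which is why a tree builder picks the split that lowers them most.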
      </sec>
      <sec id="sec-2-2">
        <title>Mathematical Model for Naive Bayes</title>
<p>Assume we have a training dataset consisting of n data points:
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)},</p>
        <p>where x_i = (x_{i1}, x_{i2}, . . . , x_{im}) is the feature vector for the i-th point, and y_i is the class label from a set of
classes {C_1, C_2, . . . , C_K}.</p>
<p>The Naive Bayes algorithm is based on Bayes' Theorem:
P(C_k | x) = P(x | C_k) P(C_k) / P(x),</p>
        <p>Algorithm 2: Decision Tree Algorithm</p>
        <p>Data: Training dataset D, set of attributes A, class attribute</p>
        <p>Result: Decision tree
1 begin
2   Create a root node N;
3   if all instances in D belong to the same class c then
4     Label N as leaf node with class c;
5   else
6     if A is empty then
7       Label N as leaf node with the majority class in D;
8     else
9       Select the attribute a in A that optimizes the splitting criterion;
10      For each value v of a, create a branch and recurse on the subset D_v with attributes A \ {a};</p>
<p>where: P(C_k | x) is the posterior probability of class C_k given feature vector x, P(x | C_k)
is the likelihood of feature vector x given class C_k, P(C_k) is the prior probability of class C_k,
and P(x) is the evidence or marginal likelihood of feature vector x.</p>
<p>The "naive" assumption is that the features are conditionally independent given the class
label:
P(x | C_k) = prod_{i=1}^{m} P(x_i | C_k).</p>
<p>The goal is to predict the class label ŷ for a new instance x by maximizing the posterior
probability:
ŷ = argmax_{k} P(C_k | x).</p>
        <p>Using Bayes' Theorem and the naive assumption, we can write:
ŷ = argmax_{k} P(C_k) prod_{i=1}^{m} P(x_i | C_k).</p>
<p>The probabilities P(C_k) and P(x_i | C_k) need to be estimated from the training data. The prior
probability of class C_k is estimated as:
P(C_k) = n_k / n,</p>
        <p>where n_k is the number of instances in class C_k. For continuous features, a common approach is to
assume a Gaussian distribution:
P(x_i | C_k) = (1 / sqrt(2 π σ_{ik}^2)) exp( -(x_i - μ_{ik})^2 / (2 σ_{ik}^2) ),
where μ_{ik} and σ_{ik}^2 are the mean and variance of the feature x_i for class C_k.</p>
<p>Algorithm 3: Naive Bayes Algorithm</p>
        <p>Data: Training dataset D, class attribute</p>
        <p>Result: Classifier model
1 begin
2   for each class C_k in D do
3     Calculate prior probability P(C_k);
4     for each attribute x_i do
5       Calculate conditional probability P(x_i | C_k);
6 return Classifier model;</p>
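A compact sketch of Algorithm 3 for continuous features under the Gaussian assumption. We work in log-probabilities for numerical stability, and the variance floor is our own guard, not part of the paper; names are illustrative:

```python
import math

def fit_gaussian_nb(X, y):
    # estimate per-class priors and per-feature Gaussian parameters (mu, var)
    model = {}
    n = len(y)
    for c in set(y):
        Xc = [x for x, lab in zip(X, y) if lab == c]
        prior = len(Xc) / n                      # P(C_k) = n_k / n
        stats = []
        for j in range(len(X[0])):
            col = [row[j] for row in Xc]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def nb_predict(model, x):
    # argmax_k of log P(C_k) + sum_i log P(x_i | C_k)
    def log_gauss(v, mu, var):
        var = max(var, 1e-9)  # guard against zero variance (our own addition)
        return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
    scores = {
        c: math.log(prior) + sum(log_gauss(v, mu, var)
                                 for v, (mu, var) in zip(x, stats))
        for c, (prior, stats) in model.items()
    }
    return max(scores, key=scores.get)
```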
<p>
          Additionally, we will look for the best number of neighbours for the KNN classifier. We will
use a few libraries to handle our operations: Sklearn [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] will provide us with algorithm
implementations, saving us a lot of time and ensuring we will be able to go through relatively big
databases in reasonable time. Pandas [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] will provide us with a data structure (DataFrame). Seaborn
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Matplotlib [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] will be used for visualizations, graphs, etc.
        </p>
<p>In order to find the best constant for KNN, we will run classification in a simple loop,
looking for the best solution. Generally speaking, as this number increases our accuracy
should decrease, therefore this approach is reasonable and should not take too much time.</p>
        <p>Algorithm 4: Loop for finding the best constant for KNN</p>
<p>Data: Training data X_train, training classes y_train, test data X_test</p>
        <p>Result: Best constant
1 begin
2   Feed KNN algorithm with X_train and y_train data;
3   Set KNN constant as 1;
4   while KNN constant is lower than significant number do
5     Classify X_test using KNN;
6     Check accuracy a and add it to the list A;
7     Increase KNN constant by 1;</p>
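Algorithm 4 amounts to a simple search over k; a generic sketch, where the `classify` argument stands in for any single-point KNN routine (e.g. one backed by Sklearn) and all names are ours:

```python
def best_knn_constant(x_train, y_train, x_test, y_test, k_max, classify):
    # classify(x_train, y_train, x, k) -> predicted label for a single point x
    accuracies = []
    for k in range(1, k_max + 1):
        preds = [classify(x_train, y_train, x, k) for x in x_test]
        acc = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
        accuracies.append(acc)
    # ties resolve to the smallest k, which is fine for a monotone trend
    best_k = max(range(k_max), key=accuracies.__getitem__) + 1
    return best_k, accuracies
```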
<p>In the end, we present the confusion matrix for each of our solutions, and we will consider
only two metrics:
• Accuracy (Equation 15) - to measure how many correct classifications we get.
• False categorization - in order to check if any of the classes are more often confused with
others.</p>
        <p>Accuracy = Correct classifications / All classifications. (15)</p>
        <p>Figure 1: (a) Correlation matrix for SDSS-IV data. (b) Final correlation matrix for SDSS-IV data.</p>
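Both metrics can be read off a confusion matrix; a minimal sketch (Equation 15 is the diagonal sum over the total, and the off-diagonal cells show false categorizations):

```python
def confusion_matrix(y_true, y_pred, n_classes):
    # m[t][p] counts instances of true class t predicted as class p
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def accuracy(m):
    # Equation 15: correct classifications / all classifications
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total
```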
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        For our dataset, we have chosen data from Sloan Digital Sky Survey DR17 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (it was accessed
from [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). This was the fourth phase of the Sloan Digital Sky Survey (we will call it SDSS-IV
from now on). It contains 100000 observations, each containing (quoting [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]):
• obj_ID = Object Identifier, the unique value that identifies the object in the image catalogue
used by the CAS
• alpha = Right Ascension angle (at J2000 epoch)
• delta = Declination angle (at J2000 epoch)
• u = Ultraviolet filter in the photometric system
• g = Green filter in the photometric system
• r = Red filter in the photometric system
• i = Near Infrared filter in the photometric system
• z = Infrared filter in the photometric system
• run_ID = Run Number used to identify the specific scan
• rerun_ID = Rerun Number to specify how the image was processed
• cam_col = Camera column to identify the scanline within the run
• field_ID = Field number to identify each field
• spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different
observations with the same spec_obj_ID must share the output class)
• class = object class (galaxy, star or quasar object)
• redshift = redshift value based on the increase in wavelength
• plate = plate ID, identifies each plate in SDSS
• MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
• fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each
observation
      </p>
      <p>Figure 2: (a) Histograms for all normalized data, excluding redshift. (b) Histogram for normalised redshift.
(c) Number of different objects (0 - galaxies, 1 - QSOs, 2 - stars).</p>
<p>Some of that information will not be used for our classification, as it is contained in
SDSS-IV for cataloguing purposes (such as object identifiers). We will focus on: coordinates
alpha and delta; data from filtered channels u, g, r, i and z; and class, which is the aim of our
classification efforts.</p>
<p>After mapping and normalising, our data in the ultraviolet, green and infrared channels presented a strange
pattern, where basically all data is accumulated near the value 1.0. Upon further inspection it turned out
that one of the observed objects had some abnormal values (equal to -9999), so we
removed it from our dataset and then proceeded.</p>
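This cleanup is a one-line Pandas filter; a sketch on a toy frame with hypothetical column names matching the channels in the text (-9999 is the sentinel for the abnormal values):

```python
import pandas as pd

# toy frame standing in for the SDSS-IV photometric columns
df = pd.DataFrame({"u": [19.5, -9999.0, 18.2],
                   "g": [17.1, 16.9, 15.8],
                   "z": [16.0, 15.5, 14.9]})

# keep only rows where no filter channel holds the sentinel value
mask = (df[["u", "g", "z"]] != -9999.0).all(axis=1)
clean = df[mask].reset_index(drop=True)
```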
      <p>
        Now we will have a look at the correlation matrix (figure 1a) and address some of the relations:
• Coordinates have neutral relations with all the other data.
• Ultraviolet and green relation - green light is part of the spectrum of many stars similar
to the Sun (G-type main-sequence stars). Those stars also happen to emit a significant
part of their radiation as ultraviolet. An additional effect, which can also explain the moderate
relation with infrared and near-infrared light, is absorption of different wavelengths
by interstellar gas, which then re-emits in those wavelengths (heat radiation) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
• Infrared, near-infrared and red data have a strong relation - red stars are typically colder, but they
still emit a lot of infrared radiation. An additional factor - absorption and re-emission of
light - was mentioned above.
• The moderate relation of red, near-infrared and infrared light with redshift can be explained by many
objects detected as red having their colour shifted due to phenomena such as the Doppler effect. This
relation might be absent from other detectors, as light from stars outside the infrared might
have been cut off by stardust or shifted strongly enough to not be detected at all [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
<p>In general, it is easy to notice strong relations between red and infrared light. This phenomenon
might be related to extinction of light in space, which is more pronounced for shorter wavelengths. The
coordinates of our objects are mostly related to each other (but it is still a very weak relation). They
also have a purely neutral relation with most of the data from detectors, therefore we are going to
drop them. Our final correlation matrix is shown for the sake of clarity in figure 1b.</p>
<p>Additionally, we provide histograms for the SDSS-IV data; we plot them on one
histogram, excluding redshift, which is shown separately for clarity (figures 2a and 2b).
Looking at the number of each of the individual object types in our data (figure 2c),
we can notice a significant dominance of galaxies. Quasars and stars are similar in number,
with a small margin in favour of stars.</p>
<p>We will split our data into a train and test set with a test ratio of 0.2. After running the calculations
mentioned in the chapter before, we get:</p>
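With Sklearn the split and the three classifiers fit in a few lines; a sketch on synthetic stand-in data (the real features would come from the cleaned SDSS-IV DataFrame, and the reported accuracies below are from the paper, not this toy run):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# two well-separated synthetic clusters standing in for the SDSS-IV features
X = [[float(i)] for i in range(40)] + [[float(i + 100)] for i in range(40)]
y = [0] * 40 + [1] * 40

# test_size=0.2 matches the split ratio used in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scores = {}
for model in (KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(random_state=42),
              GaussianNB()):
    scores[type(model).__name__] = model.fit(X_train, y_train).score(X_test, y_test)
```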
<p>Figure 4: (a) Confusion matrix for KNN. (b) Confusion matrix for ID3. (c) Confusion matrix for Naive Bayes.</p>
<p>• For KNN we get 96.465% accuracy, which was best for a number of neighbours equal to 3,
as shown in figure 3, with the confusion matrix as in figure 4a.
• The decision tree achieved 96.78% accuracy (confusion matrix in figure 4b).
• Naive Bayes achieved the lowest accuracy of 92.11% (confusion matrix in figure 4c).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
<p>The behaviour of KNN accuracy was, as expected, decreasing with relation to its constant. On the
other hand, all the analysed algorithms achieved good accuracy (above 90%). The Bayes algorithm
turned out to have some problems distinguishing between galaxies and quasars (almost 1000
wrongly classified galaxies), although the two other algorithms also struggled there. KNN seems
to deal with this problem best, recognising ever so slightly more QSO objects than the others,
but it has more mismatches, recognising some of the galaxies as stars. None of the algorithms
had any problems recognising stars and rarely ever mismatched them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vaccari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prescott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grobler</surname>
          </string-name>
          ,
          <article-title>Cnn architecture comparison for radio galaxy classification</article-title>
          ,
          <source>Monthly Notices of the Royal Astronomical Society</source>
          <volume>503</volume>
          (
          <year>2021</year>
          )
          <fpage>1828</fpage>
          -
          <lpage>1846</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cuoco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morawski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicolaou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lahav</surname>
          </string-name>
          ,
          <article-title>Lstm and cnn application for core-collapse supernova search in gravitational wave real data</article-title>
          ,
          <source>Astronomy &amp; Astrophysics</source>
          <volume>669</volume>
          (
          <year>2023</year>
          )
          <article-title>A42</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y. Z. Yanxia</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Astronomy in the big data era</article-title>
          ,
          <source>Data Science Journal</source>
          (
          <year>2015</year>
          ). doi:10.5334/dsj-2015-011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Gulliver</surname>
          </string-name>
          ,
          <article-title>Performance analysis and prediction for mobile internet-of-things (iot) networks: a cnn approach</article-title>
          ,
          <source>IEEE Internet of Things Journal</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>13355</fpage>
          -
          <lpage>13366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Szczotka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sikora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zielonka</surname>
          </string-name>
          ,
          <article-title>Fuzzy logic type-2 intelligent moisture control system</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>238</volume>
          (
          <year>2024</year>
          )
          <fpage>121581</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kęsik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Winnicka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Strengthening the perception of the virtual worlds in a virtual reality environment</article-title>
          ,
          <source>ISA transactions 102</source>
          (
          <year>2020</year>
          )
          <fpage>397</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <article-title>Soft trees with neural components as image-processing technique for archeological excavations</article-title>
          ,
          <source>Personal and Ubiquitous Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>363</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Wickramasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kalutarage</surname>
          </string-name>
          ,
          <article-title>Naive bayes: applications, variations and vulnerabilities: a review of literature with code snippets for implementation</article-title>
          ,
          <source>Soft Computing</source>
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>2277</fpage>
          -
          <lpage>2293</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ukey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Survey on exact knn queries over high-dimensional data space</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>629</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>Package of scikit-learn</article-title>
          , https://scikit-learn.org/stable/,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          Pandas library, https://pandas.pydata.org/,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          Seaborn library, https://seaborn.pydata.org/,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          Matplotlib library, https://matplotlib.org/,
          <year>2023</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>Original source of data release 17 from sloan digital sky survey</article-title>
          , https://www.sdss4.org/dr17/,
          <year>2022</year>
          . Accessed: 2024-05-18.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>Source of our data at kaggle.com</article-title>
          , https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17,
          <year>2022</year>
          . Accessed: 2024-03-29.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Fedesoriano</surname>
          </string-name>
          ,
          <source>Stellar classification dataset - sdss17</source>
          ,
          <year>2022</year>
          . Retrieved May 18,
          <year>2024</year>
          , from https://www.kaggle.com/fedesoriano/stellar-classification-dataset-sdss17.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Article about infrared imaging</article-title>
          , https://www.skyatnightmagazine.com/space-science/infrared-astronomy,
          <year>2024</year>
          . Accessed: 2024-05-18.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>