<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title/>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis of selected algorithms for the classification of space objects*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Radosław Jędrzejczyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katarzyna Kłeczek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
<p>Along with the rise of available astronomical data, captured at numerous facilities around the world, a need for faster and more sophisticated data analysis methods emerges. Data captured during observations of large numbers of objects in the sky can reach large volumes very quickly, making it impossible for scientists to analyse by hand. This raises the need for fast and reliable automated methods of data processing, which can be found in computer science research. Leveraging algorithms used in other areas of research is crucial for processing information about celestial bodies. In this work, we apply machine learning methods from the computer science domain to an astronomy problem. We lay out three different machine learning algorithms, along with their inner workings, and show how they can be applied to astronomy problems. We show how these algorithms can be used to speed up the processing of large volumes of data and how they can help scientists classify celestial bodies. We investigate how each algorithm performs and try to find the best-performing one for the problem of classifying different objects based on their characteristics.</p>
      </abstract>
      <kwd-group>
        <kwd>knn</kwd>
        <kwd>naive bayes</kwd>
        <kwd>decision trees</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>† These authors contributed equally.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
<p>In the beginning, we need to transform our data into a convenient form. Non-numerical
data will simply be mapped to numbers by associating a separate number with each
value, while numerical data will be rescaled using min-max normalization.</p>
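Both preprocessing steps can be sketched in a few lines of plain Python (function and variable names are ours, chosen for illustration; the paper's pipeline uses Pandas/Sklearn for the same operations):

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def encode_labels(labels):
    """Map each distinct non-numerical value to a separate integer."""
    mapping = {lab: i for i, lab in enumerate(sorted(set(labels)))}
    return [mapping[lab] for lab in labels], mapping

# e.g. encode the class column and rescale one photometric column
encoded, mapping = encode_labels(["GALAXY", "QSO", "STAR", "GALAXY"])
scaled = min_max_normalize([10.0, 15.0, 20.0])
```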
<p>We will compare the performance of different algorithms on the task of classifying stellar
objects. For the comparison, we have chosen:
• KNN (K-Nearest Neighbors) classification.
• Decision tree model.
• Naive Bayes.</p>
<p>Mathematical Model for K-Nearest Neighbors (K-NN)
If we assume we have a training dataset consisting of n data points:
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)},</p>
      <p>where x_i is the feature vector for the i-th point, and y_i is the class label (for classification) or value (for
regression).</p>
      <p>Then we can calculate a distance metric, typically the Euclidean distance d(x_i, x_j) between
two points x_i and x_j, defined as:
d(x_i, x_j) = sqrt( sum_{k=1}^{m} (x_{ik} - x_{jk})^2 ),
where x_i and x_j are feature vectors of dimension m.</p>
      <p>To classify a new point x_q, we compute the distances between x_q and all points in the training
set, then select the k nearest neighbours and assign a class label based on the majority.</p>
      <p>The parameter k is a crucial hyperparameter in the KNN algorithm. A small k can lead to
overfitting, while a large k can lead to underfitting. The optimal value of k is often selected
using cross-validation methods.</p>
<p>Algorithm 1: KNN Algorithm</p>
      <p>Data: Training data X_train, training classes y_train, test data X_test, algorithm constant k</p>
      <p>Result: Predictions
1 for each x in X_test do
2   d ← distance between x and each point in X_train;
3   idx ← indexes of the k closest neighbours;
4   C ← classes of the closest neighbours;
5   c ← dominating label in C;
6   Add c to prediction list;
7 Create data structure with predictions, by choosing indexes of the test data;
return Predictions</p>
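Algorithm 1 can be sketched in plain Python (the experiments themselves use Sklearn's implementation; this toy version only illustrates the steps, and the names are ours):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors of equal dimension
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(x_train, y_train, x_new, k):
    # line 2: distance between x_new and each training point
    order = sorted(range(len(x_train)), key=lambda i: euclidean(x_train[i], x_new))
    # lines 3-4: classes of the k closest neighbours
    neighbour_classes = [y_train[i] for i in order[:k]]
    # line 5: dominating (majority) label
    return Counter(neighbour_classes).most_common(1)[0][0]
```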
      <sec id="sec-2-1">
        <title>Mathematical Model for Decision Tree</title>
<p>Assume we have a training dataset consisting of n data points:
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)},
where x_i is the feature vector for the i-th point, and y_i is the class label (for classification) or value (for
regression). Then a decision tree is a tree-like model where internal nodes represent a test on a
feature, branches represent outcomes of those tests and leaf nodes represent class labels.</p>
        <p>To build a decision tree, we recursively split the data at each node. The choice of split is
based on a criterion that maximizes the separation of the classes or reduces the prediction error.
Common criteria include:</p>
<p>Gini Index:
Gini(D) = 1 - sum_{c=1}^{C} p_c^2,
where p_c is the proportion of instances of class c in the dataset D.</p>
        <p>Information Gain:
IG(D, A) = H(D) - sum_{v in Values(A)} (|D_v| / |D|) H(D_v),
where H(D) is given by:
H(D) = - sum_{c=1}^{C} p_c log2(p_c),
and D_v is the subset of D where attribute A has value v.</p>
        <p>Mean Squared Error (MSE):
MSE(D) = (1 / |D|) sum_{i in D} (y_i - ȳ)^2,
where ȳ is the mean of the values in the dataset D.</p>
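The three splitting criteria above are straightforward to compute; a minimal illustrative sketch (function names are ours, not from any library):

```python
import math

def gini(labels):
    # Gini(D) = 1 - sum_c p_c^2
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    # H(D) = -sum_c p_c * log2(p_c), used inside information gain
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def mse(values):
    # MSE(D) = (1/|D|) * sum_i (y_i - mean)^2, for regression trees
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

A pure split (one class) gives Gini and entropy of 0, while a 50/50 split maximizes both, which is why a tree builder picks the split that lowers them most.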
      </sec>
      <sec id="sec-2-2">
        <title>Mathematical Model for Naive Bayes</title>
<p>Assume we have a training dataset consisting of n data points:
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)},</p>
        <p>where x_i = (x_{i1}, x_{i2}, . . . , x_{im}) is the feature vector for the i-th point, and y_i is the class label from a set of
classes {C_1, C_2, . . . , C_K}.</p>
<p>The Naive Bayes algorithm is based on Bayes' Theorem:
P(C_k | x) = P(x | C_k) P(C_k) / P(x),</p>
        <p>Algorithm 2: Decision Tree Algorithm</p>
        <p>Data: Training dataset D, set of attributes A, class attribute</p>
        <p>Result: Decision tree
1 begin
2   Create a root node N;
3   if all instances in D belong to the same class c then
4     Label N as leaf node with class c;
5   else
6     if A is empty then
7       Label N as leaf node with the majority class in D;
8     else
9       Select the attribute a in A that optimizes the splitting criterion;
10      For each value v of a, create a branch and recurse on the subset D_v with attributes A \ {a};</p>
<p>where: P(C_k | x) is the posterior probability of class C_k given feature vector x, P(x | C_k)
is the likelihood of feature vector x given class C_k, P(C_k) is the prior probability of class C_k,
and P(x) is the evidence or marginal likelihood of feature vector x.</p>
<p>The "naive" assumption is that the features are conditionally independent given the class
label:
P(x | C_k) = prod_{i=1}^{m} P(x_i | C_k).</p>
<p>The goal is to predict the class label ŷ for a new instance x by maximizing the posterior
probability:
ŷ = argmax_{k} P(C_k | x).</p>
        <p>Using Bayes' Theorem and the naive assumption, we can write:
ŷ = argmax_{k} P(C_k) prod_{i=1}^{m} P(x_i | C_k).</p>
<p>The probabilities P(C_k) and P(x_i | C_k) need to be estimated from the training data. The prior
probability of class C_k is estimated as:
P(C_k) = n_k / n,</p>
        <p>where n_k is the number of instances in class C_k. For continuous features, a common approach is to
assume a Gaussian distribution:
P(x_i | C_k) = (1 / sqrt(2 π σ_{ik}^2)) exp( -(x_i - μ_{ik})^2 / (2 σ_{ik}^2) ),
where μ_{ik} and σ_{ik}^2 are the mean and variance of the feature x_i for class C_k.</p>
<p>Algorithm 3: Naive Bayes Algorithm</p>
        <p>Data: Training dataset D, class attribute</p>
        <p>Result: Classifier model
1 begin
2   for each class C_k in D do
3     Calculate prior probability P(C_k);
4     for each attribute x_i do
5       Calculate conditional probability P(x_i | C_k);
6 return Classifier model;</p>
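A compact sketch of Algorithm 3 for continuous features under the Gaussian assumption. We work in log-probabilities for numerical stability, and the variance floor is our own guard, not part of the paper; names are illustrative:

```python
import math

def fit_gaussian_nb(X, y):
    # estimate per-class priors and per-feature Gaussian parameters (mu, var)
    model = {}
    n = len(y)
    for c in set(y):
        Xc = [x for x, lab in zip(X, y) if lab == c]
        prior = len(Xc) / n                      # P(C_k) = n_k / n
        stats = []
        for j in range(len(X[0])):
            col = [row[j] for row in Xc]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def nb_predict(model, x):
    # argmax_k of log P(C_k) + sum_i log P(x_i | C_k)
    def log_gauss(v, mu, var):
        var = max(var, 1e-9)  # guard against zero variance (our own addition)
        return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
    scores = {
        c: math.log(prior) + sum(log_gauss(v, mu, var)
                                 for v, (mu, var) in zip(x, stats))
        for c, (prior, stats) in model.items()
    }
    return max(scores, key=scores.get)
```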
<p>
          Additionally, we will look for the best number of neighbours for the KNN classifier. We will
use a few libraries to handle our operations: Sklearn [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] will provide us with algorithm
implementations, saving us a lot of time and ensuring we will be able to go through relatively big
databases in reasonable time. Pandas [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] will provide us with a data structure (DataFrame). Seaborn
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Matplotlib [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] will be used for visualizations, graphs, etc.
        </p>
<p>In order to find the best constant for KNN, we will run classification in a simple loop,
looking for the best solution. Generally speaking, as this number increases our accuracy
should decrease, therefore this approach is reasonable and should not take too much time.</p>
        <p>Algorithm 4: Loop for finding the best constant for KNN</p>
<p>Data: Training data X_train, training classes y_train, test data X_test</p>
        <p>Result: Best constant
1 begin
2   Feed KNN algorithm with X_train and y_train data;
3   Set KNN constant as 1;
4   while KNN constant is lower than significant number do
5     Classify X_test using KNN;
6     Check accuracy a and add it to the list A;
7     Increase KNN constant by 1;</p>
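Algorithm 4 amounts to a simple search over k; a generic sketch, where the `classify` argument stands in for any single-point KNN routine (e.g. one backed by Sklearn) and all names are ours:

```python
def best_knn_constant(x_train, y_train, x_test, y_test, k_max, classify):
    # classify(x_train, y_train, x, k) -> predicted label for a single point x
    accuracies = []
    for k in range(1, k_max + 1):
        preds = [classify(x_train, y_train, x, k) for x in x_test]
        acc = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
        accuracies.append(acc)
    # ties resolve to the smallest k, which is fine for a monotone trend
    best_k = max(range(k_max), key=accuracies.__getitem__) + 1
    return best_k, accuracies
```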
<p>In the end, we present the confusion matrix for each of our solutions, and we will consider
only two metrics:
• Accuracy (Equation 15) - to measure how many correct classifications we get.
• False categorization - in order to check if any of the classes are more often confused with
others.</p>
        <p>Accuracy = Correct classifications / All classifications. (15)</p>
        <p>Figure 1: (a) Correlation matrix for SDSS-IV data. (b) Final correlation matrix for SDSS-IV data.</p>
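Both metrics can be read off a confusion matrix; a minimal sketch (Equation 15 is the diagonal sum over the total, and the off-diagonal cells show false categorizations):

```python
def confusion_matrix(y_true, y_pred, n_classes):
    # m[t][p] counts instances of true class t predicted as class p
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def accuracy(m):
    # Equation 15: correct classifications / all classifications
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total
```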
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        For our dataset, we have chosen data from Sloan Digital Sky Survey DR17 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (it was accessed
from [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). This was the fourth phase of the Sloan Digital Sky Survey (we will call it SDSS-IV
from now on). It contains 100000 observations, each containing (quoting [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]):
• obj_ID = Object Identifier, the unique value that identifies the object in the image catalogue
used by the CAS
• alpha = Right Ascension angle (at J2000 epoch)
• delta = Declination angle (at J2000 epoch)
• u = Ultraviolet filter in the photometric system
• g = Green filter in the photometric system
• r = Red filter in the photometric system
• i = Near Infrared filter in the photometric system
• z = Infrared filter in the photometric system
• run_ID = Run Number used to identify the specific scan
• rerun_ID = Rerun Number to specify how the image was processed
• cam_col = Camera column to identify the scanline within the run
• field_ID = Field number to identify each field
• spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different
observations with the same spec_obj_ID must share the output class)
• class = object class (galaxy, star or quasar object)
• redshift = redshift value based on the increase in wavelength
• plate = plate ID, identifies each plate in SDSS
• MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
• fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each
observation
      </p>
      <p>Figure 2: (a) Histograms for all normalized data, excluding redshift. (b) Histogram for normalised redshift.
(c) Number of different objects (0 - galaxies, 1 - QSOs, 2 - stars).</p>
<p>Some of that information will not be used for our classification, as it is contained in
SDSS-IV for cataloguing purposes (such as object identifiers). We will focus on: coordinates
alpha and delta; data from filtered channels u, g, r, i and z; and class, which is the aim of our
classification efforts.</p>
<p>After mapping and normalising, our data in the ultraviolet, green and infrared channels presented a strange
pattern, where basically all data is accumulated near the value 1.0. Upon further inspection it turned out
that one of the observed objects had some abnormal values (equal to -9999), so we
removed it from our dataset and then proceeded.</p>
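This cleanup is a one-line Pandas filter; a sketch on a toy frame with hypothetical column names matching the channels in the text (-9999 is the sentinel for the abnormal values):

```python
import pandas as pd

# toy frame standing in for the SDSS-IV photometric columns
df = pd.DataFrame({"u": [19.5, -9999.0, 18.2],
                   "g": [17.1, 16.9, 15.8],
                   "z": [16.0, 15.5, 14.9]})

# keep only rows where no filter channel holds the sentinel value
mask = (df[["u", "g", "z"]] != -9999.0).all(axis=1)
clean = df[mask].reset_index(drop=True)
```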
      <p>
        Now we will have a look at the correlation matrix (figure 1a) and address some of the relations:
• Coordinates have neutral relations with all the other data.
• Ultraviolet and green relation - green light is part of the spectrum of many stars similar
to the Sun (G-type main-sequence stars). Those stars also happen to emit a significant
part of their radiation as ultraviolet. An additional effect, which can also explain the moderate
relation with infrared and near-infrared light, is absorption of different wavelengths
by interstellar gas, which then re-emits in those wavelengths (heat radiation) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
• Infrared, near-infrared and red data have a strong relation - red stars are typically colder, but they
still emit a lot of infrared radiation. An additional factor - absorption and re-emission of
light - was mentioned above.
• The moderate relation of red, near-infrared and infrared light with redshift can be explained by many
objects detected as red having their colour shifted due to phenomena such as the Doppler effect. This
relation might be absent from other detectors, as light from stars outside the infrared might
have been cut off by stardust or shifted strongly enough to not be detected at all [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
<p>In general, it is easy to notice strong relations between red and infrared light. This phenomenon
might be related to extinction of light in space, which is more pronounced for shorter wavelengths. The
coordinates of our objects are mostly related to each other (but it is still a very weak relation). They
also have a purely neutral relation with most of the data from detectors, therefore we are going to
drop them. Our final correlation matrix is shown for the sake of clarity in figure 1b.</p>
<p>Additionally, we provide histograms for the SDSS-IV data; we plot them on one
histogram, excluding redshift, which is shown separately for clarity (figures 2a and 2b).
Looking at the number of each of the individual object types in our data (figure 2c),
we can notice a significant dominance of galaxies. Quasars and stars are similar in number,
with a small margin in favour of stars.</p>
<p>We will split our data into a train and test set with a test ratio of 0.2. After running the calculations
mentioned in the chapter before, we get:</p>
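With Sklearn the split and the three classifiers fit in a few lines; a sketch on synthetic stand-in data (the real features would come from the cleaned SDSS-IV DataFrame, and the reported accuracies below are from the paper, not this toy run):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# two well-separated synthetic clusters standing in for the SDSS-IV features
X = [[float(i)] for i in range(40)] + [[float(i + 100)] for i in range(40)]
y = [0] * 40 + [1] * 40

# test_size=0.2 matches the split ratio used in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scores = {}
for model in (KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(random_state=42),
              GaussianNB()):
    scores[type(model).__name__] = model.fit(X_train, y_train).score(X_test, y_test)
```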
<p>Figure 4: (a) Confusion matrix for KNN. (b) Confusion matrix for ID3. (c) Confusion matrix for Naive Bayes.</p>
<p>• For KNN we get 96.465% accuracy, which was best for a number of neighbours equal to 3,
as shown in figure 3, with the confusion matrix as in figure 4a.
• The decision tree achieved 96.78% accuracy (confusion matrix in figure 4b).
• Naive Bayes achieved the lowest accuracy of 92.11% (confusion matrix in figure 4c).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
<p>The behaviour of KNN accuracy was, as expected, decreasing with relation to its constant. On the
other hand, all the analysed algorithms achieved good accuracy (above 90%). The Bayes algorithm
turned out to have some problems distinguishing between galaxies and quasars (almost 1000
wrongly classified galaxies), although the two other algorithms also struggled there. KNN seems
to deal with this problem best, recognising ever so slightly more QSO objects than the others,
but it has more mismatches, recognising some of the galaxies as stars. None of the algorithms
had any problems recognising stars and rarely ever mismatched them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vaccari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prescott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grobler</surname>
          </string-name>
          ,
          <article-title>Cnn architecture comparison for radio galaxy classification</article-title>
          ,
          <source>Monthly Notices of the Royal Astronomical Society</source>
          <volume>503</volume>
          (
          <year>2021</year>
          )
          <fpage>1828</fpage>
          -
          <lpage>1846</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cuoco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morawski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicolaou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lahav</surname>
          </string-name>
          ,
          <article-title>Lstm and cnn application for core-collapse supernova search in gravitational wave real data</article-title>
          ,
          <source>Astronomy &amp; Astrophysics</source>
          <volume>669</volume>
          (
          <year>2023</year>
          )
          <article-title>A42</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y. Z. Yanxia</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Astronomy in the big data era</article-title>
          ,
          <source>Data Science Journal</source>
          (
          <year>2015</year>
          ). doi:10.5334/dsj-2015-011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Gulliver</surname>
          </string-name>
          ,
          <article-title>Performance analysis and prediction for mobile internet-of-things (iot) networks: a cnn approach</article-title>
          ,
          <source>IEEE Internet of Things Journal</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>13355</fpage>
          -
          <lpage>13366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Szczotka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sikora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zielonka</surname>
          </string-name>
          ,
          <article-title>Fuzzy logic type-2 intelligent moisture control system</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>238</volume>
          (
          <year>2024</year>
          )
          <fpage>121581</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kęsik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Winnicka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Strengthening the perception of the virtual worlds in a virtual reality environment</article-title>
          ,
          <source>ISA transactions 102</source>
          (
          <year>2020</year>
          )
          <fpage>397</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <article-title>Soft trees with neural components as image-processing technique for archeological excavations</article-title>
          ,
          <source>Personal and Ubiquitous Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>363</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Wickramasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kalutarage</surname>
          </string-name>
          ,
          <article-title>Naive bayes: applications, variations and vulnerabilities: a review of literature with code snippets for implementation</article-title>
          ,
          <source>Soft Computing</source>
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>2277</fpage>
          -
          <lpage>2293</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ukey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Survey on exact knn queries over high-dimensional data space</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>629</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>Package of scikit-learn</article-title>
          , https://scikit-learn.org/stable/,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          Pandas library, https://pandas.pydata.org/,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          Seaborn library, https://seaborn.pydata.org/,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          Matplotlib library, https://matplotlib.org/,
          <year>2023</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>Original source of data release 17 from sloan digital sky survey</article-title>
          , https://www.sdss4.org/dr17/,
          <year>2022</year>
          . Accessed: 2024-05-18.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>Source of our data at kaggle.com</article-title>
          , https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17,
          <year>2022</year>
          . Accessed: 2024-03-29.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Fedesoriano</surname>
          </string-name>
          ,
          <source>Stellar classification dataset - sdss17</source>
          ,
          <year>2022</year>
          . Retrieved May 18,
          <year>2024</year>
          , from https://www.kaggle.com/fedesoriano/stellar-classification-dataset-sdss17.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Article about infrared imaging</article-title>
          , https://www.skyatnightmagazine.com/space-science/infrared-astronomy,
          <year>2024</year>
          . Accessed: 2024-05-18.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>