<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Journal of Wisdom Political Science and Multidisciplinary Sciences</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv.2407.12687</article-id>
      <title-group>
        <article-title>Avoiding Type I Errors in Image Processing with SIFT/BRISK-keypoints on Android Smartphones</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmytro Zubov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Kupin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kryvyi Rih National University</institution>
          ,
          <addr-line>11 Vitaly Matusevich St., Kryvyi Rih, 50027</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Central Asia</institution>
          ,
          <addr-line>125/1 Toktogul St., Bishkek, 720001, Kyrgyz Republic</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3688</volume>
      <fpage>24</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Avoiding false-positive recognition of objects is a topical problem for specific areas, such as detecting traffic signs for visually impaired pedestrians, fire emergency signs inside buildings, and construction safety signs. Existing solutions show that the percentage of incorrectly recognized traffic signs can reach 25 % for smart vehicles. In this study, SIFT/BRISK-keypoints are employed to design the image descriptor. An experiment with ten images of crosswalk traffic signs and 90 other images (including different traffic signs) showed that the false positive rate is zero and the false negative rate equals 50 %. The implementation is based on the Java Android application with the possibility to correct the knowledge base in case of false alarms. Image analysis was performed on smartphones Doogee S96 Pro and Samsung M31 with an execution time of less than one second. The most likely prospect for further development of this study is the design of the set of image descriptors to improve the false negative rate avoiding type I errors at the same time.</p>
      </abstract>
      <kwd-group>
        <kwd>image processing</kwd>
        <kwd>type I error</kwd>
        <kwd>SIFT/BRISK-keypoint</kwd>
        <kwd>Android smartphones</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A zero false positive rate, i.e., no type I errors [1, 2], together with the minimization of the false negative rate, i.e., type II
errors [1, 2], forms the complex criterion employed in ad-hoc image processing projects, such as detecting
traffic signs for visually impaired pedestrians, fire emergency signs inside buildings, and
construction safety signs. Existing solutions show that the percentage of incorrectly recognized
traffic signs can reach 25 % for smart vehicles [3]. Up-to-date real-life implementations also demand
autonomous and low-energy solutions since Internet connections are often unstable, and the average
ChatGPT request consumes about 0.34 watt-hours and about 0.32176 ml of water [4]. The ecological
impact depends on the neural network models (NNMs): complex responses produce more CO2
emissions than simple responses, and NNMs that provide more accurate responses result in higher
emissions [5]. Reasoning models likewise produce more emissions compared to
concise response models [5]. Aliya Rysbek, a research software engineer at Google DeepMind UK
[6], pointed out at the KIT forum in Bishkek (Kyrgyz Republic) on 29th May 2025 that her team could
recently save about 1 % of the energy consumed by some NNMs which is a huge step considering a
tremendous number of requests processed by Google datacenters worldwide.</p>
      <p>In this study, the autonomous and low-energy software was developed using Java Android mobile
application and SIFT/BRISK-keypoints [7] (Scale-Invariant Feature Transform and Binary Robust
Invariant Scalable Keypoints), which is an implementation of the edge computing principle [8]. Power
efficiency is achieved by executing the performance-optimized code on the continuously running
smartphone without transmitting the data wirelessly. The presented approach employs a unique
image descriptor for every target object which is different from the previously developed method
[7], where a 291-point pattern is applied. Initially, a multithreaded Java Android application takes a
photo via the CameraX library [9], and then a scaling method
generates a new bitmap scaled to a maximum resolution of 500 pixels using bilinear filtering. From
up to 700 keypoints detected by the SIFT method, the keypoints with stable positions across different
SIFT octaves are selected. Next, the BRISK binary descriptor is designed considering keypoints that are
unique on the target image, and the distances to basic keypoints are calculated. Experiments
conducted on the Doogee S96 Pro and Samsung M31 smartphones demonstrated that the execution
time is less than one second, with a false positive rate of zero and a false negative rate of 50 %.</p>
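      <p>The Euclidean distance (2) between keypoint coordinates underpins both the keypoint selection and the descriptor conditions described later. A minimal sketch follows; the class and method names are illustrative, not the authors' code.</p>
      <preformat>
```java
// Sketch of the Euclidean distance (2) between two SIFT/BRISK-keypoints
// (x1, y1) and (x2, y2). Names are illustrative, not the authors' code.
public class KeypointDistance {

    public static double euclidean(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        // keypoints 19 and 46 from the 1360x1360 octave listed in Section 4
        System.out.println(euclidean(116, 1086, 1243, 1086)); // 1127.0
    }
}
```
      </preformat>
      <p>The same helper applies both to the closest-to-center keypoint selection and to the distance conditions of the descriptor.</p>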
      <p>The remaining part of the paper proceeds as follows: Section 2 reviews relevant works in the
context of the most cutting-edge computer vision techniques. It also introduces the proposed
soft/hardware architecture. Section 3 outlines the problem setup and the experiment setup from image
capturing to image matching. Section 4 presents a successful experiment conducted with 100 images.
Results and discussion are presented in Section 5 and Section 6, respectively. Conclusions are
summarized in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Up-to-date image processing algorithms emphasize accuracy, interpretability, transparency, speed,
and scalability while reducing computational costs [10]. Some of the most cutting-edge computer
vision techniques are as follows:
1. Convolutional neural networks (CNNs) [11, 12].
2. Vision transformers (ViTs) [13].
3. Segmentation techniques [14].
4. In addition, there are numerous other image processing techniques, such as generative
adversarial networks, super-resolution algorithms, adaptive histogram equalization, and
denoising algorithms [10].</p>
      <p>Two-dimensional CNNs, such as those presented in [11, 12], are prevalent in image processing
nowadays. They use convolution to detect patterns in images, and then classify them, detect objects,
apply semantic segmentation, etc. Prior to CNN processing, images undergo preprocessing steps such
as homogenization, normalization, and principal component analysis [12]. Basic CNN components
are the convolution layer, pooling layer, activation function, batch normalization, dropout, and fully
connected layer. The most common CNN models are AlexNet, ResNet, VGG, GoogleNet, Xception,
Inception, DenseNet, and EfficientNet [13].</p>
      <p>In contrast to CNNs, which depend on hierarchical feature extraction, ViTs analyze images as
sequences of smaller patches, enabling them to capture contextual information and long-range
dependencies, resulting in improved image recognition capacities [13].</p>
      <p>Image segmentation divides an image into distinct regions based on certain characteristics
[14, 15]. Techniques like U-Net architectures, Canny edge detection, and Mask R-CNNs provide
efficient and precise solutions [10]. In this study, each image is segmented into two regions: the
object O and the background B [15]. Following the application of a segmentation algorithm, pixels
or other image attributes are classified into either region O or B.</p>
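      <p>The two-region split described above can be sketched as a simple thresholding step. The threshold value and the names below are assumptions for illustration, not the authors' implementation.</p>
      <preformat>
```java
// Sketch: thresholding a grayscale image into the object region O and the
// background region B. The threshold value is an illustrative assumption.
public class ThresholdSegmentation {

    // returns true for pixels assigned to the object region O, false for B
    public static boolean[][] segment(int[][] image, int threshold) {
        int h = image.length;
        int w = image[0].length;
        boolean[][] object = new boolean[h][w];
        for (int y = 0; y != h; y++) {
            for (int x = 0; x != w; x++) {
                object[y][x] = image[y][x] > threshold;
            }
        }
        return object;
    }

    public static void main(String[] args) {
        int[][] image = {
            {10, 200},
            {30, 250}
        };
        boolean[][] o = segment(image, 128);
        System.out.println(o[0][1]); // true: bright pixel belongs to O
        System.out.println(o[0][0]); // false: dark pixel belongs to B
    }
}
```
      </preformat>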
      <p>The above-stated image processing techniques face common challenges in real-life
implementation [10]:</p>
      <p>1. High computational cost: values can differ several times. Solution: computational complexity
reduction (the code was performance-optimized in this study).
2. Overfitting diminishes the model's generalization ability, potentially leading to lower
accuracy. Solution: data augmentation and regularization (thresholding is used to separate
the object and background regions in this study).
3. Noise and distortion can compromise the accuracy of image processing algorithms. Solution:
images should be filtered (bilinear and Gaussian filters are employed in this study).
4. Interpretability and transparency of some AI-driven image processing methods. Solution:
non-AI-driven image processing methods (image processing with SIFT/BRISK-keypoints is
employed in this study).
5. Real-time processing constraints. Solution: high-performance soft-/hardware (multicore
smartphones were utilized in this study).
6. Ethical and privacy concerns. Solution: autonomous systems (edge computing with a mobile
Java Android application was implemented in this study).</p>
      <p>The growing demand for machine learning on mobile devices has led to the development of
lightweight CNN models, such as MobileNet [11, 16], which are optimized for use with limited
computational power and memory. MobileNet V1 employed depthwise separable convolutions, and
MobileNet V2 improved upon this with inverted residual blocks, further enhancing efficiency. Later
versions of MobileNet optimize performance for mobile CPUs. Data requirements are the key
drawback of lightweight CNN models since they require a significant amount of data to be trained
to achieve acceptable performance on mobile devices. Some projects, as presented in this study, lack
large datasets. This limitation leads to a loss of accuracy and challenges in training and optimization.</p>
      <p>In this study, the system requires a false positive rate of zero and a minimized false negative rate.
Considering the challenges outlined and the latest image processing techniques, the proposed
architecture of the project soft-/hardware is presented in Figure 1. In this prototype, the end-user
interacts with the smartphone via the simple user interface based on the button element with an
onClick listener [17]. The multithreaded Java Android application processes images captured by the
smartphone camera using the CameraX library and SIFT/BRISK-keypoints [7]. To update the
knowledge base, the mobile application should have the option to download the updated information
from Internet resources, such as GitHub and Firebase. For this purpose, the study proposes the use
of JSON data format [17] because of its lightweight nature and widespread use in mobile applications.</p>
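      <p>A knowledge-base record in this spirit might be serialized as follows. The field names are hypothetical, since the paper does not publish its JSON schema.</p>
      <preformat>
```java
// Sketch of a knowledge-base record serialized as lightweight JSON for
// download from GitHub or Firebase. The field names (sign, thresholdV, rsd)
// are assumptions, not the authors' schema.
public class KnowledgeBaseEntry {

    public static String toJson(String sign, int thresholdV, double rsd) {
        StringBuilder sb = new StringBuilder();
        sb.append("{");
        sb.append("\"sign\":\"").append(sign).append("\",");
        sb.append("\"thresholdV\":").append(thresholdV).append(",");
        sb.append("\"rsd\":").append(rsd);
        sb.append("}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // example values taken from the Results section (Rule 1, RSD = 0.05)
        System.out.println(toJson("Crosswalk left", 20, 0.05));
    }
}
```
      </preformat>
      <p>Such a compact record can be fetched over an unstable connection and cached on the device, matching the autonomy requirement stated in the introduction.</p>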
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Problem setup</title>
        <p>The following notation is used in the remainder of the paper. Consider a grayscale
image I(x, y) with n SIFT/BRISK-keypoints [7] whose coordinates are pairs of real numbers (xn, yn).</p>
        <p>[Figure 1: The proposed architecture of the soft-/hardware complex. The diagram connects the end-user, the user interface, a photo with or without the target object, the knowledge base (GitHub, Google Firebase) exchanged as JSON, the Java Android image processing application, and the environment.]</p>
        <sec id="sec-3-1-6">
          <title>Algorithm A</title>
          <p>In this study, algorithm A employs a score that quantifies the difference between the object and
background regions. This score is determined based on various conditions cj (j&lt;m, where m=m1+m2
is the number of conditions in algorithm A; m1 is the number of obligatory conditions; m2 is the
number of optional conditions), the Euclidean distances (2) and lines (3), and the pixel values of a grayscale image I(x, y). If
1{cj} represents the indicator function, which returns 1 if cj is true and 0 otherwise, the characteristic
mob of the object region is calculated as follows:
mob = (∏ j=0..m1−1 1{cj}) · (∑ j=m1..m1+m2−1 1{cj}), (5)
where the m1 conditions in the product operation are obligatory and the m2 conditions in the summation
operation are optional.</p>
          <p>Similar to mob (5), the characteristic mbg of the background region is calculated in the same way with the
given conditions cj.</p>
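          <p>Under the definitions above, the characteristics and the thresholded score of algorithm A can be sketched as follows. This is an illustrative reading of equations (5)-(7), not the authors' code; the example values in the test come from the Results section.</p>
          <preformat>
```java
// Sketch of the characteristic from equation (5): the product of the
// obligatory indicators 1{cj} multiplied by the sum of the optional
// indicators 1{cj}. The same method serves for mob and mbg, fed with the
// corresponding condition truth values. Names are illustrative.
public class Characteristic {

    // obligatory[j] and optional[j] hold the truth values of the conditions cj
    public static int value(boolean[] obligatory, boolean[] optional) {
        int product = 1;
        for (boolean c : obligatory) {
            // a single failed obligatory condition zeroes the whole characteristic
            if (!c) {
                product = 0;
            }
        }
        int sum = 0;
        for (boolean c : optional) {
            if (c) {
                sum = sum + 1;
            }
        }
        return product * sum;
    }

    // thresholded score: 1 indicates presence, 0 absence of the target object
    public static int score(int mob, int thresholdV) {
        return (mob >= thresholdV) ? 1 : 0;
    }

    public static void main(String[] args) {
        boolean[] obligatory = {true, true, true};
        boolean[] optional = {true, false, true, true};
        int mob = value(obligatory, optional); // all obligatory hold, 3 optional hold
        System.out.println(score(mob, 2));
    }
}
```
          </preformat>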
          <p>[Equations (1)-(7) are not reproduced here: (1) defines the set of keypoint coordinates, (2) the Euclidean distance between keypoints, and (3) the lines connecting keypoints; (4)-(7) define the characteristics mob and mbg and the score of algorithm A.]</p>
        </sec>
        <sec id="sec-3-1-7">
          <title>Score and threshold</title>
          <p>Then, a score is computed using the threshold value V. The threshold value V is determined from the
training data. Thus, algorithm A indicates the presence (score = 1) or absence (score = 0) of the target object
in the grayscale image I(x, y).</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experiment setup</title>
        <p>In this study, a group named "Crosswalk" of traffic signs ("Crosswalk right", "Crosswalk left", and
"Zebra crossing"; see Figure 2) is identified in the image using the standard representation officially
accepted in the Kyrgyz Republic [18]. The "Crosswalk" group is a subset of the traffic signs intended
for pedestrians. Avoiding type I errors in this object recognition is a crucial point in the spatial
cognition of visually impaired people [7].</p>
        <p>The core steps of the above-stated algorithm A are as follows:</p>
        <p>1. Capturing the image with the smartphone camera and the CameraX Android API.
2. Downsampling the image with bilinear filtering in a Java Android application.
3. Localization of SIFT/BRISK-keypoints.
4. Selection of SIFT/BRISK-keypoints that have stable positions across different SIFT octaves.
5. Designing the image descriptor based on selected SIFT/BRISK-keypoints and algorithm A (7).
6. Image matching.</p>
        <p>[Figure 2: The "Crosswalk" group of traffic signs (panels A-C): "Crosswalk right", "Crosswalk left", and "Zebra crossing".]</p>
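        <p>Step 2, downsampling with bilinear filtering, can be sketched as follows. This is an illustrative corner-aligned bilinear resampler in plain Java, not the authors' CameraX-based implementation.</p>
        <preformat>
```java
// Sketch: bilinear resampling of a grayscale image, as used when scaling an
// image down (e.g., so that the larger side does not exceed 500 pixels).
// The corner-aligned coordinate mapping is an illustrative choice.
public class BilinearDownsample {

    public static int[][] resize(int[][] src, int dstH, int dstW) {
        int srcH = src.length;
        int srcW = src[0].length;
        int[][] dst = new int[dstH][dstW];
        for (int y = 0; y != dstH; y++) {
            for (int x = 0; x != dstW; x++) {
                // map the destination pixel back into source coordinates
                double sy = (dstH == 1) ? 0 : y * (srcH - 1.0) / (dstH - 1.0);
                double sx = (dstW == 1) ? 0 : x * (srcW - 1.0) / (dstW - 1.0);
                int y0 = (int) sy, x0 = (int) sx;
                int y1 = Math.min(y0 + 1, srcH - 1);
                int x1 = Math.min(x0 + 1, srcW - 1);
                double fy = sy - y0, fx = sx - x0;
                // weighted average of the four neighbouring source pixels
                double v = src[y0][x0] * (1 - fy) * (1 - fx)
                         + src[y0][x1] * (1 - fy) * fx
                         + src[y1][x0] * fy * (1 - fx)
                         + src[y1][x1] * fy * fx;
                dst[y][x] = (int) Math.round(v);
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        int[][] src = {
            {0, 100},
            {100, 200}
        };
        int[][] dst = resize(src, 3, 3);
        System.out.println(dst[1][1]); // center interpolates to 100
    }
}
```
        </preformat>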
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Selection of SIFT/BRISK-keypoints with stable positions across different SIFT octaves</title>
        <sec id="sec-4-1-2">
          <title>Localization of stable keypoints</title>
          <p>The image capturing by the smartphone's camera and the CameraX Android API, downsampling the
image with bilinear filtering in the Java Android application, and localization of SIFT/BRISK-keypoints
are the algorithm steps, which are similar to those described in [7]. An example of the selection of
SIFT/BRISK-keypoints on the traffic sign "Crosswalk left" with stable positions across different SIFT
octaves is shown in Figure 3, where the first octave of size 340×340 pixels contains 49
SIFT/BRISK-keypoints, the second octave of 680×680 pixels contains 109 SIFT/BRISK-keypoints, and the third octave
of 1360×1360 pixels contains 124 SIFT/BRISK-keypoints (fuchsia and turquoise colors are used for the
third/second/first and fourth/third/second DoG (Difference of Gaussians) functions, respectively).
The values of the population standard deviations in the Gaussian blur operator are consistent with
those presented in [7]. Only the 700 keypoints that are closest to the center of the image, based on
Euclidean distance (2), are considered.</p>
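          <p>The selection of the keypoints closest to the image center can be sketched as follows. The names are illustrative; only the center-distance ordering and the keypoint limit come from the text.</p>
          <preformat>
```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: keep only the k keypoints closest to the image center by Euclidean
// distance (2), as done for the limit of 700 keypoints. Names are illustrative.
public class CentralKeypoints {

    // points[i] = {x, y}; (cx, cy) is the image center
    public static double[][] closestToCenter(double[][] points, double cx, double cy, int k) {
        double[][] sorted = points.clone();
        // sort by squared distance to the center (the square root is monotone)
        Arrays.sort(sorted, Comparator.comparingDouble(
                (double[] p) -> (p[0] - cx) * (p[0] - cx) + (p[1] - cy) * (p[1] - cy)));
        return Arrays.copyOf(sorted, Math.min(k, sorted.length));
    }

    public static void main(String[] args) {
        double[][] pts = { {10, 10}, {680, 680}, {600, 700}, {0, 1359} };
        double[][] kept = closestToCenter(pts, 680, 680, 2);
        System.out.println(Arrays.deepToString(kept)); // [[680.0, 680.0], [600.0, 700.0]]
    }
}
```
          </preformat>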
          <p>Analysis of three SIFT octaves shows that 47 SIFT/BRISK-keypoints have stable positions on these
SIFT octaves (see Figure 4). Some coordinates of SIFT/BRISK-keypoints are as follows (the octave of size
1360×1360 pixels is used; the origin of coordinates is at the top left; the x axis is horizontal; the y axis is
vertical): (x0, y0)=(283, 1064), (x1, y1)=(353, 961), (x19, y19)=(116, 1086), (x31, y31)=(679, 111),
(x46, y46)=(1243, 1086).</p>
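          <p>A possible reading of the stability test across octaves is sketched below: a keypoint counts as stable when a matching keypoint is found, within a small tolerance, in every octave after rescaling to the 1360×1360 frame. The tolerance value and the names are assumptions for illustration.</p>
          <preformat>
```java
// Sketch: a keypoint is treated as stable when, after scaling its coordinates
// to a common octave frame, a keypoint lies within a small radius in every
// octave. The tolerance is an illustrative assumption, not the authors' value.
public class StableKeypoints {

    static boolean foundNear(double[][] pts, double x, double y, double tol) {
        for (double[] p : pts) {
            double dx = p[0] - x, dy = p[1] - y;
            if (tol * tol >= dx * dx + dy * dy) {
                return true;
            }
        }
        return false;
    }

    // octaves[o] holds the keypoints of octave o; scale[o] maps octave o to the common frame
    public static boolean stable(double[][][] octaves, double[] scale, double x, double y, double tol) {
        for (int o = 0; o != octaves.length; o++) {
            double[][] scaled = new double[octaves[o].length][2];
            for (int i = 0; i != octaves[o].length; i++) {
                scaled[i][0] = octaves[o][i][0] * scale[o];
                scaled[i][1] = octaves[o][i][1] * scale[o];
            }
            if (!foundNear(scaled, x, y, tol)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // the same physical point seen in octaves of 340, 680, and 1360 pixels
        double[][][] octaves = {
            { {29.0, 271.5} },
            { {58, 543} },
            { {116, 1086} }
        };
        double[] scale = {4.0, 2.0, 1.0}; // scale each octave up to the 1360x1360 frame
        System.out.println(stable(octaves, scale, 116, 1086, 6.0)); // true
    }
}
```
          </preformat>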
          <p>[Figure 4: SIFT/BRISK-keypoints with stable positions: octave 340×340 pixels (A), octave 680×680 pixels (B), octave 1360×1360 pixels (C); the numeric labels denote the indices of the stable keypoints.]</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Designing the image descriptor based on the selected SIFT/BRISK-keypoints and algorithm A (7)</title>
          <p>In this study, the following descriptive visual attributes [19] are employed to design indicator
functions 1{cj} using a few human-friendly text descriptions:
1. SIFT/BRISK-keypoints 19, 31, and 46 are the basic components that form a triangle with the other
SIFT/BRISK-keypoints located inside.
2. The grayscale image has an average pixel value denoted as Iav.
3. Euclidean distances (2) between SIFT/BRISK-keypoints.
4. Lines (3) connecting various SIFT/BRISK-keypoints.</p>
          <p>The obligatory conditions cj (j&lt;m1; m1=3 in this study) were formulated by the human expert as
follows:
1. c0: the distances between SIFT/BRISK-keypoints 19-31, 19-46, and 31-46 should be greater
than 100 pixels and equal to one another, with a relative standard deviation (RSD) of 0.05.
2. c1: the pixel values at SIFT/BRISK-keypoints 19, 31, and 46 should be greater than (Iav-20),
indicating that the intensity must be light.
3. c2: the pixel values on the lines (3) connecting SIFT/BRISK-keypoints 19-31, 19-46, and 31-46
should be greater than (Iav-20); a 5 % error is allowed in c2.</p>
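          <p>Condition c0 can be sketched as follows. The names are illustrative; the 100-pixel bound and the 0.05 RSD limit come from the text.</p>
          <preformat>
```java
// Sketch of the obligatory condition c0: the three pairwise distances between
// keypoints 19, 31, and 46 must exceed 100 pixels and be nearly equal, with a
// relative standard deviation (RSD) of at most 0.05. Names are illustrative.
public class ConditionC0 {

    // RSD = population standard deviation divided by the mean
    public static double rsd(double[] v) {
        double mean = 0;
        for (double x : v) {
            mean = mean + x;
        }
        mean = mean / v.length;
        double var = 0;
        for (double x : v) {
            var = var + (x - mean) * (x - mean);
        }
        var = var / v.length; // population variance
        return Math.sqrt(var) / mean;
    }

    public static boolean c0(double d1, double d2, double d3) {
        // every pairwise distance must exceed 100 pixels
        if (Math.min(d1, Math.min(d2, d3)) > 100) {
            // the three distances must be nearly equal
            return 0.05 >= rsd(new double[]{d1, d2, d3});
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(c0(500, 505, 495)); // true: near-equilateral triangle
        System.out.println(c0(500, 300, 500)); // false: RSD too large
    }
}
```
          </preformat>
          <p>The optional conditions below follow the same pattern, replacing the equality test with a match against the template-image distances.</p>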
          <p>The optional conditions cj (m1≤j&lt;(m1+m2); m2=44 in this study) were formulated by a human
expert as follows:
6. c8: the distances between SIFT/BRISK-keypoints 5-19, 5-31, and 5-46 should match the
calculated distances on the template image (see Figure 3) with RSD=0.05. Additionally, the
pixel value at SIFT/BRISK-keypoint 5 should correspond to the relevant
SIFT/BRISK-keypoint on the template image (it must be greater than (Iav-20) in this study).
44. c46: the distances between SIFT/BRISK-keypoints 45-19, 45-31, and 45-46 should match the
calculated distances on the template image (see Figure 3) with RSD=0.05. Additionally, the
pixel value at SIFT/BRISK-keypoint 45 should correspond to the relevant
SIFT/BRISK-keypoint on the template image (it must be less than (Iav-30) in this study).</p>
          <p>The conditions c0-c46 were designed to specifically target the features of crosswalk signs (see
Figure 2).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this study, a Java Android mobile application implements the proposed image processing method
to detect a specific group of Kyrgyz traffic signs. An experiment with ten images of
crosswalk traffic signs and 90 other images (including different traffic signs) showed a false positive
rate of zero and a false negative rate of 50 %. Image analysis was performed on the Doogee
S96 Pro and Samsung M31 smartphones with an execution time of less than one second. All original color
pictures/photos, taken by co-author Dr. Dmytro Zubov, and their grayscale versions with keypoints generated by the SIFT algorithm were
uploaded to the Google Drive folder
https://drive.google.com/drive/folders/15Dk27s8_2mIZnBsLNcq8j11WJbeFAFVP.
Figure 5 shows nine examples of pictures and photos with
average pixel values used in the experiment. Figure 6 presents the grayscale images with
SIFT/BRISK-keypoints for the initial data presented in Figure 5.</p>
      <p>The analysis of the experimental results shows that the target image can be identified using two
distinct rules:</p>
      <p>Rule 1 (for pictures of traffic signs [18]): the threshold value V equals 20 (the closest value
of mob equals 36 for a traffic sign; the closest value of mbg equals 14 for the picture
uploaded in subfolder 12 on Google Drive).</p>
      <p>Rule 2 (for photos taken by the smartphone): the threshold value V equals 9 (the closest
value of mob equals 9 for the traffic sign presented in Figure 5(G); the closest
value of mbg equals 6 for the photo uploaded in subfolder 91 on Google Drive).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>When the project was presented at the Department of Computer Science of the
University of Central Asia and the Department of Computer Systems and Networks of the Kryvyi
Rih National University, two questions were raised during the discussion:</p>
      <p>1. The segmentation technique was employed to process images. However, it was noted that
CNNs are widely used nowadays. An explanation was provided regarding the interpretability
and transparency of the applied approach, especially in safety-critical projects where
understanding the decision-making process is essential.</p>
      <p>2. The smartphone application was developed using the Java programming language. Although this
is a common choice for Android mobile applications, it was suggested to consider using the
Kotlin programming language and the Jetpack Compose declarative paradigm to enhance
code readability for other developers.</p>
      <p>Additionally, the Gemini generative artificial intelligence chatbot [20] was requested to provide
recommendations on what should be improved in the research paper (see Figure 7) on June 28, 2025.
The response included several suggestions for improvement, categorized by different sections of the
manuscript. Three general recommendations are as follows:
1. Consistency in Terminology: While you use "SIFT/BRISK-keypoints" throughout, sometimes
"SIFT keypoints" are mentioned in isolation (e.g., in the experiment section, "Localization of
SIFT/BRISK-keypoints" is followed by "Selection of SIFT/BRISK-keypoints"). Ensure
consistency or clarify if there's a specific reason for the distinction at certain points.
2. Flow and Transitions: Some transitions between paragraphs and sections could be smoother.
Consider using more explicit linking phrases or sentences to guide the reader.
3. Conciseness: While the paper is generally well-structured, some sentences or phrases could
be more concise. For example, instead of "The most likely prospect for further development
of this study is the design of the set of image descriptors to improve the false negative rate
avoiding type I errors at the same time", you could simplify it to "Future work will focus on
designing a set of image descriptors to improve the false negative rate while maintaining
zero Type I errors."</p>
      <p>Some of the above-stated recommendations, such as paraphrasing sentences in conclusions, have
already been considered. Other suggestions are discussed in references or are not critical in the
presented study, and hence they can be taken into account in future work.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>This study presents a new method of image processing with SIFT/BRISK-keypoints and descriptive
visual attributes, implemented in the developed prototype of a Java Android mobile application. The
interpretability, transparency, and zero false positive rate of the applied approach are the key
advantages.</p>
      <p>The core steps of the image processing algorithm are as follows:
1. Capturing the image with the smartphone camera.
2. Downsampling the image with bilinear filtering.
3. Localization of SIFT/BRISK-keypoints.
4. Selection of SIFT/BRISK-keypoints with stable positions across different SIFT octaves.
5. Designing the image descriptor based on the selected SIFT/BRISK-keypoints and descriptive
visual attributes.
6. Image matching.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work and the research behind it received support from the universities where the authors
conducted the study. The authors express their sincere gratitude to colleagues at the University of
Central Asia and Kryvyi Rih National University who contributed to this project.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>In preparing this work, the authors employed the Grammarly writing assistant [21] for grammar and
spelling errors, as well as the Gemini generative AI chatbot to discuss the results of the study.
Following the use of these tools, the authors reviewed and edited the content. The authors take full
responsibility for the content of this publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Shreffler, M. R. Huecker, Type I and Type II Errors and Statistical Power, 2023. URL: https://www.ncbi.nlm.nih.gov/books/NBK557530/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. D. Lieberman, W. A. Cunningham, Type I and Type II Error Concerns in fMRI Research: Rebalancing the Scale, Social Cognitive and Affective Neuroscience 4.4 (2009) 423-428. doi:10.1093/scan/nsp052.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] D., -Ready Traffic Sign Recognition Systems in Cars: A Test Field Study, Energies 14.12 (2021) 3697. doi:10.3390/en14123697.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>