<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1109/TPAMI.2024.3382294</article-id>
      <title-group>
        <article-title>Improving Navigation Systems with Computer Vision Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artem Panchenko</string-name>
          <email>artem.panchenko@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denys Drobin</string-name>
          <email>drobin2022mf11@student.karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kyrylo Rukkas</string-name>
          <email>rukkas@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Morozova</string-name>
          <email>a.morozova@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lyudmyla Polyakova</string-name>
          <email>l.yu.polyakova@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Zaretska</string-name>
          <email>zaretskaya@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>V.N. Karazin Kharkiv National University</institution>
          ,
          <addr-line>4 Svobody sq., Kharkiv, 61101</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>34</volume>
      <fpage>657</fpage>
      <lpage>668</lpage>
      <abstract>
        <p>This work presents an analysis of existing assistive technologies for people with visual impairments and introduces a specialized mobile application designed to ensure safe navigation in indoor environments. The study provides an overview of computer vision methods and semantic segmentation techniques applied to navigation tasks for people with disabilities. To estimate safe walking distances, a real-time depth estimation algorithm was proposed. The developed methods were integrated into a mobile application and tested under real-world indoor conditions. The accuracy of the system was evaluated through experiments conducted across various locations.</p>
      </abstract>
      <kwd-group>
        <kwd>computer vision</kwd>
        <kwd>navigation</kwd>
        <kwd>software development</kwd>
        <kwd>neural networks</kwd>
        <kwd>semantic segmentation</kwd>
        <kwd>image processing</kwd>
        <kwd>image segmentation model</kwd>
        <kwd>database</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Vision, as the most dominant human sense, is essential for virtually all aspects of daily life. While often
taken for granted, the absence or impairment of vision significantly hinders the ability to learn, move
independently, read, and participate fully in educational and professional activities. Visual impairment
arises when an ocular condition disrupts the normal functioning of the visual system. It is estimated that,
over a lifetime, nearly every individual will experience at least one eye condition requiring appropriate
medical attention.</p>
      <p>
        According to research conducted by the World Health Organization [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], approximately 2.2 billion
people worldwide experience various forms of visual impairment. Among them, roughly 1.1 billion
individuals have conditions that affect their ability to see clearly and distinguish objects at a distance.
Notably, around 200 million people suffer from conditions that prevent them from navigating safely
and independently.
      </p>
      <p>
        To facilitate spatial orientation, individuals with visual impairments typically rely on the use of a
white cane [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This tool enables users to detect obstacles along their path, such as curbs, stairs, or
objects on the floor, and also signals to others that the user has a visual impairment. However, the
white cane has several limitations that can significantly affect both safety and overall user experience:
1. Limited range of detection – the cane can only detect objects within immediate proximity, which
restricts the user’s ability to plan longer routes.
2. Incomplete detection of obstacles – low-lying or elevated obstacles, such as objects at head level
or narrow protrusions, may go unnoticed, potentially compromising safety.
3. Reduced effectiveness in complex or crowded environments – narrow corridors, stairways with
obstacles, and densely populated areas can hinder safe and efficient use of the cane.
4. Dependence on user skill – effective navigation with the cane requires specialized training;
without it, users may struggle to orient themselves efficiently.
      </p>
      <p>
        The present study focuses on addressing the aforementioned limitations associated with the use
of the white cane. We propose the development of a mobile application leveraging Computer Vision
technologies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which aims to enhance the mobility and safety of individuals with visual impairments
in the following ways:
1. The use of a smartphone camera for navigation removes the distance limitations inherent to
a white cane and enables the detection and avoidance of all obstacles, rather than only those
located on the floor.
2. A mobile application also facilitates efficient navigation in confined spaces, as it eliminates the
need for active manipulation of a cane with the hand.
3. Moreover, the challenge of learning to use a white cane—which typically requires specialized
facilities and professional supervision—is addressed. By leveraging a mobile application for
spatial orientation, visually impaired individuals can independently develop navigation skills
while maintaining a high level of safety, with the smartphone providing real-time information
about obstacles along the route.
      </p>
      <p>The foregoing considerations underscore the primary objective of this study: the design and
development of a mobile application specifically aimed at assisting individuals with visual impairments in
navigating indoor environments safely and efficiently. Enclosed spaces, including shopping centers,
office buildings, hospitals, and residential complexes, present a variety of potential hazards for untrained
individuals moving without adequate visual guidance. These hazards may include obstacles at different
heights, narrow passageways, staircases, and dynamic elements such as other people or moving objects.
In such contexts, traditional assistive tools, such as the white cane, provide only limited coverage and
require specialized training to use effectively. The proposed mobile application seeks to address these
limitations by integrating computer vision technologies to provide real-time information about the
surrounding environment, detect obstacles of varying sizes and positions, and guide users along safe
paths. By doing so, this solution aims to substantially improve both the autonomy and the safety of
visually impaired individuals in complex indoor settings, offering a practical complement—or in some
cases an alternative—to conventional mobility aids.</p>
      <p>This paper should be regarded as both an introduction to the research area and a proof of concept
for a mobile application designed to facilitate navigation for individuals with visual impairments. The
paper is structured as follows:
• Section 3 presents the advantages and methodological approaches for the practical implementation
of a contemporary solution to computer vision tasks, specifically focusing on Image Semantic
Segmentation.
• Section 4 details the architectural design and software engineering decisions underlying the
developed application.
• Section 5 demonstrates the application’s usage in practice, illustrating its operational workflow,
user interaction, and functional capabilities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Adaptation of Mobile Devices for Visually Impaired Users</title>
        <p>Modern smartphones are equipped with a variety of accessibility features by default, among which
screen readers play a central role. These tools provide an audio-based interface for interacting with the
device and are comprehensive solutions offered by both iOS and Android operating systems. Screen
readers operate by converting visual information into audio output, allowing users to navigate the
interface through a combination of voice feedback and gesture-based commands.</p>
        <p>
          According to a study conducted by WebAIM, which surveyed nearly 2,000 visually impaired
individuals, of whom 76.6% were completely blind, almost 90% of respondents rely on smartphones equipped
with screen readers as their primary means of communication and information access [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Despite their effectiveness, current screen reader systems have notable limitations. They are primarily
designed for digital content navigation and are less effective for real-world spatial awareness or obstacle
detection. Users must rely on traditional mobility aids, such as the white cane, for safe navigation in
unfamiliar environments.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Be My Eyes</title>
        <p>
          Be My Eyes is a platform that leverages real-time video communication to connect visually impaired
users with a network of sighted volunteers [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The application assists users in reading text, navigating
environments, and identifying objects by having volunteers provide audio descriptions of the visual
input captured through the user’s smartphone camera.
        </p>
        <p>A key advantage of this approach lies in its high accuracy and contextual understanding, as human
volunteers can interpret subtle details and nuances that automated systems may miss. This allows for
more precise and situation-specific guidance compared to purely algorithmic solutions.</p>
        <p>However, the platform also exhibits notable limitations. Its functionality depends on the availability
of volunteers and requires a constant Internet connection, which may restrict usability in areas with
limited connectivity or during periods of high demand. Additionally, reliance on human volunteers can
introduce variability in response time and guidance quality, which may affect the overall reliability of
the service.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Seeing AI (Microsoft)</title>
        <p>
          Seeing AI by Microsoft is a standalone application that employs artificial intelligence algorithms to
analyze the surrounding environment using a smartphone camera [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The application provides a wide
range of object and image recognition functions, including short text and document recognition, product
identification via barcodes, detection of major currency denominations, and more. These capabilities
offer substantial benefits to visually impaired individuals, facilitating daily tasks and enhancing their
autonomy and independence.
        </p>
        <p>A key advantage of Seeing AI is its ability to operate in real time even with limited network access,
thanks to its neural network architecture optimized for local computations. Unlike human-assisted
systems such as Be My Eyes, Seeing AI functions independently of volunteers, ensuring user autonomy
and complete privacy.</p>
        <p>Despite these strengths, the system also presents notable limitations. Automated AI-based solutions
can struggle with complex contextual scenes, particularly in environments with intricate spatial
arrangements or unusual lighting conditions. Additionally, while the system provides consistent performance,
it may lack the nuanced contextual understanding that a human volunteer could provide in ambiguous
or unexpected situations.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Conclusions</title>
        <p>In summary, it can be observed that individuals with visual impairments widely utilize a variety of mobile
applications to enhance their quality of life, particularly in the areas of navigation and as a supplement or
alternative to the traditional white cane. Notably, there is an increasing adoption of artificial intelligence
techniques, especially computer vision, to address the challenges associated with safe and autonomous
mobility for visually impaired users. These technologies enable real-time environmental analysis,
obstacle detection, and contextual understanding, providing users with enhanced independence and
confidence in navigating both familiar and unfamiliar spaces.</p>
        <p>An analysis of the most popular applications on the market, such as Seeing AI and Be My Eyes,
allows for the identification of several key requirements for developing effective assistive products:
1. Offline Functionality: The application should be capable of operating without a continuous
Internet connection. This requirement stems from the fact that users may encounter locations
where stable connectivity is unavailable, such as in subway systems, trains, or remote indoor
environments. Offline operation ensures consistent accessibility and safety in these scenarios.
2. Focused Operational Context: It is essential to concentrate the application’s functionality on a
specific type of environment. Attempting to address too wide a spectrum of use cases can reduce
accuracy, primarily due to undertraining of computer vision models. By tailoring the application
to well-defined locations, the system can achieve higher precision in obstacle detection, scene
understanding, and overall user guidance.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Image Semantic Segmentation</title>
      <p>In the field of computer vision, several primary directions have emerged in image processing. One
of the earliest and simplest tasks is image classification, in which a model identifies the dominant
object within an image and provides an overall assessment of the scene. Classification with localization
extends this approach by not only identifying the object but also indicating its position within the scene
using a bounding box.</p>
      <p>
        A more complex task is multi-object classification, which involves detecting and categorizing multiple
objects within a single image. This challenge requires advanced neural network architectures, such as
YOLO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or R-CNN [8], capable of handling overlapping and spatially complex objects. Additionally,
the Mask R-CNN algorithm not only classifies objects but also generates precise segmentation masks for
each individual instance. This instance-level segmentation is particularly important in scenarios where
objects of the same class overlap, as conventional bounding boxes alone are insufficient for accurate
scene understanding.
      </p>
      <p>Semantic segmentation [9] aims to classify each individual pixel in an image, partitioning the scene
into meaningful semantic regions. Unlike simpler tasks such as image classification, which assigns
a single label to the entire image, or object detection, which localizes objects using bounding boxes,
semantic segmentation provides precise identification of objects and their boundaries. By analyzing
each pixel and its interaction with surrounding pixels, this task becomes computationally intensive,
posing significant challenges for deployment on mobile devices with limited processing power and
energy constraints.</p>
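      <p>To make the per-pixel nature of the task concrete, the following minimal Python sketch shows how a dense class map, and from it a walkable-surface mask, can be derived from segmentation logits. The number of classes, the image size, and the floor class id are illustrative assumptions rather than values taken from the developed system.</p>
      <preformat>
# Sketch only: semantic segmentation assigns a class id to every pixel.
# Shapes and the floor class id below are assumptions for illustration.
import numpy as np

num_classes, height, width = 20, 512, 512
logits = np.random.rand(num_classes, height, width)   # stand-in for model output

# Per-pixel classification: pick the most probable class at each pixel.
class_map = logits.argmax(axis=0)                     # shape (H, W), values 0..19

# A "safe surface" mask is then simply the set of pixels labelled as floor.
FLOOR_CLASS_ID = 2                                    # hypothetical id
floor_mask = (class_map == FLOOR_CLASS_ID)
print(class_map.shape, floor_mask.mean())             # fraction of walkable pixels
      </preformat>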
      <p>Semantic segmentation appears to be the most suitable approach for identifying safe surfaces for
navigation for several key reasons. The task requires the detection of continuous safe surfaces, which
differs substantially from conventional object-level classification tasks. For visually impaired individuals
learning to navigate with a white cane, understanding the holistic structure of traversable surfaces,
including their boundaries and potential obstacles, is essential. This pixel-level comprehension enables
the creation of safer and more reliable navigation systems that provide real-time guidance in complex
indoor environments.</p>
      <p>In summary, semantic segmentation represents the most suitable computer vision approach for
solving the problem of safe navigation for individuals with visual impairments. Unlike classification
or object detection methods, which provide only coarse information about objects or their locations,
semantic segmentation delivers a detailed, pixel-level representation of the environment. This enables
the detection of continuous traversable surfaces, accurate boundary delineation, and the identification of
potential collision hazards. Such capabilities are essential for ensuring safety and reliability in assistive
navigation systems. Therefore, semantic segmentation should be regarded as the optimal technological
foundation for the development of applications aimed at enhancing independent mobility for visually
impaired users.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Application Implementation</title>
      <p>In this section, the detailed implementation of the proposed mobile application will be presented.
Building upon the previously outlined analysis of computer vision techniques, particular emphasis will
be placed on integrating semantic segmentation as the core technology for ensuring safe navigation. The
section will describe the system architecture, the choice of neural network models, and the optimization
strategies adopted to enable real-time performance on mobile devices with limited computational
resources. Additionally, attention will be given to the user interface design and accessibility features,
ensuring that the application can be effectively utilized by individuals with visual impairments in
real-world conditions.</p>
      <sec id="sec-4-1">
        <title>4.1. System Architecture Description</title>
        <p>The developed information system for the identification of safe mobility surfaces employs a
multilayered architecture that ensures efficient interaction among system components and optimal utilization
of the computational resources of a mobile device. The system architecture consists of four principal
layers: the presentation layer, the business logic layer, the data processing layer, and the machine
learning model layer. Each layer is responsible for a specific set of functionalities:
• Presentation Layer. The presentation layer is implemented through the MainActivity component,
which is responsible for user interaction and the visualization of analysis results. This component
incorporates user interface elements such as buttons for selecting images, capturing photos via
the camera, and display areas for both input and processed images. At this level, mechanisms
for handling permissions related to camera and storage access are implemented, alongside the
interaction logic with Android system components required for acquiring images.
• Business Logic Layer. The business logic layer is represented by the functions
processBitmapInternal and processSelectedImage, which receive images from the presentation layer and prepare them
for subsequent utilization at the data processing layer. At this stage, asynchronous processing
routines are employed, enabling efficient data handling without monopolizing the computational
resources of the mobile device.
• Data Processing Layer. The data processing layer comprises image transformation functions such
as rotateBitmap, which adjusts the orientation of the image, and bitmapToArray, which converts
the raw image representation into a structured data array. At this stage, preliminary image
preprocessing is performed to ensure the correct operation of the subsequent neural network
algorithms.
• Machine Learning Model Layer. The machine learning model layer is represented by the
OnnxSegmentationProcessor component, which encapsulates the interaction logic with the neural
network implemented in the ONNX format. This component performs semantic segmentation of
images in order to identify safe mobility surfaces. By abstracting the details of model execution,
the OnnxSegmentationProcessor ensures modularity and facilitates the integration of the trained
deep learning model into the mobile system architecture.</p>
        <p>The interaction between the system components is implemented according to the principle of a
unidirectional data flow. An image is first captured at the presentation layer and subsequently passed
through the business logic layer to the data processing layer, where it is transformed into the appropriate
format. The processed image is then analyzed at the machine learning model layer, after which the
results are propagated back through the business logic layer to the presentation layer for visualization
to the user. Data exchange between layers is facilitated through intermediary objects, such as Bitmap
for image representation and Array for numerical data structures.</p>
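        <p>As an illustration of this unidirectional flow, the following Python sketch reproduces the equivalent pipeline on a desktop machine using onnxruntime; the model file name, input resolution, and normalization are assumptions, and on the device this logic is encapsulated in the OnnxSegmentationProcessor component rather than written in Python.</p>
        <preformat>
# Desktop-side sketch of the flow: load image -> preprocess -> ONNX inference -> mask.
# File name, input size and normalization are assumptions, not the production values.
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession("segmentation_model.onnx")           # assumed file name
input_name = session.get_inputs()[0].name

def segment(image_path: str) -> np.ndarray:
    # Data processing layer equivalent: resize and convert to a CHW float array.
    img = Image.open(image_path).convert("RGB").resize((512, 512))  # assumed input size
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = arr.transpose(2, 0, 1)[None, ...]                         # shape (1, 3, H, W)

    # Machine learning model layer equivalent: run the ONNX session.
    logits = session.run(None, {input_name: arr})[0]                # shape (1, C, h, w)

    # The presentation layer would overlay this mask on the original image.
    return logits[0].argmax(axis=0)                                 # per-pixel class ids
        </preformat>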
        <p>Owing to this architectural design, the system exhibits a high degree of modularity, allowing
individual components to be developed and enhanced independently without necessitating substantial
modifications to other parts of the system. Furthermore, the system ensures high performance even
on devices with limited computational resources through the utilization of standard Android platform
components and optimized libraries for image processing and machine learning.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Image Segmentation Model Training</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Dataset preprocessing</title>
          <p>The initial stage of the system development involved the creation and preprocessing of a specialized
dataset. For this purpose, the ADE20K dataset was selected [10]. This dataset comprises a large collection
of images annotated with 150 object classes, where each class is associated with a specific color in the
image mask. For instance, class №1 (wall) corresponds to all pixels with the RGB value (1, 1, 1), class
№2 (building) corresponds to pixels with RGB (2, 2, 2), and so forth.</p>
          <p>A decision was made to reduce the number of classes from 150 to 20, focusing on objects most
commonly encountered in indoor environments. To achieve this, a data structure in the form of a
dictionary was created, grouping the original classes into 20 consolidated categories. Subsequently,
an algorithm was developed to transform input images, which receives the image in its numerical
representation.</p>
          <p>The numerical representation of an image is an array of numbers (three-dimensional for RGB)
that defines the image resolution and color depth. For instance, a 640 × 540 RGB image produces an
array of shape (640, 540, 3), where the last dimension indicates that each pixel contains three values
corresponding to the red, green, and blue channels. These values range from 0 to 255, determining the
intensity of each color channel and enabling the representation of any color.</p>
          <p>The algorithm for reclassifying image masks uses the dictionary to map all original class RGB values
to the corresponding new class values. Due to the reduction in the number of classes and their associated
values approaching zero, the overall color palette visually darkened. For efficient storage and handling of
the new image dataset, Apache Parquet format was selected, which maintains two columns containing
the original images and their corresponding new masks.</p>
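          <p>The reclassification and storage steps described above can be sketched in Python as follows; the concrete grouping of the 150 ADE20K classes into 20 categories and the file names are assumptions, shown here for only a few entries.</p>
          <preformat>
# Sketch of the mask reclassification step. Only a few remap entries are shown,
# and the grouping and file names are assumptions for illustration.
import numpy as np
import pandas as pd
from PIL import Image

# Original ADE20K class id -> consolidated class id (illustrative entries only).
CLASS_REMAP = {1: 1,    # wall     -> wall group
               2: 1,    # building -> wall group
               4: 2}    # floor    -> floor group; the remaining ids follow the same idea

def remap_mask(mask_path: str) -> np.ndarray:
    mask = np.asarray(Image.open(mask_path))        # pixel value equals the class id
    new_mask = np.zeros_like(mask)
    for old_id, new_id in CLASS_REMAP.items():
        new_mask[mask == old_id] = new_id           # values shrink toward zero,
    return new_mask                                 # which visually darkens the palette

# Store image/mask pairs in a two-column Parquet file, as in the prepared dataset.
records = [{"image": open("img_0001.jpg", "rb").read(),
            "mask": remap_mask("mask_0001.png").tobytes()}]
pd.DataFrame(records).to_parquet("dataset.parquet")
          </preformat>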
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Training of the Image Segmentation Model</title>
          <p>After the creation and preparation of the dataset, the subsequent phase in the development of the
system involved training the image segmentation model. Several primary architectures are commonly
employed for image processing, among which U-Net [11], SegNet [12], and DeepLab [13] are most
frequently utilized.</p>
          <p>For this task, the SegFormer [14] architecture was selected due to its demonstrated high efficiency
in semantic segmentation tasks and its suitability as the foundation for our model, which is based on the
NVIDIA MiT-B3 backbone. To facilitate model training, the dataset was partitioned into training, validation,
and test subsets in a 70:15:15 ratio. This allocation ensures an optimal balance between effective learning
and the ability to objectively evaluate the model’s performance.</p>
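          <p>A minimal sketch of this 70:15:15 partitioning is given below, assuming the two-column Parquet file from the previous step and an arbitrary random seed.</p>
          <preformat>
# Sketch of the 70:15:15 split; the file name and random seed are assumptions.
import pandas as pd

df = pd.read_parquet("dataset.parquet").sample(frac=1.0, random_state=42)  # shuffle
n = len(df)
train_df = df.iloc[: int(0.70 * n)]
val_df = df.iloc[int(0.70 * n): int(0.85 * n)]
test_df = df.iloc[int(0.85 * n):]
print(len(train_df), len(val_df), len(test_df))
          </preformat>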
          <p>Since the task of semantic segmentation requires per-pixel classification of images, specific data
transformation functions were implemented. For the training set, data augmentation techniques were
applied, including random horizontal flips, adjustments of brightness, contrast, and saturation, as well
as minor geometric distortions. These techniques enhance the model’s robustness to variations in input
data and help prevent overfitting.</p>
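          <p>One possible formulation of these augmentations, sketched with torchvision, is shown below; the probabilities and jitter ranges are assumptions. Geometric transforms must be applied identically to the corresponding mask, while color jitter affects the image only.</p>
          <preformat>
# Sketch of the training-set augmentations; parameter values are assumptions.
import torchvision.transforms as T

# Color jitter is applied to the image only; flips and affine distortions must be
# applied with the same parameters to the corresponding segmentation mask.
train_augmentations = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                        # random horizontal flips
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomAffine(degrees=5, translate=(0.02, 0.02)),    # minor geometric distortion
])
          </preformat>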
          <p>To optimize the model parameters, the AdamW [15] algorithm was employed, which modifies the
classical Adam optimizer with decoupled weight decay regularization. Training was conducted with
an initial learning rate of 1e-4 and a scheduler that reduced the learning rate as the loss function
stabilized. Analysis of the training and validation processes demonstrates a consistent improvement in
performance metrics throughout the entire training period. The presented graphs visually illustrate the
dynamics of key metrics and training parameters.</p>
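          <p>A minimal sketch of this optimization setup, using the HuggingFace SegFormer implementation initialized from the nvidia/mit-b3 checkpoint, is given below; the weight decay, scheduler parameters, and batch shapes are assumptions.</p>
          <preformat>
# Sketch of the training setup: SegFormer fine-tuned with AdamW and a plateau scheduler.
# Weight decay, scheduler patience and batch shapes are assumptions.
import torch
from transformers import SegformerForSemanticSegmentation

model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b3", num_labels=20, ignore_mismatched_sizes=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# Reduce the learning rate once the validation loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

def training_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    # pixel_values: (B, 3, H, W); labels: (B, H, W) with the 20 consolidated class ids.
    outputs = model(pixel_values=pixel_values, labels=labels)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
          </preformat>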
          <p>The training loss curve (train/loss) exhibits a rapid initial decrease from approximately 3.0 to 1.5
during the first iterations, indicating a swift adaptation of the model to the data. Subsequently, a gradual
reduction in loss is observed, reaching values around 0.8, which reflects a consistent improvement
in segmentation quality. The final loss value stabilized at approximately 0.66. The learning rate
(train/learning_rate) was initialized at 0.0005 and gradually decreased to 0.0002 during the final stages of
training. This strategy enabled the model to efficiently converge to optimal parameters while avoiding
oscillations throughout the training process.</p>
          <p>The validation results demonstrate a consistent improvement in the model’s performance. The loss
function on the validation set (eval/loss) decreased from initial values of approximately 1.1 to a final
value of 0.7881, indicating good model generalization. Analysis of the curves shows that the model
training proceeded steadily, with gradual improvements across all key metrics. The model achieved a
balance between minimizing errors on the training set and maintaining generalization capability, as
evidenced by the decreasing loss values during both training and validation phases. The final model
exhibits consistently low loss values (train/loss = 0.66, eval/loss = 0.7881), reflecting a high quality of
semantic segmentation.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. System Functionality Description</title>
        <p>The system’s functionality is structured around three primary use-case scenarios, tailored to address
the needs of the target user group. It enables the capture of environmental images through two main
methods. First, the user can directly take a photograph using the device camera by selecting the “Capture
Image” button within the application interface. During this process, the system automatically requests
the necessary camera permissions if they have not yet been granted. Alternatively, the user may utilize
the “Select Image” button to choose a previously stored image from the device gallery. This functionality
is particularly advantageous for pre-examining specific locations prior to navigating them in real time.</p>
        <p>Once an image is captured, the system automatically processes it to identify safe walkable surfaces.
The system classifies various elements within the image, such as floors and other navigable areas, using
a neural network. Users with partial visual impairment can quickly discern safe zones for movement by
identifying the clearly marked semi-transparent green areas on the processed image. Following image
analysis, the system provides textual guidance indicating how to safely reach the farthest point within
the scene.</p>
        <p>The system enables users to compare the highlighted safe zones with the actual surrounding
environment by simultaneously displaying both the original and processed images on the device screen.
This feature is particularly valuable for users with partial visual impairment, as well as for assisting
personnel who may help fully blind users interpret the results effectively.</p>
        <p>During image analysis, the system displays a progress indicator to inform the user of the ongoing
processing status. In addition to the generated mask, detailed information derived from the analysis is
presented to allow for more precise adjustment and understanding of the application’s performance.
Displayed parameters include the resolution of both the original and processed images, the shape of the
input array, and the time required for processing, which can be used for statistical evaluation. This
information enables users to assess the accuracy and reliability of the analysis results.</p>
        <p>A key feature of the application is its capability to operate autonomously, analyzing images without
requiring an internet connection. This ensures user data confidentiality and enhances usability in
various environments, which is particularly critical for individuals with visual impairments who may
navigate diverse and unpredictable conditions.</p>
        <p>The system provides clear warnings and actionable recommendations to address issues that may
arise during its use. For instance, if camera access permission is denied, the system notifies the user
that granting permission is required to proceed. Similarly, if image processing fails, the system alerts
the user to the issue and suggests attempting an alternative image. This proactive feedback mechanism
enhances usability and ensures that users can effectively navigate potential obstacles while interacting
with the application.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Demonstration of Concept Feasibility</title>
      <p>Figure 6 illustrates the system’s performance in typical indoor environments. As shown in the presented
images, the system successfully segments the floor (highlighted in bright green) across various scenarios.
The upper pair of images demonstrates segmentation results where the floor is clearly distinguished
from other indoor objects, including furniture and walls. The lower images depict the original scenes
without the overlay of the segmentation mask.</p>
      <p>When deployed on the mobile device, the model exhibits sharper edges at the junctions between
different object classes, which is attributed to weight compression during the model transfer process. The
figure also shows a comparison of the original image and the segmentation mask highlighting the safe
surface for navigation, alongside directional instructions generated by the system, such as: “Move 0.5
meters to the right and walk straight 4.17 meters.” This instruction is the output of the find_optimal_path
algorithm. During testing on a dataset of 50 images from various indoor environments, the system
achieved an Intersection over Union (IoU) metric of approximately 70%, indicating a robust segmentation
accuracy.</p>
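      <p>For reference, the per-class Intersection over Union used in this evaluation can be computed as in the following sketch; the floor class id is an assumption.</p>
      <preformat>
# Sketch of the IoU metric between a predicted and a ground-truth mask for one class.
import numpy as np

def iou(pred_mask: np.ndarray, true_mask: np.ndarray, class_id: int = 2) -> float:
    pred = pred_mask == class_id                      # hypothetical floor class id
    true = true_mask == class_id
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float(intersection) / union if union else 1.0   # empty masks count as a match
      </preformat>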
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>As a result of this work, a mobile application has been developed that leverages computer vision
technologies to assist visually impaired individuals in safely navigating indoor environments without
the use of a white cane.</p>
      <p>A key achievement of this study is the development of a surface recognition model for safe navigation,
achieving an approximate accuracy of 70%. In addition to the segmentation model, an algorithm for
distance estimation within images was implemented, allowing users to approximate the distance to the
farthest point along the intended path of movement.</p>
      <p>Testing the application in real indoor environments demonstrated high accuracy in identifying safe
pathways: over 90% in scenarios without small obstacles and over 70% when obstacles were present.
The system operates autonomously, processing images directly on the mobile device, thereby ensuring
data privacy and convenient use in a variety of conditions.</p>
      <p>However, the current implementation exhibits several limitations:
1. The application does not support voice commands, which significantly complicates user
interaction.
2. The system lacks the functionality to provide auditory instructions for safe navigation trajectories,
further limiting ease of use.
3. Users are required to manually capture images of the environment with the camera at intervals; a
more practical solution would involve real-time image processing to enable continuous navigation.
Addressing these limitations constitutes a primary direction for future research in this domain.</p>
      <p>It is also important to outline the following steps aimed at improving the performance of the
application and enhancing the overall quality of the research:
1. Limited test dataset - the current test set is relatively small and does not include class-based
metrics or confidence/uncertainty estimations. Future research will focus on expanding the test
dataset and implementing classification of partially visible objects.
2. Distance estimation improvement - it is necessary to investigate more accurate methods for
determining distances between objects, as the current approach, which relies on the standard
human height and device tilt angle, remains rather approximate.
3. Field testing and user evaluation - after the development of the MVP version of the application, it
is planned to publish it for real-world testing with blind and visually impaired participants to
assess usability, safety, and task performance.</p>
      <p>It is also important to note that, as the application has been developed solely at the proof-of-concept
stage, it has not yet been tested with visually impaired individuals. Conducting such user trials represents
a critical priority for subsequent studies to assess both usability and effectiveness.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Blindness and vision impairment</article-title>
          , World Health Organization (
          <year>2023</year>
          ). URL: https://www.who.int/ news-room/fact-sheets/detail/blindness-and
          <article-title>-visual-impairment.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z. R.</given-names>
            <surname>Kahaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taherian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Simi</surname>
          </string-name>
          ,
          <article-title>Development and validation of a white cane use perceived advantages and disadvantages (wcpad) questionnaire</article-title>
          ,
          <source>BMC Psychol</source>
          .
          <volume>11</volume>
          (
          <issue>1</issue>
          ) (
          <year>2023</year>
          ).
          <source>doi:10.1186/s40359-023-01282-4.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Agnew</surname>
          </string-name>
          , M. Cheng, Computer-vision research powers surveillance technology.,
          <source>Nature</source>
          (
          <year>2025</year>
          )
          <fpage>73</fpage>
          -
          <lpage>79</lpage>
          . doi:
          <volume>10</volume>
          .1038/s41586-025-08972-6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[4] Screen Reader User Survey 10 Results</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>[5] Helping blind and low vision people</article-title>
          , Be My Eyes,
          <year>2025</year>
          . Available: https://www.bemyeyes.com/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Seeing</surname>
            <given-names>AI</given-names>
          </string-name>
          , Microsoft,
          <year>2025</year>
          . Available: https://www.microsoft.com/en-us/ai/seeing-ai.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ergu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A review of yolo algorithm developments</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>199</volume>
          (
          <year>2022</year>
          )
          <fpage>1066</fpage>
          -
          <lpage>1073</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/ S1877050922001363. doi:https://doi.org/10.1016/j.procs.
          <year>2022</year>
          .
          <volume>01</volume>
          .135,
          <source>the 8th International Conference on Information Technology and Quantitative Management (ITQM 2020</source>
          <year>2021</year>
          )
          <article-title>: Developing Global Digital Economy after COVID-19.</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>