Street Navigation for Visual Impairment using CNN and Transformer Models

Hasan Ali¹, Faithful Chiagoziem Onwuegbuche¹,²
¹ National College of Ireland, Ireland.
² SFI Centre for Research Training in Machine Learning (ML-Labs), University College Dublin, Ireland.

Abstract
This paper addresses the challenge of street navigation for individuals with visual impairments and explores the potential of Artificial Intelligence (AI) to enhance navigation safety and effectiveness. We evaluate the performance of state-of-the-art Computer Vision Object Detection models, focusing on accuracy and speed. The central question is whether Transformer-based Object Detection models outperform other models. We use the specialized dataset "Walking On The Road", adapted to include only relevant classes, to compare deep learning and Transformer models in pre-trained and fine-tuned states. Metrics used include Mean Average Precision (mAP) for accuracy and Average Inference Time in milliseconds for speed. Our results show that YOLO models surpass Transformer-based models in both accuracy and speed. In Phase 1, YOLOv8x achieved the highest mAP of 0.399 with an average inference time of 14 ms, while the Transformer-based DETR had a lower mAP of 0.344 and a significantly longer inference time of 818.2 ms. In Phase 2, after fine-tuning, YOLOv8x again outperformed with an mAP of 0.471 compared to DETR's 0.323. These findings indicate that YOLO models are more effective for street navigation applications, providing superior accuracy and speed for visually impaired individuals.

Keywords
Artificial Intelligence, Vision Impairment, Blindness, Computer Vision, Object Detection

1. Introduction
This paper aims to solve the problem of street navigation in the context of visual impairment. Street navigation is an important activity for visually impaired people, as it substantially contributes to improved public health and lower rates of chronic disease [1]. We identify here a significant opportunity to use Artificial Intelligence (AI) to improve the quality of life of visually impaired people. We define street navigation as walking outdoors along a street on foot, where the individual can navigate freely and identify challenges on the street to avoid.

Visual impairment represents a significant global health issue with considerable prevalence rates. According to a study conducted by Flaxman et al. [2], as of 2015, approximately 36 million individuals were affected by blindness, while an additional 217 million people experienced moderate to severe visual impairment. The global prevalence of visual impairment, including blindness, was estimated at 0.49% of the total population, with moderate to severe visual impairment affecting 2.9% of the global population. Visual impairment restricts the ability to engage in activities such as walking on the street. Furthermore, individuals with blindness encounter significant challenges in securing employment, often resulting in lower income levels and heightened poverty rates, which can have adverse societal implications, impact education and social advancement, and lead to reduced quality of life. These are the challenges this paper aims to address using Artificial Intelligence.

To implement an AI-based solution, it is essential to identify the most appropriate model or architecture for this specific application. Previous research has explored suitable models for street navigation, with a focus on CNN-based models like YOLO and MobileNetv2.
Other studies have examined Transformer-based models, though not within the context of street navigation for individuals with visual impairments. This paper seeks to address this gap by conducting a comprehensive comparison of Transformer-based and CNN-based models for street navigation applications designed to assist visually impaired individuals.

The key contributions of this paper are as follows:
1. Evaluate whether object detection models that utilize the Transformer architecture can be more effective (in accuracy and speed) than other state-of-the-art models in the context of street navigation for visually impaired people.
2. Introduce an enhanced version of the Walking On the Road dataset (WOTRv2), specifically designed for object detection in street navigation applications for visual impairment.
3. Develop and implement a comprehensive methodology for the evaluation of object detection models, focusing on street navigation for visually impaired individuals.

It is crucial to have a fast model to ensure that the feedback provided is timely. If the model detects and identifies objects with a delay, the information becomes less useful or even irrelevant. Therefore, we evaluated the performance of the selected models (discussed in the related work) in two main areas: accuracy and speed. In terms of accuracy, the primary focus of this paper is to assess each model's ability to detect ground-truth labels accurately using Mean Average Precision (mAP), as it is vital to detect objects correctly rather than miss them, making precision more critical than recall in this context. We also measured the speed of each model as its inference time in milliseconds (ms). This measurement allowed us to ascertain whether the models can provide effective real-time feedback.

2. Related Work
Navigation aids based on object detection models have been studied in the past as part of computer vision, and applications in this area have increased with the advent of autonomous driving. Moreover, this area has the potential to help visually impaired people navigate while walking on the street. This literature review examines recent applications of object detection for aiding visually impaired people, as well as recent advancements in object detection models and their architectures.

2.1. Visual Impairment
Visual impairment, whether it manifests as blindness or as moderate to severe impairment, substantially impacts individuals' quality of life [3]. The economic impact associated with visual impairment is multifaceted, encompassing both direct costs, such as medical expenses, and indirect costs, including loss of productivity [2]. Another study shows that the prevalence of vision impairment is concentrated among older demographics and is expected to increase due to global aging [4].
While these studies provide a comprehensive overview of the global impact of visual impairment and suggest various mitigation strategies, they lack a direct connection to the technological solutions that are the focus of this review. Specifically, there is no discussion of how different technological interventions, particularly modern machine learning models, compare in addressing the needs of visually impaired individuals. Moreover, the lack of focus on how these technologies are integrated into real-world applications creates a gap that this paper aims to fill, especially in comparing newer models like Transformers with state-of-the-art methods.

2.2. Object Detection as Aid for Visually Impaired People
Object detection systems have been employed to assist visually impaired people in navigating their environment, especially while walking, using various technologies such as sensors or smartphones that provide feedback through audio or tactile signals [5]. Other methods, including infrared, laser, GPS, or RFID-based technologies, have limitations in recognizing specific objects or providing detailed information [6]. Recent advancements in computer vision, particularly with the introduction of fuzzy logic and uncertainty-aware approaches, have improved these systems' ability to handle complex, real-time scenarios [7]. This section explores previous research on deploying assistive technologies for visually impaired people using machine learning.

Masud et al. [8] developed a smart assistive system using a Raspberry Pi 4B, camera, ultrasonic sensors, and an Arduino. The system employed the Viola-Jones algorithm for face detection and TensorFlow's Object Detection API trained on the COCO 2017 dataset. While it achieved 91% accuracy and enhanced user mobility and safety, it faced challenges like low-light conditions and out-of-frame objects. Although effective initially, the Viola-Jones algorithm is now outdated, and modern machine learning models could improve performance. Islam et al. [9] found that MobileNetv2 combined with SSDLite was the most effective for real-life applications on embedded systems like the Raspberry Pi, offering the best trade-off between accuracy and computational power. In contrast, more accurate models like YOLOv4 and EfficientDet-D3 required more resources, making them less suitable for embedded devices. Acar et al. [10] proposed "SIGHT", a mobile application using YOLOv8 for object detection and MiDaS for depth estimation, enabling real-time navigation on smartphones. YOLOv8 was chosen for its processing speed and efficiency, achieving 0.547 mAP with a 228 ms inference time on a mid-range smartphone, demonstrating successful implementation in a use case similar to ours. In another study, Atitallah et al. [11] utilized the YOLOv5 model with enhancements like a CSPNet backbone and improved data augmentation, achieving 0.8102 mAP at 89 FPS after compression.

These studies predominantly focus on CNN-based models like YOLO and MobileNetv2, overlooking the rapidly emerging Transformer-based models. Our work addresses this gap by comparing these CNN-based models with Transformer-based approaches, particularly for street navigation for visually impaired individuals.

2.3. Transformer Architecture in Object Detection
The advancement of assistive technologies for visually impaired individuals has accelerated. A notable breakthrough is the implementation of the Transformer architecture.
Introduced in "Attention is All You Need" by Vaswani et al. [12], the Transformer architecture represented a shift in sequence processing. It relies entirely on the attention mechanism, eliminating recurrence and convolutions to improve efficiency. The architecture includes key components such as self-attention, multi-head attention, positional encoding, and a feed-forward network, which together enhance the ability to process sequences effectively.

Building on this, Carion et al. [13] proposed the Detection Transformer (DETR) model, which introduces significant innovations by simplifying object detection into a direct set prediction task, eliminating traditional steps like non-max suppression and anchor generation. DETR uses a Transformer encoder-decoder architecture to capture global context and relationships within images, improving accuracy. DETR also incorporates features like bipartite matching loss and parallel processing, achieving competitive performance, particularly on large objects. Deformable DETR (D-DETR), introduced by Zhu et al. [14], addresses challenges such as slow convergence and small object detection by using sparse sampling and a two-stage detector for improved precision. D-DETR outperforms both DETR and Faster R-CNN, particularly on the COCO 2017 dataset. Real-Time DETR (RT-DETR), introduced by Zhao et al. [15], is an evolution of the DETR architecture designed for real-time object detection, crucial for applications like street navigation for visually impaired individuals. RT-DETR features a hybrid encoder, High-Quality Initial Queries, and flexible speed tuning, eliminating non-max suppression to reduce inference time. It achieves superior accuracy and speed, outperforming YOLOv5, v6, and v7.

These studies, while thorough in detailing Transformer-based models, lack a direct comparison with state-of-the-art non-Transformer models like YOLO in aiding visually impaired individuals. This paper aims to fill that gap by comparing these models in real-world applications.

2.4. YOLO Evolution and Architecture
The YOLO (You Only Look Once) model, introduced by Redmon et al. [16], marked a significant advancement in object detection through the use of convolutional neural networks (CNNs) for feature extraction. Unlike earlier models like R-CNN, which relied on multi-stage processes, YOLO simplifies detection with a single network, treating object detection as a single regression problem that predicts bounding boxes in one evaluation, leading to faster performance. YOLO employs a grid-based prediction system where each grid cell predicts a bounding box and confidence score. However, YOLO has limitations, particularly in localizing smaller objects and handling multiple objects in close proximity. YOLOv2, or YOLO9000, introduced in 2017, improved real-time detection across 9,000 categories with enhancements like a high-resolution classifier, anchor boxes, and multi-scale training. It increased detection speed through direct location prediction and batch normalization, achieving a 0.768 mAP, surpassing Faster R-CNN with ResNet, which achieved 0.764, and operating at 67 FPS compared to R-CNN's 5 FPS [17]. YOLO-World extended YOLO's capabilities by supporting open-vocabulary detection, leveraging vision-language pre-training, particularly the RepVL-PAN technique, to enhance the interaction between visual and linguistic data. This model, which uses the pre-trained CLIP text encoder, enabled detection beyond predefined categories by integrating vision-language modeling.
YOLO-World achieved a zero-shot average precision of 0.354 on the LVIS dataset at 52 FPS, outperforming models like DETCLIP-T [18]. Terven et al. [19] discuss YOLOv8, which introduced several enhancements, including a C2f module and an anchor-free model, along with a new loss function (Complete IoU), collectively improving accuracy and speed. These features make YOLOv8 a strong candidate for applications like street navigation for the visually impaired, representing the current state of the art in object detection. Several applications have successfully implemented YOLO models in assistive systems for the visually impaired, including YOLOv7, PC-YOLO, and YOLOv8. Alsultan and Mohammad [20] highlight YOLOv7's practical benefits in enhancing environmental interaction and independence. Xia et al. [21] proposed PC-YOLO, designed specifically for visual impairment, achieving better average precision with a 0.6% improvement over YOLOv7.

Despite extensive documentation of YOLO's evolution, there is a lack of comparison with Transformer-based models, which are becoming important benchmarks in object detection. This gap limits the understanding of how YOLO compares to these models, particularly in contexts like real-time street navigation for the visually impaired. Our paper aims to address this by evaluating and comparing the effectiveness of Transformer-based models against other state-of-the-art models, focusing on their applicability in visual impairment scenarios.

3. Methodology
In this section, we outline the methodology followed and provide reasoning behind the choices made. This includes the models, dataset, and processes used to conduct the research.

3.1. Models
The models evaluated were YOLO-based and DETR-based. The goal of this paper is to evaluate the effectiveness and efficiency of the Transformer architecture in comparison with other state-of-the-art models. Below are the DETR-based models implemented, both of which were pre-trained on the COCO 2017 dataset and follow the COCO labeling format when fine-tuned:
1. DETR: This model uses the Transformer and attention-based architecture and, when released, showed promising results; hence, we wanted to include it in this work and evaluate its performance. We utilized the ResNet-50 backbone.
2. RT-DETR: The Real-Time Detection Transformer (RT-DETR) is an enhanced implementation of the DETR model, built for increased speed and accuracy. We utilized the ResNet-50 backbone.

Below are the YOLO-based models implemented, both of which are pre-trained on the COCO 2017 dataset, follow the YOLO labeling format, and are maintained by Ultralytics [22], a library focused on computer vision models:
1. YOLOv8: This model was chosen because it is the state-of-the-art model in the YOLO family. Several sizes of the YOLOv8 model were implemented: nano, small, medium, large, and x-large.
2. YOLOv8-Worldv2: This variant, which employs vision-language modeling for Open-Vocabulary Detection tasks, was also tested. The sizes implemented were small, medium, large, and x-large.

3.2. Dataset
We used two main datasets to conduct the experiments:
1. COCO 2017: Contains 80 common classes and follows the COCO labeling format. The models we utilized were already pre-trained on this dataset; any further training builds on these pre-trained weights rather than replacing them.
2. WOTR (Walking On The Road): This dataset was created by Xia et al. [21], whose work also focused on vision impairment. Our paper relies heavily on Xia et al. [21], and especially on their dataset.
The WOTR dataset contains both COCO and non-COCO classes which are relevant to street navigation for visually impaired people. We split the classes in this dataset into two groups: 1) Phase 1: classes which exist in COCO, and 2) Phase 2: classes which do not exist in COCO. The classes in each phase are listed in Table 1. All classes were labeled in the ground truth with axis-aligned bounding boxes using the PASCAL-VOC labeling format.

Table 1
Classes of the WOTR dataset

Phase 1 Classes: Person, Bicycle, Bus, Truck, Car, Motorcycle, Fire Hydrant, Dog
Phase 2 Classes: Tree, Reflective Cone, Crosswalk, Blind Road, Pole, Warning Column, Roadblock, Litter Bin, Signs

3.3. Pre-Processing
Due to the various datasets and formats utilized, data pre-processing was an extensive task. To use the WOTR dataset, we first had to pre-process it so that it contained accurate and relevant data for our purpose. The dataset originally came in PASCAL-VOC format; therefore, we did most of the pre-processing on the PASCAL-VOC format directly and then exported it to the desired format based on the model architecture (different models natively support different formats).

3.3.1. Formatting
1. Images: All images were in JPG format. The image sizes were not uniform and varied.
2. Labels: The dataset came in PASCAL-VOC format and, after pre-processing, was exported to YOLO format. We opted to use the YOLO labeling format as the default because it is well supported and simple to use; moreover, it is the native format for YOLO models. When necessary, we also used other labeling formats, such as the COCO format.

We used the YOLO label format for several key operations: exporting predictions for all YOLO models, exporting predictions during Phase 1 for the DETR/RT-DETR models, and evaluating model metrics across all models by comparing ground-truth labels with predictions. Additionally, for Phase 2 training of the DETR/RT-DETR models, we used the COCO label format to import ground-truth labels so that the models could be fine-tuned to produce custom weights.

3.3.2. Data Preparation
To prepare our dataset, we followed several key steps to ensure its quality and suitability for our research. We began with an audit, manually assessing the accuracy of the ground-truth labels to meet our standards. We next standardized and refined the baseline dataset, specifically by removing the "sign" class, which we identified as unreliable and irrelevant for our use case. The original WOTR dataset contained both stop signs and directional signs within this class, which presented issues. This grouping did not align with the COCO dataset's class definitions, as COCO has a separate "stop sign" class that should not include directional signs. Additionally, the "sign" class offered minimal utility for our target users, further justifying its removal. As a result, we reviewed our PASCAL-VOC labels, removed the "sign" class, and discarded any empty labels along with their corresponding images. We then consolidated certain classes within the WOTR dataset, such as merging "red light" and "green light" into the "traffic light" class, and "tricycle" into "bicycle", to enhance detection and create more meaningful training data. Additionally, we renamed several classes to ensure consistency with the COCO dataset, aligning names like "fire_hydrant" with the standard "fire hydrant". We performed these operations to ensure consistency with the COCO class definitions and to ensure each class has a meaningful number of instances for training the models.
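The class consolidation and renaming described above can be illustrated with a minimal sketch that operates directly on the PASCAL-VOC XML annotations. The folder path, the mapping dictionary, and the helper name clean_voc_file are illustrative assumptions, not the exact scripts used in this work.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Illustrative mapping: merge and rename WOTR classes for COCO consistency.
CLASS_MAP = {
    "red light": "traffic light",
    "green light": "traffic light",
    "tricycle": "bicycle",
    "fire_hydrant": "fire hydrant",
}
DROP_CLASSES = {"sign"}  # unreliable/irrelevant class removed from the dataset


def clean_voc_file(xml_path: Path) -> bool:
    """Merge/rename classes in one PASCAL-VOC file; return False if no objects remain."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    for obj in list(root.findall("object")):
        name = obj.find("name").text.strip()
        if name in DROP_CLASSES:
            root.remove(obj)                             # drop the unwanted annotation
        elif name in CLASS_MAP:
            obj.find("name").text = CLASS_MAP[name]      # consolidate/rename the class
    if root.findall("object"):
        tree.write(xml_path)
        return True
    return False  # caller can delete the now-empty label and its image


if __name__ == "__main__":
    for xml_file in Path("WOTR/Annotations").glob("*.xml"):  # hypothetical folder layout
        if not clean_voc_file(xml_file):
            print(f"{xml_file.name}: no objects left after cleaning")
```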
For the phase-based datasets, we tailored each phase to include only relevant classes. Starting with the Phase 2 dataset (we start with Phase 2 because it is the larger of the two), we identified and retained the necessary classes, then derived the Phase 1 dataset by removing out-of-scope classes. Phase 1 uses the "WOTR_v2_Phase1" dataset while Phase 2 uses the "WOTR_v2_Phase2" dataset. The datasets were then split into Training, Validation, and Testing collections using an 80/10/10 ratio, with scikit-learn [23] ensuring randomness and bias reduction. Finally, we converted the labels from PASCAL-VOC to YOLO format, chosen for its native compatibility with the YOLO models and its ease of conversion to other formats. The DETR architecture uses the COCO format natively, which is handled dynamically at a later stage described in the following sections. These steps ensured our datasets were ready for predictions, fine-tuning, and performance evaluation.

3.3.3. Model Prediction and Training
When evaluating the models, we ensured consistency and level-setting by utilizing the same hyper-parameters across all experiments. For prediction, the confidence threshold was set at 0.5, and the IoU (Intersection over Union) threshold for Non-Maximum Suppression was set at 0.8. During training, the models were trained for 20 epochs with a batch size of 10. The learning rate was fixed at 0.0001, and we used the AdamW optimizer with a weight decay of 0.0001 to prevent overfitting. For metrics calculation, the IoU threshold was consistently maintained at 0.8 to ensure uniformity in the evaluation process.
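As an illustration only, the sketch below shows how these shared hyper-parameters map onto the Ultralytics Python API for YOLOv8x and, assuming the Ultralytics RTDETR class was used, for RT-DETR. The weight files, dataset paths, and data YAML name are placeholders; the original DETR model, which was handled through COCO-format labels, is not shown here.

```python
from ultralytics import YOLO, RTDETR

# Shared hyper-parameters reported in Section 3.3.3.
PREDICT_ARGS = dict(conf=0.5, iou=0.8, save_txt=True, save_conf=True)
TRAIN_ARGS = dict(epochs=20, batch=10, optimizer="AdamW", lr0=1e-4, weight_decay=1e-4)

# Phase 1: run the COCO pre-trained weights on the Phase 1 test images (hypothetical paths).
yolo = YOLO("yolov8x.pt")
yolo.predict(source="WOTR_v2_Phase1/test/images", **PREDICT_ARGS)

rtdetr = RTDETR("rtdetr-l.pt")  # RT-DETR through the same interface; weight file is a placeholder
rtdetr.predict(source="WOTR_v2_Phase1/test/images", **PREDICT_ARGS)

# Phase 2: fine-tune on the custom classes, then predict again.
yolo.train(data="wotr_v2_phase2.yaml", **TRAIN_ARGS)  # hypothetical dataset YAML
yolo.predict(source="WOTR_v2_Phase2/test/images", **PREDICT_ARGS)
```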
3.3.4. Post-Processing
After performing DETR-based predictions in Phase 1, it was necessary to perform post-processing to align the class numbers with the original COCO mapping. This involved converting the class names, which were output as names rather than numbers, into the corresponding class numbers in the YOLO label files (e.g., converting 'person' to '0'). Similarly, in Phase 2, post-processing was required not only to ensure that the class numbers matched the original COCO mapping but also to align the filename of each label with the original filename. This involved two steps: first, correcting the class numbers to match the original class numbers in the map file so that the evaluation script correctly interprets the data; and second, correcting the filenames from the raw output to the original filenames, which is crucial because the evaluation script compares two label files with the same name.

3.4. Evaluation Methodology and Metrics
After each model completed its prediction process, the results were exported in YOLO label format and, for the DETR labels, post-processed; a script was then run to evaluate performance and calculate the relevant metrics. The evaluation methodology is as follows:
1. Data Parsing and Preparation: parsing YOLO .txt files to extract ground-truth and predicted bounding boxes, and converting the YOLO box format to (x1, y1, x2, y2) corners.
2. IoU Calculation: computing the IoU for pairs of ground-truth and predicted boxes.
3. Metric Computation: computing Precision, Recall, and F1 Score based on IoU values, calculating Average Precision (AP) for each class across various confidence thresholds, and computing Mean Average Precision (mAP) across multiple IoU thresholds.
4. Class-wise Metrics and Confusion Matrix: computing True Positives (TP), False Positives (FP), and False Negatives (FN) for each class, constructing the confusion matrix, and computing the False Negative Rate (FNR).

The following metrics are calculated:
1. Evaluation Metrics
   a) Accuracy
      i. Mean Average Precision (mAP) at various IoU thresholds:
         $\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP}_i$ (1)
         where $\mathrm{AP}_i$ is the Average Precision (AP) at the i-th IoU threshold and $n$ is the total number of IoU thresholds considered.
      ii. Precision and Recall:
         $\mathrm{Precision} = \frac{TP}{TP + FP}$ (2)
         $\mathrm{Recall} = \frac{TP}{TP + FN}$ (3)
      iii. False Negative Rate (FNR):
         $\mathrm{FNR} = \frac{FN}{FN + TP}$ (4)
      iv. Counts of TP, TN, FP, and FN.
   b) Speed
      i. Average Inference Time:
         $\text{Avg Inference Time} = \frac{1}{n} \sum_{i=1}^{n} t_i$ (5)
         where $t_i$ is the inference time for the i-th sample and $n$ is the total number of inference samples considered.
2. Secondary Metrics
   a) Intersection over Union (IoU): measures the overlap between two bounding boxes:
      $\mathrm{IOU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$ (6)

Table 2
Summary of mAP and inference time for Phase 1 and Phase 2 (inference time was not measured in Phase 2)

Phase    Architecture  Model            Params (M)  Average mAP  Average Inference Time (ms)
Phase 1  DETR          DETR-Original    41.6        0.3449       818.2
Phase 1  DETR          RTDETR           41.6        0.3801       560.2
Phase 1  YOLO          yolov8l          43.7        0.3954       13.9
Phase 1  YOLO          yolov8m          25.9        0.3771       11.7
Phase 1  YOLO          yolov8n          3.2         0.2951       9.4
Phase 1  YOLO          yolov8s          11.2        0.3412       9.6
Phase 1  YOLO          yolov8x          68.2        0.3997       14.0
Phase 1  YOLO-WORLD    yolov8l-worldv2  43.7        0.3675       18.3
Phase 1  YOLO-WORLD    yolov8m-worldv2  25.9        0.3469       16.7
Phase 1  YOLO-WORLD    yolov8s-worldv2  11.2        0.3033       15.0
Phase 1  YOLO-WORLD    yolov8x-worldv2  68.2        0.3728       19.3
Phase 2  DETR          DETR-Original    41.6        0.3236       –
Phase 2  DETR          RTDETR           41.6        0.4387       –
Phase 2  YOLO          yolov8x          68.2        0.4710       –
Phase 2  YOLO-WORLD    yolov8x-worldv2  68.2        0.4681       –

Figure 1: Phase 1 results, including mAP and speed performance.

Please note that Phase 1 predictions must be benchmarked against the Phase 1 ground-truth labels, and likewise for Phase 2.
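The core of this evaluation script can be sketched as follows: parsing YOLO-format labels, converting boxes to corner format, computing IoU, and greedily matching predictions to ground truth at the fixed IoU threshold of 0.8 to obtain TP/FP/FN counts. The folder layout and helper names are illustrative, and the full AP/mAP computation over confidence and IoU thresholds follows the same pattern.

```python
from pathlib import Path


def yolo_to_corners(box):
    """Convert a normalised YOLO (cx, cy, w, h) box to (x1, y1, x2, y2) corners."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(a, b):
    """Intersection over Union of two corner-format boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def load_yolo_labels(path: Path):
    """Parse a YOLO .txt file into (class_id, corner_box) tuples (trailing confidence ignored)."""
    boxes = []
    for line in path.read_text().splitlines():
        parts = line.split()
        boxes.append((int(parts[0]), yolo_to_corners(tuple(map(float, parts[1:5])))))
    return boxes


def match_image(gt, pred, iou_thr=0.8):
    """Greedy per-class matching for one image; returns (TP, FP, FN)."""
    tp, matched = 0, set()
    for p_cls, p_box in pred:
        best, best_j = 0.0, None
        for j, (g_cls, g_box) in enumerate(gt):
            if j in matched or g_cls != p_cls:
                continue
            score = iou(p_box, g_box)
            if score > best:
                best, best_j = score, j
        if best >= iou_thr:
            tp += 1
            matched.add(best_j)
    return tp, len(pred) - tp, len(gt) - tp


if __name__ == "__main__":
    # Hypothetical layout: one .txt per image for ground truth and for predictions.
    totals = [0, 0, 0]
    for gt_file in Path("labels/ground_truth").glob("*.txt"):
        pred_file = Path("labels/predictions") / gt_file.name
        gt = load_yolo_labels(gt_file)
        pred = load_yolo_labels(pred_file) if pred_file.exists() else []
        for i, v in enumerate(match_image(gt, pred)):
            totals[i] += v
    tp, fp, fn = totals
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"Precision={precision:.3f}  Recall={recall:.3f}  FNR={1 - recall:.3f}")
```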
4. Evaluation

4.1. Phase 1 Analysis
In Phase 1, we evaluate the models in their pre-trained state, measuring both accuracy and speed. Based on the experiment results, YOLOv8x emerged as the top performer, achieving an mAP of 0.399. Several factors contributed to this result. Firstly, YOLOv8's architecture is designed for maximum accuracy and speed, utilizing a single network to predict objects and bounding boxes, which minimizes computational requirements. Additionally, YOLOv8 employs advanced techniques such as anchor-free detection, streamlining the prediction process. Furthermore, YOLOv8x, being the largest model in the YOLOv8 family with 68.2 million parameters, benefits from extensive weight training, leading to superior performance in predictions.

In comparison, YOLOv8-Worldv2 ranked second in performance, though with reduced overall accuracy. Notably, the anticipated benefits of its Open-Vocabulary detection architecture did not materialize in the results, indicating that this approach did not significantly enhance performance in this context.

On the other hand, DETR-Original performed the worst in terms of both accuracy and speed, with an mAP of 0.34494 and an average inference time of 818.2 ms. Several factors contributed to this outcome. The architectural differences, particularly the use of the computationally intensive Transformer-based architecture, may make it less well-suited for real-time detection use cases like street navigation. Additionally, DETR utilized the ResNet-50 backbone, which, while effective, is smaller than the backbones of other models evaluated, potentially limiting its performance. However, Real-Time DETR (RT-DETR) outperformed the original DETR in both accuracy and inference time, achieving an mAP of 0.38005 and a reduced inference time of 560.2 ms. This improvement can be attributed to several architectural enhancements: RT-DETR incorporates parallel decoding, which significantly improves inference time, and its optimized DETR architecture reduces computational overhead. Moreover, RT-DETR includes an enhanced attention mechanism and benefits from hardware acceleration, allowing better utilization of modern GPUs and thus superior performance.

The reasons DETR performed worse than the YOLO models could be attributed to its architecture, which relies on a complex two-stage process. This process can be beneficial for capturing global context and relationships, but it requires considerable computational power. Moreover, DETR's set prediction mechanism might struggle with precise localization, especially in images with densely packed objects.

4.2. Phase 2 Analysis
As a next step, we took the best-performing models from Phase 1 and conducted the Phase 2 experiments with them; this included training and fine-tuning the models and performing predictions once again. In this phase we measured only mAP. In Phase 2, where the models were fine-tuned on a custom dataset containing street-navigation-specific classes not present in the COCO dataset, such as roadblocks and tactile pavement, YOLOv8x emerged as the top performer in terms of accuracy, achieving a Mean Average Precision (mAP) of 0.471. YOLOv8-Worldv2 came in second place, with an mAP of 0.468. Real-Time DETR ranked third, with an mAP of 0.439, an improvement of 5.86 percentage points over Phase 1, indicating better performance after fine-tuning on the custom dataset.

Figure 2: Phase 2 results comparing mAP performance to model size (parameters).

In contrast, DETR-Original performed the worst in terms of accuracy, with its performance even degrading from Phase 1. This degradation could be attributed to catastrophic interference, or catastrophic forgetting, which can occur when a model adapts to new tasks or datasets during training. It can happen when a model is first trained on a large dataset and then fine-tuned on a smaller one, as in our case, causing the model to overwrite its weights and exhibit this "catastrophic forgetting" phenomenon. According to Li and Hoiem [24], this could potentially be mitigated by freezing all or some of the early layers of the backbone (in this case, ResNet-50). Another reason could be the learning rate hyper-parameter, which can overshoot the optimal minimum of the loss function if too high, or yield sub-optimal predictions if too low, because gradient descent then converges slowly and requires many iterations to reach an optimal or near-optimal solution. This could potentially be addressed by tuning the learning rate hyper-parameter.
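As an illustration of the backbone-freezing mitigation suggested by Li and Hoiem [24], the sketch below freezes the ResNet-50 backbone of a DETR model before fine-tuning. It assumes the publicly available Hugging Face facebook/detr-resnet-50 checkpoint purely for illustration; the paper does not state which DETR implementation was used, and the optimizer settings simply reuse the hyper-parameters from Section 3.3.3.

```python
import torch
from transformers import DetrForObjectDetection

# Assumption: DETR with a ResNet-50 backbone loaded from the public Hugging Face
# checkpoint; this is illustrative, not the exact implementation used in the paper.
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Freeze every backbone parameter to limit catastrophic forgetting when
# fine-tuning on a small, domain-specific dataset (cf. Li and Hoiem [24]).
for name, param in model.named_parameters():
    if "backbone" in name:
        param.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"Trainable parameter tensors after freezing: {len(trainable)}")

# The partially frozen model can then be fine-tuned as before, e.g. with AdamW.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    weight_decay=1e-4,
)
```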
4.3. Making Sense of the Results
The evaluation of the results provided valuable insights into the most suitable models for future use in street navigation for visually impaired individuals, particularly in determining whether Transformer-based models like DETR or RT-DETR are the optimal choice. The findings indicated that while Transformer-based models offer certain advantages, they fall short in both accuracy and speed when compared to YOLO models, especially YOLOv8x. The superior performance of YOLOv8x, with its high accuracy and significantly faster inference times, suggests that it is better suited for real-time applications where timely and precise object detection is crucial, such as street navigation for the visually impaired. Therefore, in addressing the research question of whether object detection models utilizing the Transformer architecture can be more effective than other state-of-the-art models in terms of accuracy and speed for street navigation, the answer is no; Transformer-based models are not better suited. The results indicated that YOLO models, particularly YOLOv8x, outperform Transformer-based models. Thus, YOLOv8x emerged as the most effective model for this application among the models we evaluated, offering the best balance of accuracy and speed needed to assist visually impaired individuals in navigating streets safely and efficiently.

4.4. Discussion
In our study, the findings indicated that YOLOv8 models consistently outperform Transformer-based models, particularly DETR, in both accuracy and speed within the context of street navigation for visually impaired individuals. These results align with the literature, which highlights the efficiency and accuracy of YOLO models in object detection tasks. For instance, Acar et al. [10] emphasized the improved performance of YOLOv8 in real-time applications, which is confirmed by our findings, where YOLOv8 achieved a higher mAP and significantly faster inference times than DETR. However, the literature also suggested the potential of Transformer-based models, particularly in handling complex object detection scenarios, as noted by Carion et al. [13] with the introduction of DETR. Despite this, our results showed that DETR models, while innovative, do not yet surpass the well-optimized YOLO architecture in scenarios requiring rapid and accurate detection, such as street navigation for the visually impaired. This highlights a gap between the theoretical advantages of Transformers and their practical application in time-sensitive environments, suggesting that further optimization is needed for Transformer models to become competitive in this domain.

Some of the limitations we faced in this paper are computational power and dataset variation. In terms of computational power, while this research had access to standard off-the-shelf computational resources (Google Colab Pro+), dedicated resources would allow further tuning of hyper-parameters, training for more epochs, and experimenting with real-time video. In terms of dataset variation, it is important to have variation in the dataset in terms of geographical location, weather, and miscellaneous objects. Since infrastructure elements like roadblocks and tactile pavements vary by region, it is crucial to customize datasets for specific locations. Moreover, models must be trained to handle a range of weather conditions, particularly challenging scenarios like rain or fog. Additionally, including objects such as potholes or temporary construction is vital. Crowd-sourcing this data could further enhance the dataset's comprehensiveness.

5. Conclusion and Future Work
In conclusion, this paper has made several important contributions to the field of street navigation for visually impaired individuals.
First, we conducted a detailed evaluation of object detection models utilizing Transformer architectures, assessing their performance in terms of accuracy and speed against state-of-the-art models like YOLOv8. Our key findings indicate that Transformer-based models, such as DETR and RT-DETR, were outperformed by YOLOv8x and YOLOv8x-Worldv2 in both accuracy and speed, confirming that Transformer architectures are not yet ideal for street navigation applications in this context. Despite the aforementioned limitations, this paper presents a reliable and thorough evaluation of the models and architectures. YOLO models outperformed Transformer-based models like DETR primarily due to architectural differences: YOLO's single-stage detection, which operates on grid-based bounding boxes, allows faster and more accurate object detection, whereas DETR's two-stage process, which captures global context, requires more computational power and struggled with precise localization. DETR also experienced issues such as catastrophic forgetting during training and fine-tuning, leading to degraded performance. Second, we introduced an enhanced version of the Walking On the Road Dataset (WOTRv2), which is specifically tailored for object detection in street navigation for the visually impaired, offering a valuable resource for future research and development in this area. Third, we developed and implemented a comprehensive methodology and a structured approach for evaluating object detection models for visual impairment. These contributions collectively advance the development of more accurate and efficient solutions that can better assist visually impaired individuals in outdoor navigation.

In terms of future work, firstly, incorporating a feedback mechanism is crucial; while the detection and identification of objects by the models are useful, this information must be utilized effectively. Feedback could be auditory, visual, textual, or haptic, providing essential guidance for visually impaired individuals during navigation. Secondly, integrating hardware is necessary to make the software-based solutions more practical and accessible.

References
[1] I.-M. Lee, D. M. Buchner, The importance of walking to public health, Medicine & Science in Sports & Exercise 40 (2008) S512–S518.
[2] S. R. Flaxman, R. R. Bourne, S. Resnikoff, P. Ackland, T. Braithwaite, M. V. Cicinelli, T. Vos, Global causes of blindness and distance vision impairment 1990–2020: a systematic review and meta-analysis, The Lancet Global Health 5 (2017) e1221–e1234. doi:10.1016/S2214-109X(17)30393-5.
[3] A. Yekta, E. Hooshmand, M. Saatchi, H. Ostadimoghaddam, A. Asharlous, A. Taheri, M. Khabazkhoob, Global prevalence and causes of visual impairment and blindness in children: A systematic review and meta-analysis, Journal of Current Ophthalmology 34 (2022) 1–15. URL: https://www.jcurrophthalmol.org. doi:10.4103/joco.joco_135_21.
[4] G. A. Stevens, R. A. White, S. R. Flaxman, H. Price, J. B. Jonas, J. Keeffe, J. Leasher, K. Naidoo, K. Pesudovs, S. Resnikoff, et al., Global prevalence of vision impairment and blindness: magnitude and temporal trends, 1990–2010, Ophthalmology 120 (2013) 2377–2384. doi:10.1016/j.ophtha.2013.05.025.
[5] M. M. Islam, M. S. Sadi, K. Z. Zamli, M. M. Ahmed, Developing walking assistants for visually impaired people: A review, IEEE Sensors Journal 19 (2019) 2814–2827. doi:10.1109/JSEN.2018.2890423.
[6] S. Khan, S. Nazir, H. U. Khan, Analysis of navigation assistants for blind and visually impaired people: A systematic review, IEEE Access 9 (2021) 26712–26729. doi:10.1109/ACCESS.2021.3052415.
[7] G. Dimas, D. E. Diamantis, P. Kalozoumis, D. K. Iakovidis, Uncertainty-aware visual perception system for outdoor navigation of the visually challenged, Sensors 20 (2020) 2385. URL: https://www.mdpi.com/1424-8220/20/8/2385. doi:10.3390/s20082385.
[8] M. O. Masud, M. F. Rahman, M. R. Islam, M. S. Hossain, M. K. Rahman, M. M. Hasan, A smart assistive system for the visually impaired using raspberry pi and machine learning, IEEE Access 10 (2022) 11650–11659. doi:10.1109/ACCESS.2022.3146320.
[9] R. B. Islam, S. Akhter, F. Iqbal, M. S. U. Rahman, R. Khan, Deep learning based object detection and surrounding environment description for visually impaired people, Heliyon 9 (2023) e16924. URL: https://www.sciencedirect.com/science/article/pii/S2405844023004127. doi:10.1016/j.heliyon.2023.e16924.
[10] T. Acar, A. Solmaz, A. S. Bozkir, I. Cengiz, From pixels to paths: Sight - a vision-based navigation aid for the visually impaired, in: 2024 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2024. doi:10.1109/HORA61326.2024.10550694.
[11] A. B. Atitallah, Y. Said, M. A. B. Atitallah, M. Albekairi, K. Kaaniche, S. Boubaker, An effective obstacle detection system using deep learning advantages to aid blind and visually impaired navigation, Ain Shams Engineering Journal 15 (2024) 102387. doi:10.1016/j.asej.2023.102387, available online 16 July 2023.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
[13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229.
[14] X. Zhu, W. Su, L. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, in: International Conference on Learning Representations, 2020.
[15] Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, J. Chen, Detrs beat yolos on real-time object detection, 2024. URL: https://arxiv.org/abs/2304.08069. arXiv:2304.08069.
[16] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[17] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[18] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, Y. Shan, Yolo-world: Real-time open-vocabulary object detection, 2024. URL: https://arxiv.org/abs/2401.17270. arXiv:2401.17270.
[19] J. Terven, D.-M. Córdova-Esparza, J.-A. Romero-González, A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas, Machine Learning and Knowledge Extraction 5 (2023) 1680–1716. URL: https://www.mdpi.com/2504-4990/5/4/83. doi:10.3390/make5040083.
[20] O. K. T. Alsultan, M. T. Mohammad, A deep learning-based assistive system for the visually impaired using yolo-v7, Revue d'Intelligence Artificielle 37 (2023) 901–906. URL: http://dx.doi.org/10.18280/ria.370409. doi:10.18280/ria.370409.
[21] H. Xia, C. Yao, Y. Tan, S. Song, A dataset for the visually impaired walk on the road, Displays 79 (2023) 102486. URL: https://www.sciencedirect.com/science/article/pii/S0141938223001191. doi:10.1016/j.displa.2023.102486.
[22] G. Jocher, A. Chaurasia, J. Qiu, Ultralytics yolov8, 2023. URL: https://github.com/ultralytics/ultralytics.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[24] Z. Li, D. Hoiem, Learning without forgetting, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2018) 2935–2947. doi:10.1109/TPAMI.2017.2773081.