=Paper=
{{Paper
|id=Vol-3248/paper20
|storemode=property
|title=HILO: High-level and Low-level Co-design, Evaluation and Acceleration of Feature Extraction for Visual-SLAM using PYNQ Z1 Board
|pdfUrl=https://ceur-ws.org/Vol-3248/paper20.pdf
|volume=Vol-3248
|authors=Muhammad Bilal Akram Dastagir,Omer Tariq,Dongsoo Han
|dblpUrl=https://dblp.org/rec/conf/ipin/DastagirTH22
}}
==HILO: High-level and Low-level Co-design, Evaluation and Acceleration of Feature Extraction for Visual-SLAM using PYNQ Z1 Board==
HILO: High-level and Low-level Co-design, Evaluation and Acceleration of Feature Extraction for Visual-SLAM using PYNQ Z1 Board

Muhammad Bilal Akram Dastagir1, Omer Tariq1 and Dongsoo Han1
1 School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, South Korea

Abstract
Image features are widely employed in embedded computer vision applications, from object identification and tracking to motion estimation, 3D reconstruction, and visual simultaneous localization and mapping (VSLAM). Because such applications must run in real time over a continual stream of input data, efficient feature extraction and description is critical. High-speed processing is often associated with high power consumption, yet embedded systems are mostly power- and resource-limited, making the development of power-aware and compact solutions all the more important. This work evaluates the performance of low-cost feature detection and description algorithms implemented on particular embedded devices (embedded processing units, GPUs, and FPGAs). We implemented an ORB-based feature extraction hardware accelerator using a PYNQ overlay on the embedded PYNQ Z1 board. We demonstrate that a speedup of 8.38x was achieved using the hardware-accelerated core compared to the same algorithm running as a processor-based software solution.

Keywords
Feature Extractor, Feature Detector, Visual SLAM, FPGA, Acceleration, PYNQ, Overlay, ORB-SLAM

IPIN 2022 WiP Proceedings, September 5 - 7, 2022, Beijing, China
EMAIL: bilal@kaist.ac.kr (Muhammad Bilal Akram Dastagir); omertariq@kaist.ac.kr (Omer Tariq); ddsshhan@kaist.ac.kr (Dongsoo Han)
ORCID: 0000-0003-2990-4604 (Muhammad Bilal Akram Dastagir); 0000-0002-1771-6166 (Omer Tariq); 0000-0002-2396-1424 (Dongsoo Han)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Simultaneous Localization and Mapping (SLAM) has received much recognition in recent years because it enables path planning, map building, and navigation, which offer important technical advantages in Autonomous Navigation Systems [1] [2] [3]. SLAM is a technique used to map an unknown location while simultaneously localizing the agent on that same map. Visual SLAM (which uses images from cameras and other image sensors) and LiDAR SLAM (which uses a laser or a distance sensor) are the two types of SLAM currently in use. SLAM can compute pose estimates and a 3D reconstruction of the area in various configurations, from hand-held sequences to cars being driven over several city blocks. Many researchers have been exploring and adapting camera-based Visual-SLAM because of its simplicity, cost-effectiveness, and rapid deployment compared to alternative sensors in embedded systems [4] [5]. Visual-SLAM has various applications, from autonomous driving cars to other domains like airborne navigation, Advanced Driver Assistance Systems (ADAS), Augmented Reality (AR), and Virtual Reality (VR), and has also been explored as a new emerging positioning and navigation solution [6] [7] [8]. One of the most notable low-level features insensitive to scale, rotation, noise, and light is the Scale-Invariant Feature Transform (SIFT) [9].
Like SIFT, the Speeded-Up Robust Features (SURF) detector [10] performs almost as well while having a reduced computational complexity. Both of them use the Histogram of Gradients (HoG) to find repeatable keypoints in scale space and to produce descriptors. Although SIFT/SURF obtain high-quality features, real-time implementation of those features is challenging owing to heavy computation and memory access. Likewise, Visual-SLAM algorithms such as feature-based, direct, or RGB-D camera-based approaches are very compute-intensive and thus incur high latency in the image processing stages, making real-time implementation on resource-constrained ADAS devices challenging [11] [12] [13]. The literature contains only a relatively small number of complete embedded processing chains that meet the ADAS requirements for fast processing times, low power consumption, and area-aware design footprints. This problem is partially caused by the inherent complexity of modern image processing algorithms and the amount of data (number of pixels) that must be handled. One way to address this high-performance computing issue is to use a Graphics Processing Unit (GPU) or a Field Programmable Gate Array (FPGA) for fast execution and real-time implementation of the algorithms, and various research efforts have explored this domain [14] [15]. Although the conventional FPGA design environment supports algorithm acceleration, it lacks software-like programming functionality, which results in time-consuming application development [16]. This paper presents an efficient high-level and low-level co-design, evaluation, and acceleration of real-time ORB feature extraction for visual-SLAM using the PYNQ Z1 board. We evaluate the implementation of feature detection on the PYNQ platform, whose system-on-chip (SoC) encapsulates an Arm processor and an FPGA behind a Python-based development environment and tools. This integrated hardware/software co-design environment encapsulating the FPGA forms a robust, rapid, and reliable co-framework in which the execution of compute-intensive algorithms is accelerated significantly. The rest of this paper is organized as follows. Section II is dedicated to the literature review of visual SLAM in the software and hardware domains. Section III reviews the background of the PYNQ architecture and its co-design with the ORB-based feature algorithm. Section IV discusses the evaluation of our experimental setup for the proposed hardware feature extractor, the implementation using hardware-software co-design, and the promising results. In Section V, we summarize our conclusion and future work.

2. Related Work

The simultaneous localization and mapping (SLAM) problem has received considerable attention from the robotics world, and many scholarly solutions have been put forth. Lee et al. [17] suggested a real-time RGB-D 3D SLAM system based on a GPU and only an RGB-D sensor. The 3D SLAM system's operation speed is accelerated by GPU computing, with an average processing rate above 20 Hz. Giubilato et al. [18] proposed a ROS-interfaced visual SLAM setup in an indoor setting using a 6-wheeled ground rover equipped with a stereo camera, LiDAR, and a TX2 computing platform. By leveraging an embedded computing platform to report realistic results during real-life mobile robot operations, the investigation demonstrated image rectification on the CPU and GPU and presented a fresh benchmark for visual SLAM algorithms.
Peng et al. [19] provided a comprehensive and quantitative performance assessment of ORB-SLAM2 and OpenVSLAM on the platform. Tianji et al. [20] investigated the front end of visual SLAM using an integrated GPU and developed a front-end solution for parallelization. The visual SLAM system was deployed on the embedded platform using GPU parallelization, and the EuRoC MAV dataset was used to verify the system's efficacy. On the other hand, D. T. Tertei et al. [21] proposed an efficient tri-matrix hardware accelerator architecture based on a Virtex5 XC5VFX70T FPGA that leverages matrix multiplications and updates the cross-covariance matrix in the correction loop of a 3D EKF-based SLAM algorithm. The accelerator runs at 44.39 Hz and permits real-time visual SLAM with 45 observed and corrected landmarks. Vourvoulakis et al. [22] demonstrated a completely pipelined and optimized architecture for detecting SIFT keypoints and extracting SIFT descriptors in real-time. The system is designed for robotic vision applications and runs on a Cyclone IV FPGA chip with a clock speed of 21.7 MHz and a feature extraction time of 46 nanoseconds. Furthermore, the suggested solution has good responsiveness and repeatability values, and its matching ability is directly comparable to floating-point software SIFT implementations. Fang et al. [23] proposed an ORB-based feature extractor capable of running at 203 MHz that reduces energy consumption and computational intensity and outperformed several state-of-the-art processors in terms of latency, energy, and performance. Qi Ni et al. [24] introduced SURF feature detection, BRIEF descriptor construction, and matching algorithms for binocular vision systems that provided feature point correspondences and parallax information at 162 fps at 640x480 on a ZYNQ SoC device. Ayoub Mamri et al. [25] proposed an implementation of the ORB-SLAM2 algorithm targeting a heterogeneous hardware/software optimization approach. They executed the optimization in an FPGA-based heterogeneous embedded architecture and compared the outcomes with those of other heterogeneous architectures, including powerful embedded GPUs (NVIDIA Tegra TX1) and high-end GPUs (NVIDIA GeForce 920MX). Their implementation used high-level synthesis-based OpenCL for the FPGA and CUDA for the NVIDIA-targeted devices. T. Imsaengsuk et al. [26] proposed a feature detector and descriptor based on the ORB algorithm that can be accelerated by exploiting parallelism and pipelining. They applied a binomial filter instead of a Gaussian filter, used a heap sorting algorithm to balance the number of feature points, and designed the data processing pipeline and orientation estimation method to obtain an optimal hardware architecture. In the proposed design, they maintained the consistency of corner numbers; however, their feature detector could only detect single corner points.

3. HILO co-design using PYNQ overlay

3.1. PYNQ

PYNQ [27] [28] is a framework created by Xilinx to make it easier for software programmers to use FPGAs. The PYNQ package allows designers to drive hardware designs from Python instead of C-based drivers. For instance, designers may conduct DMA transfers with just a few lines of Python code, as sketched below. Through Jupyter notebooks, designers may leverage the FPGA in the PYNQ framework. The Zynq ZC7020 system-on-chip integrates an FPGA and a dual-core CPU onto a single chip, and the PYNQ overlay is installed on the PYNQ-Z1 FPGA board.
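As a minimal illustration of this workflow, the following Python sketch loads an overlay and performs a DMA round trip using the standard pynq API. The bitstream name ("hilo.bit") and the DMA instance name (axi_dma_0) are assumptions for illustration; they depend on the actual hardware design loaded onto the board.

```python
# Minimal sketch (assumed overlay/IP names): load a bitstream and run a DMA round trip with PYNQ.
import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("hilo.bit")          # hypothetical bitstream name
dma = overlay.axi_dma_0                # hypothetical DMA instance inside the overlay

# Physically contiguous buffers, as required by the DMA engine
in_buf = allocate(shape=(640 * 480,), dtype=np.uint8)
out_buf = allocate(shape=(640 * 480,), dtype=np.uint8)

in_buf[:] = np.random.randint(0, 256, size=in_buf.shape, dtype=np.uint8)

dma.sendchannel.transfer(in_buf)       # stream image data into the programmable logic
dma.recvchannel.transfer(out_buf)      # receive the processed stream back
dma.sendchannel.wait()
dma.recvchannel.wait()

in_buf.freebuffer()
out_buf.freebuffer()
```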
3.2. OpenCV

The widely used Open Source Computer Vision Library (OpenCV) is the foundation for various image processing applications. It is open-source and contains over 2500 optimized computer vision and machine learning algorithms. The library is written in C++ and includes a Python API, making it ideal for the PYNQ platform. OpenCV has extensive support for image and video file handling and camera interfacing. If the platform supports vector units like SSE or NEON, OpenCV functions use them. CUDA and OpenCL interfaces are being actively developed to support GPU execution. The PYNQ software already includes OpenCV, so its functions can be used directly in a Python application.

3.3. Hardware Library

Figure 1 depicts the system architecture, including the layers and their connections. PYNQ provides a web-based interface for programming Python applications using notebooks. On a single board, several of these notebooks can run in parallel. PYNQ provides modules to interface with the programmable logic alongside the standard Python libraries. Modules such as Direct Memory Access (DMA) and Memory Mapped I/O (MMIO) provide different ways to exchange data with the hardware IP cores.

Figure 1. HILO co-design framework using PYNQ

The Overlay module controls the bit file loaded into the programmable logic and the hardware hierarchy. When using the DMA data exchange, physically contiguous memory must be allocated so that the DMA IP core can access a data block in ascending order. As a result, the link module provides contiguous memory allocation in a dedicated region of the Linux memory hierarchy. The introduced library is divided into two sections. The first is a custom bit file containing the kernels for feature extraction. The second is the Python library, which provides the application API and handles the interaction with the hardware. It accesses the programmable logic through the PYNQ framework's interfaces. This design allows dynamic selection of the target computing unit and future expansions. These kernels can be implemented efficiently using a sliding-window operation on FPGAs. This method allows streaming by caching image lines; the filter mask's height determines the number of cached lines.

3.4. Overlay

The OpenCV library of the main PYNQ project does not support image processing accelerators running on FPGAs. Every OpenCV function call on the PYNQ platform is executed on the ARM CPUs with their NEON units. Therefore, we decided to take a first step towards supporting FPGAs by developing a Python library that extends the capabilities of OpenCV and offloads computation-intensive tasks to the programmable logic. This library concept consists of several parts. The hardware design for the programmable logic part of the PYNQ platform is called an overlay [29]. Additionally, a Python library interfacing the overlay and providing the API to the user is needed. Fig. 2 shows a PYNQ-based notebook in which both the hardware and software libraries are imported into the Python framework for rapid prototyping and development of the application. Similarly, Fig. 3 demonstrates how the FPGA bitstream file is loaded, along with the initialization of the DMA AXI engine and its assignment to Python variables for the hardware-software handshaking that drives the acceleration. A minimal example of the CPU-only OpenCV path that this library aims to offload is sketched below.

Figure 2. Importing Software, Hardware Libraries using Overlay
Figure 3. Loading FPGA bit File using Overlay into Jupyter PYNQ
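Since every stock OpenCV call on PYNQ runs on the ARM cores, the software-only path can be sketched directly against OpenCV's Python API. The test image name and keypoint budget below are illustrative assumptions; the calls themselves are standard OpenCV.

```python
# CPU-only baseline: ORB feature extraction with stock OpenCV (runs on the ARM cores under PYNQ).
import cv2

img = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)   # hypothetical test image

orb = cv2.ORB_create(nfeatures=500)                   # oFAST keypoints + rBRIEF descriptors
keypoints, descriptors = orb.detectAndCompute(img, None)

vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("orb_cpu.png", vis)
print(len(keypoints), "keypoints detected")
```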
3.5. ORB Feature Extractor

A process for extracting features from an image is known as feature extraction. It is divided into two parts: oFAST (Oriented Features from Accelerated Segment Test)-based feature extraction and BRIEF (Binary Robust Independent Elementary Features)-based descriptor computation, with the Gaussian filter stage available only in the FPGA environment, which allows parallel computation, unlike the software implementation.

3.5.1. Gaussian Kernel

The image is blurred using a Gaussian filter to decrease noise and eliminate speckles from the image. It is critical to filter out high-frequency components not related to the gradient filter in use; otherwise, they may lead to false edge detection. The Gaussian function is given in Equation 1:

G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}    (1)

where \sigma is the distribution's standard deviation and the mean of the distribution is assumed to be 0 [30]. The Gaussian function is a smoothing operator that provides a probability distribution for noise or data and is employed in a variety of study fields. When working with images, we use the two-dimensional Gaussian function, which is the product of two 1D Gaussian functions. The Gaussian filter employs the 2D Gaussian distribution function as a point-spread function, which is accomplished by convolving the 2D Gaussian distribution function with the image to obtain a discrete approximation to the Gaussian smoothing.

3.5.2. Extractor

For a real-time implementation of SLAM on a mobile robot with limited computational resources, Features from Accelerated Segment Test (FAST) is an excellent choice among corner detection algorithms comparable to SIFT and SURF, and it can be used to extract feature points and map objects in image processing tasks. To calculate the orientation component of FAST, ORB employs the intensity centroid. The intensity centroid assumes that a corner's intensity is offset from its center point, and this offset may be used to infer an orientation. The image moments are calculated as in Eq. 2:

m_{pq} = \sum_{x,y} x^p y^q I(x, y)    (2)

In Equation (2), p and q are natural numbers representing the moment order in each dimension, and I(x, y) represents the intensity of the pixel at the relative location. With this definition it is feasible to calculate the centroid:

C = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right)    (3)

From the centroid C and the corner's center point O, the orientation of the vector OC is then computed using Eq. 4:

\varphi = \mathrm{atan2}(m_{01}, m_{10})    (4)

where atan2 is the quadrant-aware arctangent. It is also possible to compute sin(\varphi) and cos(\varphi) directly from the moments, as in Eq. 5:

\sin(\varphi) = \frac{m_{01}}{\sqrt{m_{01}^2 + m_{10}^2}}, \quad \cos(\varphi) = \frac{m_{10}}{\sqrt{m_{01}^2 + m_{10}^2}}    (5)

A pixel's status as a corner is determined by the segment test, and oFAST combines this test with the orientation calculation. This data is used to create a set of features. The pseudo-code for the FAST algorithm [31] is as follows; a short Python sketch of the corresponding detection and orientation computation appears after the list:

Input: any RGB or greyscale image
Output: a set of keypoints for the selected image
(i) Initiate the FAST object with "FastFeatureDetector_create()" with default values.
(ii) Find the keypoints using the "detect()" method.
(iii) Draw the keypoints using the "drawKeypoints()" method.
(iv) Disable Non-Max Suppression (NMS) to remove the bounding box.
(v) Display the image with the detected keypoints.
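As a rough worked example of Equations (2)-(5) and the FAST pseudo-code above, the following Python/OpenCV sketch detects FAST keypoints and computes the intensity-centroid orientation over a small patch around each keypoint. The patch size (31x31), blur parameters, and image name are assumptions for illustration, not values taken from the paper's hardware design.

```python
# Sketch: FAST detection plus intensity-centroid orientation (Eqs. 2 and 4), assuming a 31x31 patch.
import cv2
import numpy as np

img = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)       # hypothetical test image
img = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)           # Eq. 1: smooth before detection

fast = cv2.FastFeatureDetector_create()                   # step (i) of the pseudo-code
keypoints = fast.detect(img, None)                         # step (ii)

def orientation(patch):
    """Return the intensity-centroid angle of a patch (image moments m10, m01 and atan2)."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]].astype(np.float64)
    xs -= patch.shape[1] / 2.0                             # coordinates relative to the patch centre
    ys -= patch.shape[0] / 2.0
    m10 = np.sum(xs * patch)                               # first-order moments
    m01 = np.sum(ys * patch)
    return np.arctan2(m01, m10)                            # quadrant-aware arctangent

r = 15                                                     # half of the assumed 31x31 patch
for kp in keypoints[:10]:
    x, y = int(kp.pt[0]), int(kp.pt[1])
    if r <= x < img.shape[1] - r and r <= y < img.shape[0] - r:
        patch = img[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
        print(f"keypoint at ({x},{y}): angle = {np.degrees(orientation(patch)):.1f} deg")
```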
3.5.3. Descriptor

We will use the rotation-aware Binary Robust Independent Elementary Features (rBRIEF) algorithm as the descriptor block. The BRIEF technique converts each keypoint's neighborhood into a binary feature vector, which can subsequently be used to represent an object. A feature vector containing only the values 1 and 0 is considered binary; overall, the feature vector is a 128-512 bit string. Starting with image smoothing with a Gaussian kernel, we avoid making the descriptor sensitive to high-frequency noise. A random pair of pixels is then selected in a defined neighborhood around the keypoint. A pixel's defined neighborhood is a square of given width and height, known as a patch. The first pixel in each random pair is drawn from a Gaussian distribution centered on the keypoint with a standard deviation (spread) of sigma. The second pixel of the pair is drawn from a Gaussian distribution centered on the first pixel with a standard deviation of sigma divided by two. The corresponding bit is set to 1 or 0. Another random pair is then chosen and assigned a value, and this procedure is repeated 128 times per keypoint to obtain a 128-bit vector; such a vector is built for every keypoint in the image. However, BRIEF is not rotation invariant; therefore, ORB utilizes rotation-aware BRIEF [32] [33]. The in-plane rotation invariance of ORB, one of its most appealing characteristics, is achieved by rotating the binary test coordinates of rBRIEF in accordance with the orientation found earlier by oFAST. A set of binary intensity tests is used to create the rotation-aware BRIEF (rBRIEF) descriptor. A binary test \tau is defined as follows:

\tau(p_1, p_2) = \begin{cases} 0, & I(p_1) < I(p_2) \\ 1, & I(p_1) \ge I(p_2) \end{cases}    (6)

where I(p_i) represents the intensity of the point p_i, and p_1 and p_2 are 2D points. The feature is defined by a vector of n binary tests:

f_n(p) = \sum_{1 \le i \le n} 2^{i-1} \, \tau(p_{1i}, p_{2i})    (7)

Before carrying out these tests, it is recommended to smooth the image first, for example with a Gaussian filter. The vector normally has a dimension of 256. The algorithm for learning the rotation-aware BRIEF (rBRIEF) test pattern [34] is as follows; a short Python sketch of the binary tests in Eqs. (6) and (7) appears after the list:

(i) Run each candidate test against all of the training patches.
(ii) Order the tests by their deviation from a mean of 0.5 to generate the vector T.
(iii) Move the first test from T into the result vector R, applying greedy search.
(iv) Take the next test from T and compare it with the tests already in R. If its absolute correlation exceeds a certain threshold, discard it; otherwise, add it to R.
(v) Repeat the preceding step until R contains 256 tests. If fewer than 256 tests result, increase the threshold and try again.
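The binary tests in Eqs. (6) and (7) can be sketched in a few lines of Python. Note that the test pattern below is randomly generated rather than the learned rBRIEF pattern, and the patch size, number of tests, and keypoint angle are illustrative assumptions.

```python
# Sketch of a steered binary descriptor (Eqs. 6-7): rotate a random test pattern by the
# keypoint angle, then compare pixel intensities. Patch size and 256 tests are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_TESTS, PATCH = 256, 31
# Random (p1, p2) test offsets around the patch centre (the real rBRIEF pattern is learned).
pattern = rng.normal(0, PATCH / 5.0, size=(N_TESTS, 2, 2))

def rbrief(patch, angle):
    """Return an N_TESTS-bit descriptor for a (PATCH x PATCH) patch with orientation `angle`."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    centre = (PATCH - 1) / 2.0
    bits = np.zeros(N_TESTS, dtype=np.uint8)
    for i, (p1, p2) in enumerate(pattern):
        # Steer the test coordinates by the oFAST orientation, then clamp into the patch.
        q1 = np.clip(np.round(rot @ p1 + centre).astype(int), 0, PATCH - 1)
        q2 = np.clip(np.round(rot @ p2 + centre).astype(int), 0, PATCH - 1)
        bits[i] = 1 if patch[q1[1], q1[0]] >= patch[q2[1], q2[0]] else 0   # Eq. (6)
    return bits

patch = rng.integers(0, 256, size=(PATCH, PATCH)).astype(np.float32)
desc = rbrief(patch, angle=np.pi / 6)
print(desc[:16], "...")   # descriptor bits; Eq. (7) packs them as a sum of 2^(i-1) * tau_i
```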
3.6. Hardware accelerated ORB feature extractor

The proposed FPGA-based hardware accelerator for the feature extraction system is shown in Figure 4. ARM multi-core processors handle image loading and other computations in the system. The connected memory system, the feature extraction accelerator, and the ARM processors all communicate over the AXI bus. At the top system level, the feature extraction accelerator consists of a data-access part and a kernel part. Instruction memory, a control unit, and the feature extractor make up the kernel. An input buffer, an output buffer, and a DMA interface make up the data-access part. The DMA interface uses the AXI bus to directly access the image's data stream and stores it in the feature extractor's input buffer. The feature extractor's results are saved in the output buffer before being sent to the ARM multi-core system, which uses them for further processing. Hardware elements that support a sliding-window access pattern are necessary to process an input stream of pixels from onboard memory or image sensors. This arrangement efficiently exploits the spatial and temporal locality of the operations. The accelerator receives the pixel stream at a rate of one pixel per cycle, which is the standard order for image data. For feature detection and rBRIEF descriptor computation, the pixel stream travels through two cooperating data paths. The first path contains the FAST feature detector as a functional block. Blocks for orientation computation, BRIEF descriptor generation, and Gaussian image smoothing make up the second path. The rBRIEF block design in this study blends replication with a static approach of pattern reordering. The pixel stream is delayed using a FIFO to keep the two paths synchronized. After a specific number of fixed rotations, the FIFO structure allows elements to leave the structure in FIFO order. It can be implemented efficiently in memory with a circular index indicating the position from which to read and write. The accelerator synchronizes all sliding windows required for the proper operation of the aforementioned functional blocks while accessing each pixel of the input image only once. Another crucial component of the proposed architecture is tiling, which increases performance and the flexibility to process various image resolutions while requiring less on-chip memory storage. With tiling, an image is divided into several small rectangular "tiles". The accelerator can process each of those fixed-size tiles transparently, since it makes no difference whether the incoming data is a fragment of a single image or a subset of a larger one. The accelerator's tiling enhances locality and lowers the need for on-chip memory. The most difficult aspect of the ORB extraction is implemented in the BRIEF block, since it needs memory access to 512 locations that depend on the feature angle in order to compute the descriptor. Finding a suitable schedule for these accesses is challenging since the angle depends on the input data. The data required by the tests is stored in the implementation's 37 x 37 sliding window. This window is synchronized with the other blocks so that it holds a patch centered on the feature candidate. The structure must permit descriptor generation using the sliding-window dataflow approach together with random access to 512 locations. A small Python sketch of the tiling scheme appears after Figure 4.

Figure 4. Proposed hardware accelerated ORB feature extractor
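To make the tiling idea concrete, the sketch below splits a 640x480 grayscale frame into fixed-size tiles and streams each one independently. The 120x160 tile size and the process_tile stand-in for the accelerator are assumptions for illustration, not parameters of the actual design.

```python
# Sketch of the tiling scheme: a 640x480 frame is cut into fixed-size tiles and each tile is
# streamed on its own, so the accelerator never needs to buffer the whole image on chip.
import numpy as np

TILE_H, TILE_W = 120, 160            # assumed tile size (divides the frame evenly in this sketch)

def process_tile(tile):
    """Stand-in for the hardware accelerator: here it just counts bright pixels."""
    return int(np.count_nonzero(tile > 128))

frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)

results = {}
for ty in range(0, frame.shape[0], TILE_H):
    for tx in range(0, frame.shape[1], TILE_W):
        tile = frame[ty:ty + TILE_H, tx:tx + TILE_W]
        # Each tile goes through the same streaming interface a full image would use.
        results[(ty, tx)] = process_tile(tile)

print(len(results), "tiles processed")   # 4 x 4 = 16 tiles for a 640x480 frame
```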
4. Evaluation

We study and compare the performance impact of employing PYNQ for feature extraction application development using C, Python, OpenCV libraries, and custom accelerators. The various testing configurations used in the experimental setting are explained before the study and analysis of the results.

4.1. Experimental Setup

The Xilinx PYNQ platform [35] was employed for this study; it contains the Zynq system-on-chip and 512 MB of DDR3 memory. The dual-core ARM CPU operates at 667 MHz, and the FPGA logic operates at 125 MHz. In the experiment, feature extraction was performed on a 640x480 grayscale image, which is a common step, and ORB was chosen as the accelerated algorithm because of the robust, open-source image processing libraries used in visual SLAM [36]. Three alternative software and hardware setups were used in the experimental analysis. To achieve our goal, C and Python performance must be compared in an embedded development environment for feature extraction. A specially customized and integrated OpenCV kernel was utilized in this experiment. Utilizing these libraries, the register transfer level intellectual property (IP) blocks for Gaussian filtering, the FAST algorithm, and the rBRIEF descriptor were created and synthesized on the FPGA to produce the bitstream file used by the PYNQ ARM processor in the Jupyter Notebook through an overlay with direct memory access (DMA). The feature extraction accelerator was developed with the Vitis HLS tool for rapid productivity. Three distinct streaming IPs were developed: Gaussian Filter, Extractor, and Descriptor. The hardware-accelerated version on the FPGA fabric uses the three IPs operating at 125 MHz. Multiple research articles show that feature extraction on FPGAs may outperform software implementations on processors and GPUs.

4.2. Implementation

Figure 5 shows the primary function written in Python, which invokes both the software and the hardware version of the feature extraction on the test image. In addition, Fig. 6 and Fig. 7 show the feature extraction implementation on the processor and the FPGA, respectively.

Figure 5. Main function of the processor and FPGA functions
Figure 6. Feature Extraction implemented on Processor
Figure 7. Feature Extraction implemented on FPGA

4.3. Results and Analysis

Fig. 8 shows the results of implementing the co-design of the feature extraction both on the PYNQ ARM processor and the FPGA, as well as the Intel CPU implementation. The red markers depict the features found using the FPGA, while green depicts those found using the processor. Similarly, blue depicts the OpenCV-Python-based features on an Intel CPU with 16 GB of DDR4 RAM. The timing information is presented in Table 1. The results of executing feature extraction on three distinct hardware and software setups are shown in Fig. 9, Fig. 10 and Table 1: Python's performance on the Intel CPU, a Python OpenCV implementation achieved with minimal effort on the PYNQ ARM A9 cores, and Python in combination with the hardware-accelerated core. The intricacy of portability and the need for cross-compilers and device drivers are reduced when using a platform like PYNQ, where Python serves as the primary interface for programmers to the hardware. Programming that reads and writes data via MMIO and DMA is made possible by PYNQ, which greatly reduces system design complexity. A developer may quickly construct, test, and modify their program using the profiling and debugging capabilities built into Python or available through libraries and packages. The hardware-accelerated core achieved a speedup of 8.38x compared to the feature extraction algorithm running solely on the ARM processor. A short Python sketch of how such a timing comparison can be set up in a notebook appears after Figure 9.

Figure 8. Feature Extraction using three configurations

Table 1. Experimental Result Comparison for Feature Extraction Execution Time

| Configuration | Clock Frequency | Time (s) | Speedup (x) |
| --- | --- | --- | --- |
| Intel CPU with 16 GB of DDR4 RAM | 2.3 GHz 8-Core Intel Core i9 | 0.04291 | 2.43x slower than FPGA |
| PYNQ CPU with 512 MB of DDR3 RAM | 667 MHz Dual-Core ARM | 0.148055 | 8.38x slower than FPGA |
| PYNQ FPGA (Xilinx xc7z020clg400-1) | 125 MHz | 0.017659 | 2.43x faster than Intel CPU; 8.38x faster than PYNQ ARM CPU |

Figure 9. Execution Time Comparison using three platforms
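As a minimal sketch of how the timing comparison in Table 1 could be reproduced in a notebook, the snippet below times the software ORB call and leaves a placeholder for a hardware-accelerated call; extract_features_fpga stands in for the overlay-backed function and is an assumption, not the paper's actual API.

```python
# Sketch: timing the software path vs. a hypothetical hardware-accelerated path and
# reporting the speedup, in the spirit of Table 1. `extract_features_fpga` is assumed.
import time
import cv2

img = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)   # hypothetical 640x480 test image

def extract_features_cpu(image):
    orb = cv2.ORB_create()
    return orb.detectAndCompute(image, None)

def extract_features_fpga(image):
    # Placeholder for the overlay-backed extractor (DMA in, keypoints/descriptors out).
    raise NotImplementedError("replace with the hardware-accelerated call")

def average_time(fn, image, runs=10):
    start = time.perf_counter()
    for _ in range(runs):
        fn(image)
    return (time.perf_counter() - start) / runs

t_cpu = average_time(extract_features_cpu, img)
# t_fpga = average_time(extract_features_fpga, img)
# print(f"CPU {t_cpu:.5f} s, FPGA {t_fpga:.5f} s, speedup {t_cpu / t_fpga:.2f}x")
print(f"CPU {t_cpu:.5f} s per frame")
```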
5. Conclusion & Future Work

As system-on-chips become more heterogeneous and complex, a more software-oriented development environment is required. Xilinx launched PYNQ so that software developers can access the FPGA from Python. The combination of Python software and the parallel performance capability of the FPGA is a big step toward reaching a broader developer community, comparable to that of the Raspberry Pi and Arduino.

Figure 10. Speedup comparison using three platforms

This research examined the performance of popular image processing methods, such as feature extraction for visual SLAM, in Python and with specialized hardware accelerators to better understand the performance and capabilities of a Python and FPGA development environment. With a speedup of up to 8.38x, the results are very promising, with FPGA implementations able to match and even exceed CPU performance. Furthermore, the results demonstrate that Python developers may still improve speed in this way despite using highly optimized modules like OpenCV. This early research shows the operation of PYNQ and how Python communicates with the hardware accelerators and programmable fabric. The results are promising, so we are currently testing more algorithms in various visual SLAM-based image processing and machine learning domains. In future work, we will design, evaluate, and implement hardware-accelerated feature matching for Visual SLAM.

6. Acknowledgements

This research was supported by the Capacity Enhancement Program for Scientific and Cultural Exhibition Services through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (2018X1A3A106860331). We thank Professor Dongsoo Han from KAIST, who provided insight and expertise that greatly assisted the research, particularly suggestions that greatly improved the manuscript. We would also like to thank the editors of the IPIN 2022 conference, Hong Yuan, Dongyan Wei, Wen Li and Antoni Pérez-Navarro, for sharing their pearls of wisdom with us during this research.

References

[1] Bailey T et al. 2006 IEEE Robotics & Automation Magazine 13 108–117 URL 10.1109/MRA.2006.1678144
[2] Lin S C 2018 Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
[3] Yang S et al. 2017 Robotics and Autonomous Systems 93 116–134 URL https://doi.org/10.1016/j.robot.2017.03.018
[4] Lategahn H, Geiger A and Kitt B 2011 Visual SLAM for autonomous ground vehicles 2011 IEEE International Conference on Robotics and Automation, Shanghai, China (IEEE) URL 10.1109/ICRA.2011.5979711
[5] Jeong W Y and Lee K M 2006 Visual SLAM with line and corner features 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 09-15 October 2006, Beijing, China URL 10.1109/IROS.2006.281708
[6] Ni Q 2019 Proceedings of the 2019 4th International Conference on Multimedia Systems and Signal Processing
[7] Dumble S J and Gibbens P W 2015 Journal of Intelligent & Robotic Systems 78 185–204
[8] Yun S 2018 International Journal of Control, Automation and Systems 16 912–920
[9] Lowe D G 2004 International Journal of Computer Vision 60 91–110
[10] Bay H, Ess A, Tuytelaars T and Van Gool L 2008 Speeded-up robust features (SURF) Computer Vision and Image Understanding 110 346–359
[11] Taketomi T, Uchiyama H and Ikeda S 2017 Visual SLAM algorithms: A survey from 2010 to 2016 9 1–11
[12] Tertei D T, Piat J and Devy M 2016 Computers & Electrical Engineering 55 123–137
[13] Sun R, Liu P, Wang J and Zhou Z 2017 2017 IEEE International Symposium on Circuits and Systems (ISCAS) 1–4
[14] Park I K 2010 IEEE Transactions 22 91–104
[15] Hajirassouliha A 2018 Signal Processing: Image Communication 68 101–119
[16] Schmidt A G, Weisz G and French M 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
[17] Lee D, Kim H and Myung H 2012 2012 9th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI)
[18] Giubilato R et al. 2019 Measurement 140 161–170 URL https://doi.org/10.1016/j.measurement.2019.03.038
[19] Peng T et al. 2020 Electronic Imaging 2020 325–326 URL 10.2352/ISSN.2470-1173.2020.6.IRIACV-074
[20] Ma T 2021 Wireless Communications and Mobile Computing 2021–2021
[21] Tertei D T, Piat J and Devy M 2014 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14) 1–6
[22] Vourvoulakis J, Kalomiros J and Lygouras J 2016 Microprocessors and Microsystems 40 53–73
[23] Fang W, Zhang Y, Yu B and Liu S 2017 2017 International Conference on Field Programmable Technology (ICFPT) 275–278
[24] Ni Q 2019 Proceedings of the 2019 4th International Conference on Multimedia Systems and Signal Processing
[25] Mamri A et al. 2021 ORB-SLAM accelerated on heterogeneous parallel architectures E3S Web of Conferences 229, 01055 (2021) URL https://doi.org/10.1051/e3sconf/202122901055
[26] Imsaengsuk T and Pumrin S 2021 Feature Detection and Description based on ORB Algorithm for FPGA-based Image Processing 2021 9th International Electrical Engineering Congress (iEECON) pp 420–423 URL 10.1109/iEECON51072.2021.9440232
[27] PYNQ Z1 Development Boards URL http://www.pynq.io/board.html
[28] PYNQ Z1 Board URL https://www.student-circuit.com/news/learn-how-to-program-socs-with-pynq/
[29] PYNQ Overlays URL https://pynq.readthedocs.io/en/v2.0/pynq_overlays.html
[30] Misra S, Wu Y et al. 2020 Machine learning assisted segmentation of scanning electron microscopy images of organic-rich shales with feature extraction and feature ranking URL 10.1016/b978-0-12-817736-5.00010-7
[31] Kallasi F, Rizzini D L and Caselli S 2016 IEEE Robotics and Automation Letters 1 176–183
[32] Yang Y, Wang X, Wu J, Chen H and Han Z 2015 The 27th Chinese Control and Decision Conference 4996–4999
[33] Fan G, Xie Z C, Huang W, Cao L and Wang IEEE Transactions on Circuits and Systems II: Express Briefs
[34] Pham T H, Tran P and Lam S K 2019 IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27 747–756
[35] PYNQ Board IO URL http://www.pynq.io/board.html
[36] Taranco R, Arnau J M and González A 2021 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 11–21