CUDA-Based Algorithm for Lidar Position Determination in Mobile Robotics

Lesia Mochurad, Ostap-Vasyl Matviiv, Halyna Lema, Roksolana Vilhutska

Lviv Polytechnic National University, 12 Bandera street, Lviv, 79013, Ukraine

Abstract
Accurate assessment of the Lidar position has always been a fundamental task in any navigation system: localization, dynamic map construction, and path planning. In mobile robotics, it is usually essential for the robot to know where it is in a known or unknown environment. However, processing the large amount of data obtained from LiDAR requires considerable time and computing resources. Therefore, optimizing the Lidar localization problem to obtain a high-quality real-time solution is becoming increasingly relevant. The paper proposes an algorithm for optimizing the computational process of determining the Lidar position. The method is based on CUDA technology and uses graphics processors. The approach is evaluated in comparison with OpenMP technology. Based on the proposed algorithm, a speedup of 19.8 was obtained, which is four times higher than the speedup obtained using OpenMP technology. At the same time, the error in determining the Lidar position was 0.001.

Keywords
Particle Filter algorithm, optimization, parallelization, robotics problem.

MoMLeT+DS 2023: 5th International Workshop on Modern Machine Learning Technologies and Data Science, June 3, 2023, Lviv, Ukraine.
EMAIL: lesia.i.mochurad@lpnu.ua (L. Mochurad); ostap20021906@gmail.com (O.-V. Matviiv); halyna.v.mykhailiak@lpnu.ua (H. Lema); roksoliana.b.vilhutska@lpnu.ua (R. Vilhutska)
ORCID: 0000-0002-4957-1512 (L. Mochurad); 0000-0002-6929-2537 (O.-V. Matviiv); 0000-0001-5298-7693 (H. Lema); 0000-0002-9291-8606 (R. Vilhutska)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Autonomous localization and environment perception are essential for a mobile robot to perform high-level tasks. For robot localization, odometry and inertial sensors cannot avoid the problem of error accumulation. Therefore, exteroceptive sensors must be used to achieve effective localization and mapping. Among various external sensors, two-dimensional laser sensors can provide accurate and reliable environmental information with a wide viewing angle. They are widely used in perceptual tasks, including localization [1-2], mapping, and place recognition [3].

The problem of determining the Lidar position with optimal accuracy and in an acceptable time is relevant in various fields of application. Examples include smart homes (robot vacuum cleaners), vehicle route planning, dynamic map construction, collision avoidance, atmospheric research, cartography, economic efficiency management [4], etc. As is known [5, 6], the Particle Filter algorithm solves the problem above. However, if the Lidar moves many times in space, the number of measurement iterations after each movement becomes significant. Accordingly, the execution time of this algorithm increases several times, so a real-time decision becomes impossible. One option for solving this problem is the development of a parallel algorithm [7] and its optimization using CUDA technology [8] and the graphics processors of an NVIDIA video card. Massively parallel hardware can perform significantly more operations per second than a CPU at a fairly similar cost, resulting in a performance increase of 50 times or more in situations that allow it.

CUDA follows the SIMT (single instruction, multiple threads) parallelism paradigm. It uses state-of-the-art GPU architecture to provide parallelism. A GPU contains multiprocessors (blocks of cores) that execute the same instruction in lock-step (this is similar to the SIMD model). However, the amount of shared memory and global memory on the GPU is limited. One of the advantages of CUDA is that a general-purpose programming language is available, rather than having to use pixel and vertex shaders to emulate general-purpose computation.
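As a brief illustration of this execution model (not part of the authors' implementation), the following minimal CUDA kernel shows how every thread of every block runs the same instruction stream on its own array element; the kernel name and the data it processes are chosen only for this example.

#include <cuda_runtime.h>

// Every thread in every block executes this same code; only the computed
// global index differs, which is the essence of the SIMT model.
__global__ void scaleArray(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 16;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch enough 256-thread blocks to cover all n elements.
    const int blockSize = 256;
    const int gridSize = (n + blockSize - 1) / blockSize;
    scaleArray<<<gridSize, blockSize>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}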
The essence of the Lidar localization problem is that the initial coordinate of the Lidar may not be given, only the points that characterize the successive corners of the room. In addition, at each step, the measurements from the Lidar to the room's walls (the Lidar rotates by a certain angle between measurements) and the vector of its movement after the performed actions are specified. In general terms, this algorithm is a successor of Monte Carlo methods [9, 10]. Therefore, the aim of this work is to study and optimize the Particle Filter algorithm using CUDA technologies.

2. Analysis of literary sources

At the moment, there is a significant number of literary sources dedicated to our problem. A review of such sources is an important stage of research, as it allows one to become familiar with the state of scientific research in this area and to determine possible ways of further developing and improving the algorithms. Nowadays, various issues in robotics are increasingly being investigated [11-13]. One of the most interesting problems is determining the current Lidar position. In production and everyday life, this task can be widely used as a model in many fields, such as vehicle route planning, machine learning, and economic efficiency management.

The article [14] is devoted to developing an obstacle detection algorithm based on 2D LiDAR sensor data. The authors proposed a parallelization method using thread-safe collections and an efficient multi-threaded DBSCAN algorithm. The article [15] provides a detailed analysis of a parallelization algorithm for solving the Lidar localization problem using a genetic algorithm. OpenMP parallel computing technology is used to optimize the computing process in order to speed it up and provide real-time results; the computational complexity of the sequential algorithm is also significantly reduced. Many numerical experiments were conducted using the multi-core architecture of modern computers. As a result, it was possible to speed up the computing process by approximately eight times and achieve an efficiency of 0.97. The notable difference in the execution time of the sequential and parallel algorithms allows increasing the number of Lidar measurements and iterations, which is relevant when modeling various robotics tasks. The obtained results can be significantly improved by choosing a computing system with more than eight cores. Mochurad et al. proposed a parallelization method to speed up an obstacle detection algorithm based on 2D LiDAR sensor data while processing big data [16]. The result is an algorithm that finds obstacles and objects with high accuracy and speed (Cluster Homogeneity = 86% and Cluster Completeness = 91%). The authors also investigated the relationship between the values of the selected hyperparameters and the algorithm's efficiency. Articles [17-20] consider the comparison of different parallel programming frameworks, such as MPI, OpenMP, OpenCL, and CUDA.
In particular, the article [17] studied parallel programming on computers with distributed and shared memory. The research is conducted using test cases that show the need for different approaches to parallel programming. The test cases are implemented in Chapel and Julia, and in C supplemented with MPI and OpenMP. It is shown that both languages, Chapel and Julia, represent a viable alternative to Fortran and C/C++ combined with parallel programming frameworks. With their use, the programmer's productivity is significantly improved, and the speed of the programs is not considerably affected. The authors in [21] consider the possibility of autonomous navigation, where the particle filter algorithm was applied to obtain the most accurate position of the car. Here, GPS and Lidar were used for localization. The proposed system provides autonomous control in an unfamiliar environment. It increases localization accuracy by solving the error accumulation problem in an unconstrained environment. The particle filter algorithm showed excellent accuracy, with a trajectory error of about 0.094 cm. The authors of the article [22] show the use of this algorithm for better spatial orientation of a robot engaged in household chores; the robot was equipped with a 2D Lidar and an RGB camera for environmental perception. In [23], a new method of genetic algorithm parallelization for solving the Traveling Salesman Problem (TSP) using CUDA is presented. The solution provides route information, excluding all services autonomous vehicles require, in the cloud implementation of intelligent transportation systems. The authors of the article [24] presented an effective implementation of the extended Kalman filter for the simultaneous localization and mapping algorithm on a multiprocessor architecture. The overall accuracy of the algorithm depends on the number of reference points in the state vector and the matched observations.

After a detailed analysis of the problem considered in our work, it was noted that no one has optimized the Particle Filter algorithm using CUDA technologies.

3. Methods and means of research

In general, the problem of Lidar localization can be described as follows. A vehicle in a room can move in any direction with a certain deviation and take measurements of the surrounding environment. Each subsequent measurement differs in angle from the previous one. It is crucial to determine the Lidar position in the room in a few steps and with optimal accuracy. Articles [25-29] have shown in detail how this algorithm works, where it can be applied, and how it can be improved by increasing the weight of particles by adding Kalman-filtered global navigation satellite system (GNSS) information [30].

Let us consider the algorithm in more detail. The Lidar successively performs two actions: it scans the space around itself and moves in space along a vector (x, y) with a certain error. It is also important whether the initial position of the Lidar is specified. If so, the array of all particles can be filled with these position coordinates. If not, we need to generate a given number of particles placed randomly in the room. This is done using a ray tracing algorithm. The idea of the algorithm is to count the number of intersections of the room walls with a ray emitted from the point horizontally to the right. If the number of intersections is even, the point does not lie in the room; if it is odd, the point is suitable (see Figure 1). A minimal sketch of this test is given after Figure 1.

Figure 1: a) an example of a point that lies in the room: the number of intersections is odd; b) an example of a point that does not lie in the room: the number of intersections is even
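For illustration, the following C++ sketch implements the even-odd ray-casting test described above; the Point and Polygon types and the insideRoom() name are assumptions made for this example rather than the authors' code.

#include <cstddef>
#include <vector>

// Hypothetical types used only for this illustration.
struct Point { double x, y; };
using Polygon = std::vector<Point>;  // room corners listed in order

// Even-odd (ray casting) test: emit a horizontal ray from p to the right and
// count how many walls it crosses; an odd count means p lies inside the room.
bool insideRoom(const Point& p, const Polygon& room) {
    bool inside = false;
    const std::size_t n = room.size();
    for (std::size_t i = 0, j = n - 1; i < n; j = i++) {
        const Point& a = room[j];
        const Point& b = room[i];
        // Wall (a, b) can only be crossed if it straddles the line y = p.y.
        if ((a.y > p.y) != (b.y > p.y)) {
            // x-coordinate where the wall crosses that horizontal line.
            double xCross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y);
            if (xCross > p.x) inside = !inside;  // a crossing to the right toggles parity
        }
    }
    return inside;
}

When no initial position is given, candidate particles can be drawn uniformly from the bounding box of the room and kept only if insideRoom() returns true.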
First, we set the coordinates of the vertices of the room. After that, depending on whether the initial position is specified, we fill the array either with random particles or with particles at the initial position coordinates. The next step is a Lidar scan of the surrounding space (see Figure 2). The scan is performed k times, starting with a ray directed horizontally to the right, with each measurement differing from the previous one by an angle of 360/k degrees. In this way, we obtain an array with the distances from the Lidar to the room's walls for each ray launch angle.

Figure 2: An example of the rays emitted by the Lidar scanning the space around it

Next, we calculate the weights for each particle. The calculation is as follows:
1. Take the first particle from the array.
2. For the current ray, calculate the angle relative to the horizontal and the coordinates of the end of the segment that starts at the particle coordinates and has the ray length taken from the array:
x1 = x + l·cos(α), y1 = y + l·sin(α),
where α is the ray angle and l is the ray length.
3. Calculate the length of the perpendicular dropped from the end of the segment to the nearest room wall.
4. Add the square of the length of the perpendicular to a temporary variable.
5. Repeat steps 2-4 for each ray in the array.
6. Calculate the weight of the particle according to the normal distribution formula (1):
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)), (1)
where μ is the mathematical expectation and σ is the standard deviation, both set by the Lidar parameters.
7. Repeat steps 1-6 for each particle.

The next step is to repopulate the particles based on their weights. This procedure is divided into four stages:
The first stage is to add up all the values of the particle weights; denote the sum by s.
The second stage is to fill a new array of weights with the values NewW[i] = W[i]/s. Thus, we normalize the weights so that their sum is equal to 1.
The third stage is to fill an array of prefix sums prefW[i] = NewW[i] + prefW[i − 1]. In this way, we obtain an array of weight intervals.
The fourth stage is the generation of a random value from 0 to 1. We then determine which interval of the prefix-sum array this number falls into; the index of that interval corresponds to the particle that should be created.

The complexity of the algorithm described above can be estimated as O(n·(k + r)·q), where
• n – the number of Lidar movement iterations;
• k – the number of measurements at every step;
• r – the number of walls in the room;
• q – the number of generated particles.
As a result, the execution time grows very rapidly. One way to optimize this algorithm is to parallelize the processing of individual particles. Most of the time is spent on the calculation of weights and on repopulation, so these stages of the algorithm are parallelized. Thus, the running time of the algorithm should be significantly reduced.

To reduce the implementation time and increase the algorithm's efficiency, we compare the execution time of the algorithm for different numbers of processor threads using OpenMP technology. We also perform time measurements on the graphics processor using CUDA technology [31]. For the OpenMP version, the #pragma omp parallel directive was used. It tells the compiler that the code below is executed by several threads and provides safe thread processing if the threads exchange information. The schedule(type[, chunk]) clause was also used, which specifies how the loop iterations are distributed between threads. Synchronization in this case is implicit and is performed in two situations:
1. At the end of the parallel region. OpenMP relies on the fork-join model. When the program starts, one thread (the main thread) is created. When we create a parallel section using #pragma omp parallel, several threads are created (fork). These threads work in parallel and, at the end of the parallel section, are destroyed (join). So, at the end of the parallel section, we have synchronization and know precisely the status of all threads (they have finished their work).
2. There is an implicit barrier at the end of some OpenMP constructs, such as #pragma omp for. A thread can only continue its work once all the threads reach the barrier. It is essential to know exactly what work the different threads have done.
A compact sketch of the parallelized weight-update loop is given below.
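The following C++ sketch shows one possible form of such a parallelized weight update (steps 1-7 and formula (1)) using #pragma omp parallel for with a schedule clause; the Particle structure, the distanceToNearestWall() helper, and the parameter names are assumptions made for this illustration, not the authors' exact code.

#include <cmath>
#include <vector>
#include <omp.h>

struct Particle { double x, y, weight; };

// Placeholder for step 3: the perpendicular distance from a point to the
// nearest room wall; the real computation depends on the wall representation.
static double distanceToNearestWall(double x, double y) {
    (void)x; (void)y;
    return 0.0;  // stub for this sketch only
}

// Weight update: each particle is processed independently, so the outer loop
// is shared among threads; schedule(static) splits it into equal chunks.
void calcWeights(std::vector<Particle>& particles,
                 const std::vector<double>& rays,  // measured distances, one per angle
                 double mu, double sigma) {
    const double PI = 3.14159265358979323846;
    const int k = static_cast<int>(rays.size());
    const int q = static_cast<int>(particles.size());

    #pragma omp parallel for schedule(static)
    for (int p = 0; p < q; ++p) {
        double err = 0.0;
        for (int i = 0; i < k; ++i) {
            double alpha = 2.0 * PI * i / k;                        // ray angle
            double x1 = particles[p].x + rays[i] * std::cos(alpha); // segment end
            double y1 = particles[p].y + rays[i] * std::sin(alpha);
            double d = distanceToNearestWall(x1, y1);               // step 3
            err += d * d;                                           // step 4
        }
        // Step 6: weight from the normal density, formula (1).
        particles[p].weight =
            std::exp(-(err - mu) * (err - mu) / (2.0 * sigma * sigma)) /
            (sigma * std::sqrt(2.0 * PI));
    }
}

Because each particle's weight depends only on its own state and on the shared read-only measurements, no explicit locking is needed, and the implicit barrier at the end of the loop guarantees that all weights are ready before resampling starts.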
For CUDA technology, the weight calculation function is defined as __global__, which indicates that this piece of code is executed on the graphics processor. Additionally, the calcweights() kernel calls auxiliary functions defined as __device__, which means that only the graphics processor can call and execute them. The experiments are conducted on a graphics card with four multiprocessors, each of which can execute 512 threads. The block size is determined with the cudaOccupancyMaxPotentialBlockSize() function, which heuristically calculates the block size that reaches maximum occupancy. A group of threads is called a CUDA block. Each CUDA block is executed by one multiprocessor (MP) and cannot be transferred to other multiprocessors of the GPU. One multiprocessor can execute several CUDA blocks simultaneously, depending on the resources the blocks require. Each kernel is launched on one device, and CUDA supports several concurrent kernels on one device. In this way, we call the calcweights() function, passing gridSize and blockSize as launch parameters. Here blockSize is the number of threads per block determined by the cudaOccupancyMaxPotentialBlockSize() function, and gridSize is the number of blocks, calculated by formula (2):

gridSize = (N + blockSize − 1) / blockSize, (2)

where N is the size of the array (in our case, the number of particles we generate). An illustrative sketch of this launch configuration is given below.
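The sketch below illustrates how such a kernel can be configured with cudaOccupancyMaxPotentialBlockSize() and launched with the grid size from formula (2); the flat arrays of particle coordinates, the stubbed distanceToNearestWall() device helper, and the launchCalcWeights() wrapper are assumptions made for this illustration rather than the authors' implementation.

#include <cuda_runtime.h>
#include <math.h>

#define PI 3.14159265358979323846

// Placeholder device helper (step 3): perpendicular distance to the nearest wall.
__device__ double distanceToNearestWall(double x, double y) {
    return 0.0;  // stub for this sketch only
}

// One thread per particle: accumulate squared ray errors and convert them into
// a weight using the normal density, formula (1).
__global__ void calcweights(const double* px, const double* py, double* weight,
                            const double* rays, int k, int numParticles,
                            double mu, double sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    double err = 0.0;
    for (int r = 0; r < k; ++r) {
        double alpha = 2.0 * PI * r / k;
        double x1 = px[i] + rays[r] * cos(alpha);
        double y1 = py[i] + rays[r] * sin(alpha);
        double d = distanceToNearestWall(x1, y1);
        err += d * d;
    }
    weight[i] = exp(-(err - mu) * (err - mu) / (2.0 * sigma * sigma)) /
                (sigma * sqrt(2.0 * PI));
}

// Host-side launch: block size from the occupancy heuristic, grid size from (2).
void launchCalcWeights(const double* d_px, const double* d_py, double* d_weight,
                       const double* d_rays, int k, int numParticles,
                       double mu, double sigma) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, calcweights);

    int gridSize = (numParticles + blockSize - 1) / blockSize;  // formula (2)
    calcweights<<<gridSize, blockSize>>>(d_px, d_py, d_weight, d_rays,
                                         k, numParticles, mu, sigma);
    cudaDeviceSynchronize();
}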
4. Numerical experiments

This section presents tables and figures showing the program's execution time, speedup, and efficiency for different numbers of threads. The input was taken from Algotester [32]. The program was tested on four different input variations [33], with varying room shapes, numbers of walls, Lidar movement iterations, and numbers of measurements at every step. The number of particles q = 3000 was selected for the most accurate localization.

Table 1 presents the execution time of the sequential algorithm and of the proposed parallel algorithm based on OpenMP technology for different numbers of threads (two, four, and eight), together with the speedup and efficiency of the parallelization. Here k/n/m denotes the input data described above, ts is the execution time of the sequential algorithm, tp1 is the execution time of the parallel algorithm based on OpenMP technology, S = ts/tp1 is the speedup, and E = S/p is the efficiency, where p is the number of threads. Table 2 presents the corresponding results obtained with CUDA technology on the graphics processor; here tp2 is the execution time of the parallel algorithm based on CUDA technology.

Table 1
Results of program execution based on OpenMP technology

                          2 threads                4 threads                8 threads
k/n/m       ts, ms        tp1, ms   S      E       tp1, ms   S     E       tp1, ms  S      E
36/6/4      273.9         139.4     1.96   0.98    106.3     2.57  0.64    90.8     3.021  0.37
36/15/8     960.6         491.3     1.95   0.97    327.1     2.93  0.74    293.2    3.28   0.40
100/15/8    2102.4        1055.4    1.99   0.99    735.2     2.85  0.71    541.8    3.88   0.48
180/29/27   29414.8       15040.4   1.95   0.97    8231.9    3.57  0.89    5245     5.60   0.70

Analyzing the results of Table 1, we see that with two threads the speedup is close to two and the efficiency approaches one. Since the experiments were conducted on a four-core processor, these results are reliable. The speedup with four threads was approximately 3.5, with an efficiency of 0.89. When using eight threads, it was possible to improve the speedup to about 5.6; however, the efficiency decreased significantly, to 0.7. This is because in the last two cases the number of threads reaches or exceeds the number of physical cores of the multi-core computer system. The results can be significantly improved by choosing computers with more cores; still, the results should show the same trend between the number of cores and the number of threads.

Table 2
Results of program execution based on CUDA technology

                          Graphics processor
k/n/m       ts, ms        tp2, ms      S
36/6/4      273.9         63.1         4.340729
36/15/8     960.6         94.8         10.13291
100/15/8    2102.4        113.2        18.57244
180/29/27   29414.8       1485.6       19.79995

In Table 2, the speedup is close to 20. Note that the efficiency is not calculated for the CUDA version. Thus, the proposed parallel algorithm based on CUDA technology improved the speedup about four times compared to OpenMP technology. For visual analysis, the results of Table 1 and Table 2 are presented in Figure 3 and in the diagrams in Figure 4 and Figure 5. To evaluate the results objectively, one needs to take into account the capabilities of the computer on which the experiments were conducted. In our case, the program was run on a computer with the characteristics specified in Table 3.

Table 3
The configuration of the computer on which the numerical experiments were performed

Processor                   Intel Core i5-8250
Base processor frequency    1.8 GHz
RAM                         32 GB
Cores                       4
Logical processors          8
System type                 64-bit
Video card                  Discrete NVIDIA GeForce 940MX

Figure 3: Execution time of the sequential algorithm (1), the parallel algorithm based on OpenMP technology for different numbers of threads (2, 4, and 8), and the parallel algorithm based on CUDA technology (GPU)

Figure 4: Speedup obtained with OpenMP technology for different numbers of threads (2, 4, and 8) and with CUDA technology (GPU)

Figure 5: Efficiency obtained with OpenMP technology for different numbers of threads (2, 4, and 8) and with CUDA technology (GPU)

The OpenMP library was used to implement the parallel algorithm for determining the current Lidar position, and CUDA technology was used to execute a specific part of the code on the graphics processor. After analyzing the results, the following conclusions can be made. For small amounts of input data, the execution times do not differ significantly. The speedup becomes more noticeable for large input data. Looking at the results, it is evident that the execution time of the sequential algorithm and the time obtained by parallelizing the algorithm differ substantially.
As a result, an efficiency of ~0.99 was achieved (when accelerating the algorithm with two threads) and a speedup of ~19.8 (when executing the code on the graphics processor). The error in determining the position was 0.001. From the results, we can see that the speed of the algorithm parallelized on the GPU is higher than that of the usual sequential one. As the parametric complexity of the problem increased, the difference in execution time grew rapidly, due to the reduction of the total execution time of the algorithm (taking into account the synchronization time) and the parallel use of the CUDA cores for computational operations, as expected.

5. Conclusions

Determining the current Lidar position is a task that requires taking many factors into account and, therefore, requires a lot of time to execute the algorithm. In our work, this problem was addressed by parallelizing the calculations on a graphics processor using CUDA technology, thanks to which we consistently obtained significantly better execution times (as the computational load increased) than with the sequential implementation of the algorithm on a central processor.

Thus, this paper proposes and investigates a parallel algorithm for determining the current Lidar position using CUDA technology; the results obtained with OpenMP technology are also compared. With OpenMP, it was possible to speed up the process about two times and achieve an efficiency of 0.99; the algorithm was also accelerated almost four times, with an efficiency of 0.98. The speedup reached approximately 19.8 when using CUDA technology for parallel programming. Better results can be achieved in the future based on the proposed parallel algorithm using more powerful computing systems. A prospect for further development of this study is the application of the proposed algorithm to the case of 3D Lidar [34, 35].

6. References

[1] M. U. Khan, S. A. A. Zaidi, A. Ishtiaq, S. U. R. Bukhari, S. Samer, A. Farman. "A Comparative Survey of LiDAR-SLAM and LiDAR based Sensor Technologies". 2021 Mohammad Ali Jinnah University International Conference on Computing (MAJICC), Karachi, Pakistan, 2021, pp. 1-8, doi: 10.1109/MAJICC53071.2021.9526266.
[2] S. Chen, B. Zhou, C. Jiang, W. Xue, Q. Li. "A LiDAR/Visual SLAM Backend with Loop Closure Detection and Graph Optimization". Remote Sensing, 2021, 13(14):2720, doi: 10.3390/rs13142720.
[3] Wolfgang Hess, Damon Kohler, Holger Rapp, Daniel Andor. "Real-Time Loop Closure in 2D LIDAR SLAM". 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1271-1278.
[4] I. Oleksiv, H. Lema, V. Kharchuk, T. Lisovych, O. Dluhopolskyi, T. Dluhopolska. "Identification of Stakeholders Importance for the Company's Social Responsibility using the Analytic Hierarchy Process". 10th International Conference on Advanced Computer Information Technologies (ACIT), 2020, pp. 573-576.
[5] J. Elfring, E. Torta, R. van de Molengraft. "Particle Filters: A Hands-On Tutorial". Sensors, 2021, 21, 438, doi: 10.3390/s21020438.
[6] Jin Yang, Xuerong Cui, Juan Li, Shibao Li, Jianhang Liu, Haihua Chen. "Particle filter algorithm optimized by genetic algorithm combined with particle swarm optimization". Procedia Computer Science, 187, 2021, pp. 206-211.
[7] L. Mochurad, H. Lema, R. Vilhutska. "Parallel Algorithms Assessment Usage of Image's Segmentation Quality in Medicine". CEUR Workshop Proceedings, 2022, 3171, pp. 1509-1519.
[8] R. S. Dehal, C. Munjal, A. A. Ansari, A. S. Kushwaha. "GPU Computing Revolution: CUDA". 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 2018, pp. 197-201, doi: 10.1109/ICACCCN.2018.8748495.
[9] G. Xie. "A novel Monte Carlo simulation procedure for modelling COVID-19 spread over time". Sci Rep 10, 13120, 2020, doi: 10.1038/s41598-020-70091-1.
[10] P. Andreo. "Monte Carlo simulations in radiotherapy dosimetry". Radiat Oncol 13, 121, 2018, doi: 10.1186/s13014-018-1065-3.
[11] M. B. Alatise, G. P. Hancke. "A Review on Challenges of Autonomous Mobile Robot and Sensor Fusion Methods". IEEE Access, 2020, Volume 8, pp. 39830-39846, doi: 10.1109/ACCESS.2020.2975643.
[12] K. Groves, E. Hernandez, A. West, T. Wright, B. Lennox. "Robotic Exploration of an Unknown Nuclear Environment Using Radiation Informed Autonomous Navigation". Robotics, 2021, 10(2), 78, doi: 10.3390/robotics10020078.
[13] C. Pang, X. Zhong, H. Hu, J. Tian, X. Peng, J. Zeng. "Adaptive Obstacle Detection for Mobile Robots in Urban Environments Using Downward-Looking 2D LiDAR". Sensors, 2018, 18, 1749, doi: 10.3390/s18061749.
[14] L. Mochurad, Y. Hladun, R. Tkachenko. "An Obstacle-Finding Approach for Autonomous Mobile Robots Using 2D LiDAR Data". Big Data Cogn. Comput. 2023, 7, 43, doi: 10.3390/bdcc7010043.
[15] L. Mochurad, N. Kryvinska. "Parallelization of Finding the Current Coordinates of the Lidar Based on the Genetic Algorithm and OpenMP Technology". Symmetry, 2021, 13, 666, doi: 10.3390/sym13040666.
[16] L. Mochurad, Y. Hladun, R. Tkachenko. "An Obstacle-Finding Approach for Autonomous Mobile Robots Using 2D LiDAR Data". Big Data and Cognitive Computing, 2023, 7(1):43, doi: 10.3390/bdcc7010043.
[17] R. Novosel, B. Slivnik. "Beyond Classical Parallel Programming Frameworks: Chapel vs Julia". 8th Symposium on Languages, Applications and Technologies (SLATE 2019), 2019, doi: 10.4230/OASIcs.SLATE.2019.12.
[18] A. Shterenlikht, L. Cebamanos. "MPI vs Fortran coarrays beyond 100k cores: 3D cellular automata". Parallel Computing, Volume 84, May 2019, pp. 37-49, doi: 10.1016/j.parco.2019.03.002.
[19] Baldomero Imbernón, Javier Prades, Domingo Giménez, José M. Cecilia, Federico Silla. "Enhancing large-scale docking simulation on heterogeneous systems: An MPI vs rCUDA study". Future Generation Computer Systems, Volume 79, Part 1, 2018, pp. 26-37, doi: 10.1016/j.future.2017.08.050.
[20] M. Breyer, A. Van Craen, D. Pflüger. "A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware". International Workshop on OpenCL (IWOCL'22), Association for Computing Machinery, New York, NY, USA, 2022, Article 2, pp. 1-12, doi: 10.1145/3529538.3529980.
[21] J. Patoliya, H. Mewada, M. Hassaballah, M. A. Khan, S. Kadry. "A robust autonomous navigation and mapping system based on GPS and LiDAR data for unconstraint environment". Earth Science Informatics, vol. 15, no. 4, pp. 2703-2715, 2022, doi: 10.1007/s12145-022-00791-x.
[22] L. Zhi-jie, C. Ming-yu, Z. Zhi-ming, Y. You-ling. "Design and Implementation of Home Service Robot". 2020 Chinese Automation Congress (CAC), Shanghai, China, 2020, pp. 3541-3546, doi: 10.1109/CAC51589.2020.9327151.
[23] M. Abbasi, M. Rafiee, M. R. Khosravi et al. "An efficient parallel genetic algorithm solution for vehicle routing problem in cloud implementation of the intelligent transportation systems". J Cloud Comp 9, 6, 2020, doi: 10.1186/s13677-020-0157-4.
[24] N. B. Amor, M. Baklouti, K. Barhoumi, M. Jallouli. "Efficient embedded software implementation of a low cost robot localization system". In Proceedings of the 2019 IEEE International Conference on Design & Test of Integrated Micro & Nano-Systems (DTS), Gammarth, Tunisia, 28 April - 1 May 2019, pp. 1-5.
[25] KyungJae Ahn, Yeonsik Kang. "A Particle Filter Localization Method Using 2D Laser Sensor Measurements and Road Features for Autonomous Vehicle". Journal of Advanced Transportation, vol. 2019, Article ID 3680181, 11 pages, 2019, doi: 10.1155/2019/3680181.
[26] D. Talwar, S. Jung. "Particle Filter-based Localization of a Mobile Robot by Using a Single Lidar Sensor under SLAM in ROS Environment". 2019 19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Korea (South), 2019, pp. 1112-1115, doi: 10.23919/ICCAS47443.2019.8971555.
[27] M. Á. de Miguel, F. García, J. M. Armingol. "Improved LiDAR Probabilistic Localization for Autonomous Vehicles Using GNSS". Sensors, 2020, 20, 3145, doi: 10.3390/s20113145.
[28] Li Lin, Rong Wei, Su Fei, Xing Xiaoyu. "An improved detection method based on morphology and profile analysis for bridge extraction from Lidar". Optics & Laser Technology, Volume 121, January 2020, 105790, doi: 10.1016/j.optlastec.2019.105790.
[29] Xieyuanli Chen, Ignacio Vizzo, Thomas Labe, Jens Behley, Cyrill Stachniss. "Range Image-based LiDAR Localization for Autonomous Vehicles". Computer Science, Robotics, 2021, doi: 10.48550/arXiv.2105.12121.
[30] Y. Kim, H. Bang. "Introduction to Kalman Filter and Its Applications". Introduction and Implementations of the Kalman Filter, May 2019, doi: 10.5772/intechopen.80600.
[31] Martin Heller. "What is CUDA? Parallel programming for GPUs", 2018. https://www.infoworld.com/article/3299703/what-is-cuda-parallel-programming-for-gpus.html.
[32] "Lidar robot localization 1". https://algotester.com/uk/ContestProblem/DisplayWithEditor/83112 (accessed March 12, 2023).
[33] "Data Lidar". https://www.researchgate.net/publications/create?publicationType=dataset (accessed March 12, 2023).
[34] Y. Liu, C. Wang, H. Wu, Y. Wei, M. Ren, C. Zhao. "Improved LiDAR Localization Method for Mobile Robots Based on Multi-Sensing". Remote Sens. 2022, 14, 6133, doi: 10.3390/rs14236133.
[35] J. Gómez, O. Aycard, J. Baber. "Efficient Detection and Tracking of Human Using 3D LiDAR Sensor". Sensors (Basel), 2023, 23(10):4720, doi: 10.3390/s23104720. PMCID: PMC10222621.