<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Monocular Mapping and Localization of Urban Road Scenes Based on Parameterized Semantic Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fengsheng Ding</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinchun Ji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dongyan Wei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingyu Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hong Yuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aerospace Information Research Institute, Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.9 Dengzhuang South Road, Haidian District, Beijing 100094</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shanghai Astronomical Observatory, Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.80 Nandan Road, Shanghai 200030</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.19A Yuquan Road, Shijingshan District, Beijing 100049</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic maps play an essential role in unmanned tasks with limited resources in urban road scenes, especially for precise localization in GNSS-blocked areas. With the advantages of low cost, rich information acquisition and a wide depth range, monocular cameras have received much attention in semantic mapping research. Most current monocular mapping methods represent semantic features as dense point clouds, which lack compactness in feature representation, strain the storage and computational resources of common autonomous vehicles, and omit multidimensional scene elements. Based on a low-cost monocular camera, this paper proposes a parametric semantic mapping algorithm for multidimensional features in road scenes, together with a semantic-map-based monocular matching localization algorithm. Experimental results on the KAIST Urban Dataset show that the root mean square error of localization in x and y is about 0.3 m, while the map size is compressed to 8.8 kB/km. The map is thus lightweight while meeting the positioning accuracy requirements of autonomous vehicle tasks in urban road environments.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic mapping</kwd>
        <kwd>parameterized features</kwd>
        <kwd>map-based localization</kwd>
        <kwd>autonomous vehicles</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recently, autonomous vehicle (AV) based services such as Robo-taxi and autonomous valet parking
(AVP) have been developing rapidly, and self-localization is a primary component of them.
High-definition maps (HD maps) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and dense point cloud maps [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] can achieve centimeter-level positioning
for these services, but the resources required to build, update and use them prevent their wide
adoption in AVs. Benefiting from the richness, long-term persistence and insensitivity to viewpoint, lighting
and weather conditions of semantic features in urban road scenes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], semantic maps can help
achieve accurate localization in GNSS-blocked areas while keeping the map lightweight, which is
friendly to the storage and computing resources of consumer-grade vehicles and therefore
conducive to wide deployment. Crowdsourced mapping at large urban scale is an
efficient way to build semantic maps: it can extend the mapping area and provide map updates at
any time through local semantic mapping on consumer-grade vehicles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        A monocular camera is a low-cost sensor that can be deployed on consumer-grade vehicles rapidly
and widely, offering rich information acquisition and a wide depth range for
local semantic mapping [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, in monocular semantic mapping for localization
tasks, semantic features are still stored in the map as dense point clouds, and point
cloud stitching remains a necessary procedure [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The first disadvantage of these methods is the
pressure they place on storage and computational resources. The second is that the compactness of the
feature representation needs to be improved; blurred feature edges are the most direct manifestation,
and they make map-based localization face higher uncertainty. Meanwhile, although many studies use
various semantic features in maps for monocular localization [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7-10</xref>
        ], most current mapping
studies are confined to single or partial features on the road surface, which limits the constraint
capability available during localization. For example, Zhou et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Cheng et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
investigated different feature representations in maps, such as deep keypoints and partial
parameterization, but with applicability limitations and without reconstructing some road
semantics because of reconstruction difficulty. Wen et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] reconstruct roadside poles but neglect
road surface features. Mapping with multidimensional semantic features in urban road scenarios
still needs to be addressed.
      </p>
      <p>To address the above problems, we propose a novel monocular semantic mapping method for
urban road scenes: multidimensional semantic mapping based on parameterized features. The
proposed monocular mapping system can run on consumer-grade vehicles and focuses on road
markings and spatial pole features. The contributions of this paper are as follows:
• A method for parametrically characterizing the semantics of urban road scenes, whose
feature representation achieves stronger compactness while keeping the map lightweight.
• A monocular mapping method that minimizes the semantic observation error to
construct parametric semantic maps with multidimensional features.
• A localization verification system based on the constructed semantic maps that evaluates the
accuracy and usability of the maps.</p>
      <p>The paper is organized as follows: Section 2 presents the system flowchart and a brief overview
of the monocular mapping and localization algorithms; Section 3 introduces the parametric
characterization and mapping algorithms; Section 4 presents an experimental validation of the proposed
method; Section 5 summarizes the study and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System overview</title>
      <p>The system flowchart of the proposed algorithm is shown in Figure 1 and consists of two parts:
monocular mapping and localization verification. Monocular mapping comprises three modules:
feature extraction and preprocessing, local reconstruction, and global mapping. Semantic features are
first extracted and preprocessed; the local reconstruction module then performs feature parameter
initialization, 2D association and optimal recovery of spatial location. Localization verification
implements a map-matching positioning system for map evaluation.</p>
      <p>Figure 1: System flowchart of the proposed algorithm, comprising monocular mapping (feature
extraction and preprocessing, local reconstruction, global mapping) and localization verification.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Feature extraction and preprocessing</title>
        <p>
          The semantic features are divided into two categories: road surface features and roadside pole
features, extracted by a semantic segmentation network and an object detection network,
respectively. DeepLabv3+ [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] segments the image at the pixel level for road surface features. We segment seven semantic
classes on road surfaces: slowdowns, arrows, crosswalks, road lines (both solid and broken), stop
lines, numbers, and texts. Sem-LSD [15] is used to detect poles in the image. Sem-LSD encodes
high-level semantic information, which makes matching association more robust in complex urban
environments. Sem-LSD detects a pole as a bounding box, which is vertical and 2 to 3 pixels wide
in the image; its centerline is used to represent the pole.
        </p>
        <p>Due to the limited generalization ability of the models, false detections may occur during
feature extraction. We post-process the results based on their geometric attributes to reduce the
impact of segmentation errors on the mapping accuracy.</p>
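        <p>As an illustration of such a geometric post-processing gate, the sketch below keeps only thin,
strongly vertical pole boxes; the thresholds and names are illustrative assumptions, not values from
the paper.</p>
        <preformat>
# A minimal sketch of geometric post-filtering on detected pole boxes: keep
# only thin, strongly vertical boxes. Thresholds are illustrative assumptions.
def filter_pole_boxes(boxes, min_aspect=5.0, max_width_px=8):
    """boxes: iterable of (x, y, w, h) pixel boxes from the detector."""
    kept = []
    for (x, y, w, h) in boxes:
        thin = max_width_px >= w                   # poles are only a few pixels wide
        vertical = h >= min_aspect * max(w, 1e-6)  # much taller than wide
        if thin and vertical:
            kept.append((x, y, w, h))
    return kept
        </preformat>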
        <p>Figure 2: Parameterization of road features. (a), (b) Minimum enclosing box (centroid pc, size
(w, h), yaw angle θ) of road features 1; (c), (d) longitudinal attributes of road features 2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature parameterization</title>
        <p>We classify the extracted semantics into three categories based on their shape and geometric
properties:
• Road features 1 (slowdowns and broken lines).
• Road features 2 (stop lines and crosswalks).
• Pole features (poles).</p>
        <p>The road lines obtained by semantic segmentation are divided into broken lines and solid lines
according to length, and only the former are used. Next, the semantics with different priors are
parameterized separately.</p>
        <p>Road features 1: Slowdowns and broken lines can be observed in their entirety, so the
parameterization can represent the whole shape. As shown in Figure 2(a) and (b), the minimum
enclosing box is first fitted to the feature point cloud in the XOY plane. Then the centroid
$p_c = (x_c, y_c, 0)$, the size $(w, h)$, and the yaw angle $\theta$ of the box are estimated in the
local vehicle frame. In summary, road features 1 are represented in the local vehicle frame as</p>
        <p>$$P_1 = [x_c, y_c, \theta, w, h] \qquad (1)$$
together with the local pose $T \in SE(3)$ stored in the map.</p>
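        <p>As an illustrative sketch of this box fitting, the snippet below recovers $P_1$ from an IPM point
cloud using OpenCV's minimum-area rectangle; the function name and the use of cv2.minAreaRect are
our assumptions, not the paper's implementation.</p>
        <preformat>
# A minimal sketch of fitting the P1 parameterization of road features 1 from
# an IPM point cloud (Eq. (1)); names are illustrative.
import numpy as np
import cv2

def parameterize_road_feature1(points_xy):
    """points_xy: (N, 2) feature points in the vehicle-frame XOY plane."""
    (x_c, y_c), (w, h), angle_deg = cv2.minAreaRect(points_xy.astype(np.float32))
    theta = np.deg2rad(angle_deg)             # yaw of the minimum enclosing box
    return np.array([x_c, y_c, theta, w, h])  # P1 = [x_c, y_c, theta, w, h]
        </preformat>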
        <p>Road features 2: The lateral extent of stop lines and crosswalks, which generally appear at
road junctions, is vast. The full view is usually unavailable due to the limited camera view and
vehicle occlusion. Therefore, only the longitudinal attributes are stored as map elements, as shown
in Figure 2(c) and (d). The width $h$, centroid $p_c = (x_c, \cdot, 0)$ and yaw angle $\theta$ of the
feature in the longitudinal direction are first calculated from the feature point cloud in the XOY
plane. As a result, the properties of road features 2 are recorded using three parameters, denoted as</p>
        <p>$$P_2 = [x_c, \theta, h] \qquad (2)$$
together with the local pose $T$.</p>
        <p>Pole features: Considering the computational unfriendliness of the orthogonal representation, we
use Plücker coordinates to represent the spatial line of a pole [16], as shown in Figure 3(a).
For a pole line $\mathcal{L}$ in the camera frame, its Plücker coordinates are</p>
        <p>$$\mathcal{L} = [\mathbf{n}^T, \mathbf{v}^T]^T, \quad \mathbf{n}^T \mathbf{v} = 0 \qquad (3)$$
where $\mathbf{n}$ is the normal of the plane formed by the pole and the origin, and $\mathbf{v}$ is the
direction vector of the pole. The distance from the pole to the camera's optical center can be
calculated from the Plücker coordinates as $d = \|\mathbf{n}\| / \|\mathbf{v}\|$. Spatial poles carry a
geometric prior: they are perpendicular to the XOZ plane of the camera frame, so the Plücker
coordinates can be expressed in a simpler way. The direction vector simplifies to
$\mathbf{v} = (0, 1, 0)^T$, and the normal vector becomes</p>
        <p>$$\mathbf{n} = d\,(\cos\theta, 0, \sin\theta)^T, \qquad \mathcal{L} = [\mathbf{n}^T, \mathbf{v}^T]^T = (d\cos\theta, 0, d\sin\theta, 0, 1, 0)^T \qquad (4)$$
Therefore, the parameters</p>
        <p>$$P_3 = [d, \theta] \qquad (5)$$
are used to represent the pole in the local camera frame, combined with the local pose to represent
poles in the map.</p>
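        <p>The round-trip between $P_3 = [d, \theta]$ and the simplified Plücker coordinates of Eqs. (3)-(5)
can be sketched as follows; function names are illustrative assumptions.</p>
        <preformat>
# A minimal sketch of the pole parameterization round-trip of Eqs. (3)-(5).
import numpy as np

def pole_to_plucker(d, theta):
    """[d, theta] -> Plücker (n, v) for a vertical pole (Eq. (4))."""
    v = np.array([0.0, 1.0, 0.0])                          # vertical direction prior
    n = d * np.array([np.cos(theta), 0.0, np.sin(theta)])  # plane normal
    return n, v

def plucker_to_pole(n, v):
    """Plücker (n, v) -> [d, theta], using d = ||n|| / ||v||."""
    d = np.linalg.norm(n) / np.linalg.norm(v)
    theta = np.arctan2(n[2], n[0])
    return d, theta
        </preformat>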
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Temporal association and initialization</title>
        <p>The reconstruction of each 2D semantic feature involves three steps: temporal association,
initialization, and local optimization. Temporal association establishes correspondences of each
semantic feature across the image sequence. Initialization transforms the feature from 2D to 3D
space and provides the initial values for parameterization and local reconstruction. Pole features
first undergo 2D-2D temporal association and are then initialized by multi-view reconstruction. In
contrast, road features can obtain rough world coordinates by inverse perspective mapping (IPM) [17],
so they can be initialized directly.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.3.1. Initialization</title>
        <p>Initialization obtains the initial values of the parameters in 3D space for each observed 2D
feature.</p>
        <p>Road surface features: These include road features 1 and 2. The best reference frame is selected
from the observation sequence, and the point cloud of the semantic feature in the vehicle frame is
obtained by IPM, as shown in Figure 4. Assuming that the ground in front of the vehicle is a plane,</p>
        <p>$$p = K T_{cv} P_v = M P_v \qquad (6)$$
where $p = (u, v)$ and $P_v = (X_v, Y_v, 0, 1)$ are the corresponding point in the image and in 3D
space, $K$ is the camera intrinsic matrix, $T_{cv} = [R \mid t]$ is the camera extrinsic matrix with
respect to the vehicle's center, and $M = K T_{cv}$ is the projection matrix from the vehicle frame to
the image plane. After the inverse projection, every pixel of each semantic feature is mapped to its
3D spatial location, yielding a dense point cloud representation of the semantic feature in the local
vehicle frame. The initial values of the feature parameters are then obtained from this dense point
cloud as described in Section 3.2.</p>
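        <p>A minimal sketch of this flat-ground back-projection follows; R_vc and t_vc (camera-to-vehicle
rotation and camera center) are illustrative names for the extrinsics that Eq. (6) writes as $T_{cv}$.</p>
        <preformat>
# A minimal sketch of IPM back-projection under the flat-ground assumption of
# Eq. (6); extrinsic naming is an assumption for illustration.
import numpy as np

def ipm_backproject(u, v, K, R_vc, t_vc):
    ray_c = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray, camera frame
    ray_v = R_vc @ ray_c                              # rotate ray into vehicle frame
    s = -t_vc[2] / ray_v[2]                           # intersect plane Z_v = 0
    P_v = t_vc + s * ray_v
    return P_v[:2]                                    # (X_v, Y_v) on the road plane
        </preformat>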
        <p>Figure 3: Parametric representation of the pole feature. (a) Plücker coordinates of pole lines in
the camera frame; (b) parameterization initialization of poles; (c) reprojection error of the pole
feature.</p>
        <p>Figure 4: Inverse perspective mapping between the camera frame, the vehicle frame, and the road
plane.</p>
        <p>Pole features: Limited by the working principle of the monocular camera, the spatial position of a
pole cannot be recovered from a single frame. Two suitable reference frames are therefore selected
from the observation sequence for initialization, and the initial values of the pole parameters in the
local camera frame are obtained by computing the intersection line of two planes. As shown in
Figure 3(b), the pole is observed as $z^{C_i}$ and $z^{C_j}$. Using the reference frame $C_i$ as the
local camera frame, each observation together with the corresponding camera center
$O = (x_0, y_0, z_0)^T$ determines a plane $\pi = [\pi_x, \pi_y, \pi_z, \pi_w]^T$:</p>
        <p>$$\pi_x (x - x_0) + \pi_y (y - y_0) + \pi_z (z - z_0) = 0 \qquad (7)$$
$$\pi_w = -(\pi_x x_0 + \pi_y y_0 + \pi_z z_0) \qquad (8)$$</p>
        <p>The initial values of the pole feature are obtained by intersecting the planes $\pi^{C_i}$ and
$\pi^{C_j}$ through the dual Plücker matrix:</p>
        <p>$$L^* = \pi^{C_i} (\pi^{C_j})^T - \pi^{C_j} (\pi^{C_i})^T \in \mathbb{R}^{4 \times 4} \qquad (9)$$
The $\mathbf{v}$ and $d$ recovered from $L^*$ are used as the initial values of the pole parameters.</p>
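        <p>The two-plane intersection can be sketched as below; the way $\mathbf{v}$ and $\mathbf{n}$ are read
from $L^*$ (direction in the skew-symmetric block, moment in the last column, up to sign and scale)
follows the standard dual Plücker convention and is an assumption here.</p>
        <preformat>
# A minimal sketch of the two-plane pole initialization around Eq. (9).
import numpy as np

def plane_through(point, normal):
    """Plane [n, w] with n.x + w = 0 passing through `point`."""
    return np.hstack([normal, -normal @ point])

def init_pole(pi_i, pi_j):
    L = np.outer(pi_i, pi_j) - np.outer(pi_j, pi_i)  # 4x4 dual Plücker matrix
    v = np.array([L[2, 1], L[0, 2], L[1, 0]])        # line direction (skew block)
    n = L[:3, 3]                                     # moment / normal vector
    d = np.linalg.norm(n) / np.linalg.norm(v)        # distance to camera center
    theta = np.arctan2(n[2], n[0])                   # yaw of the normal (Eq. (4))
    return d, theta
        </preformat>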
      </sec>
      <sec id="sec-3-5">
        <title>3.3.2. Temporal association</title>
        <p>Temporal association establishes 2D correspondences of the same feature under different
observations. We consider different schemes for the two types of semantics.</p>
        <p>Road surface features: Semantic point clouds of road features in the local frame are obtained by
IPM and converted to global coordinates using the reference pose. The same feature observed at
different moments lies at approximately the same position in the world frame, so features are
associated across time by comparing distances in the world frame. For slowdowns, stop lines and
crosswalks, which occur infrequently and have independent signatures over a short period, temporal
association is accomplished by Euclidean distance discrimination in the global frame. Broken lines,
in contrast, appear frequently and lie close to each other in multi-lane environments, causing high
confusion. Therefore, all broken-line point clouds in the global frame are first segmented into
broken-line instances by the DBSCAN clustering algorithm [18]. Then geometric and spatial
constraints, including angle, length and overlap rate, determine whether an association can be
established between each 2D observed broken line and the segmented instances.</p>
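        <p>A minimal sketch of the DBSCAN instance-segmentation step is given below; the parameter values
are illustrative, not taken from the paper.</p>
        <preformat>
# A minimal sketch of broken-line instance segmentation with DBSCAN;
# eps / min_samples are illustrative values.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_broken_lines(points_world, eps=0.5, min_samples=10):
    """points_world: (N, 3) accumulated broken-line points in the world frame."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_world[:, :2])
    # Group points per instance, discarding DBSCAN noise (label -1).
    return {k: points_world[labels == k] for k in set(labels) if k != -1}
        </preformat>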
        <p>Pole features: We use the classical object tracking algorithm SORT (Simple Online and Realtime
Tracking) [19] from computer vision to track poles in temporal order for 2D association.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.4. Local optimization</title>
        <p>Given the initial parameters and 2D observations of a feature, the optimal parameters, which are
stored as the final parameters in the map, are obtained by minimizing the reprojection error between
the feature in the image and in space. The cost function of the nonlinear optimization for the
feature parameters in the corresponding local frame is</p>
        <p>$$\hat{P}_m = \arg\min_{P_m} \sum_{j=1}^{N_z} (r_m^{dis}(z_j^m, P_m))^T (\Sigma_m^{dis})^{-1} r_m^{dis}(z_j^m, P_m) + (r_m^{ang}(z_j^m, P_m))^T (\Sigma_m^{ang})^{-1} r_m^{ang}(z_j^m, P_m) \qquad (10)$$
where $m \in \{1, 2, 3\}$ indexes the semantic category, $P_m$ represents the parameters of the
semantic feature, $z_j^m$ is the $j$th 2D observation of feature $m$, $N_z$ is the number of 2D
observations, and $\Sigma_m$ is the covariance. $r_m^{dis}(z_j^m, P_m)$ is the distance residual, while
$r_m^{ang}(z_j^m, P_m)$ is the angular residual.</p>
        <p>For road features 1, the distance error between the feature and its observation is defined as the
distance between the four vertices of the projected feature and the observation in the image plane.
For road features 2, limited by the camera view, only the offset along the y-direction of the image
is used:</p>
        <p>$$r_i^{dis}(z_j, P) = D_i(z_j, p_j), \quad i = 1, 2 \qquad (11)$$
$$p_j = K T_{ref}^j P \qquad (12)$$
where $P$ is the parameter vector of the road feature, $D_1(z_j, p_j)$ is the distance between the
observation in frame $j$ and the four projected vertices in the image plane, and $D_2(z_j, p_j)$ is
the difference in the vertical image coordinate of the center point and in the width between the
observation in frame $j$ and the projection. $T_{ref}^j$ is the relative pose between the reference
frame and the observation frame, and $K$ is the camera intrinsic matrix.</p>
        <p>For roadside pole features, the feature is first projected into the observed camera frame as a 2D
line $l = (l_1, l_2, l_3)$. The distance error is defined as the distances from the endpoints of the
2D observed line to the projected line in the normal plane, as shown in Figure 3(c):</p>
        <p>$$r_3^{dis}(z_i, P) = [D_s(s_{z_i}, p_i),\; D_e(e_{z_i}, p_i)]^T \qquad (13)$$
$$p_i = K_c T_{vc} T_{ref}^i T_{cv} P$$
where $D_s(s_{z_i}, p_i)$ and $D_e(e_{z_i}, p_i)$ denote the distances from the start and end points of
the 2D observed line $z_i$ to the projected 3D line $p_i$, respectively.</p>
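        <p>A minimal sketch of this per-feature refinement, stacking the residuals of Eq. (10) into a
nonlinear least-squares problem, is given below; the solver choice (Levenberg-Marquardt) and the
residual_fn interface are assumptions for illustration, not the paper's implementation.</p>
        <preformat>
# A minimal sketch of the per-feature refinement in Eq. (10), assuming a
# user-supplied residual_fn(P, z) that returns the whitened distance/angle
# residuals of one 2D observation z.
import numpy as np
from scipy.optimize import least_squares

def refine_feature(P0, observations, residual_fn):
    def stacked(P):
        return np.concatenate([residual_fn(P, z) for z in observations])
    # Levenberg-Marquardt is an assumed solver choice for this sketch.
    return least_squares(stacked, np.asarray(P0, dtype=float), method="lm").x
        </preformat>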
      </sec>
      <sec id="sec-3-7">
        <title>3.5. Localization verification</title>
        <p>Based on the prior semantic map, we evaluate its accuracy and usability by matching localization.
Because single-frame semantics are sparse and perspective distortion filters out some features, we
adopt a sliding-window strategy for localization to ensure rich semantic features and robust pose
optimization. The initial pose $T_t$ is obtained by combining the estimated pose at time $t-1$ with
the odometry. The local map is then retrieved and matching pairs are established by projection
distance according to (11) and (12). According to the parameterization of the different features,
2D-3D matching pairs are established using Euclidean distance and point-to-line distance in the
vehicle frame. Based on the matched pairs, the pose of the current frame can be propagated to the
remaining frames in the window; minimizing the error between the features in the map and those in
each window frame yields the optimal pose $\hat{T}_t$:</p>
        <p>$$T_t = \hat{T}_{t-1} T_{t-1}^{t} \qquad (14)$$
$$\hat{T}_t = \arg\min_{T_t} \sum_{i=t-N_w}^{t} r\left((T_t T_i^t)^{-1}, M_i\right) \qquad (15)$$
$$M_i = \{S_j^c, s_j^c\}, \quad j = 1, \dots, N_i, \quad c = 1, \dots, N_c \qquad (16)$$
where $N_c$, $N_w$ and $N_i$ are the number of semantic categories, the sliding window size, and the
number of 2D-3D matching pairs in window frame $i$, respectively. The residual function $r$ follows
Section 3.4. $S_j^c$ and $s_j^c$ are the map primitives and 2D features in the matching pairs, and
$T_{t-1}^{t}$ is the relative pose provided by the odometry.</p>
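        <p>Putting Eqs. (14)-(15) together, one localization step can be sketched as follows; all interfaces
(match_fn, residual_fn, the window frame fields) are assumed placeholders, not the paper's actual
implementation.</p>
        <preformat>
# A minimal sketch of one sliding-window localization step (cf. Eqs. (14)-(15)).
import numpy as np
from scipy.optimize import least_squares

def pose_from_xy_yaw(x):
    """Build a 4x4 SE(3) pose from planar [tx, ty, yaw]."""
    c, s = np.cos(x[2]), np.sin(x[2])
    T = np.eye(4)
    T[0, 0], T[0, 1], T[1, 0], T[1, 1] = c, -s, s, c
    T[0, 3], T[1, 3] = x[0], x[1]
    return T

def xy_yaw_from_pose(T):
    return np.array([T[0, 3], T[1, 3], np.arctan2(T[1, 0], T[0, 0])])

def localize_step(T_prev, T_odom, window, semantic_map, match_fn, residual_fn):
    T_init = T_prev @ T_odom                          # Eq. (14): odometry prediction
    x0 = xy_yaw_from_pose(T_init)
    # 2D-3D matching pairs per window frame (projection / point-to-line distance)
    pairs = [match_fn(T_init @ f.rel_pose, f.features, semantic_map) for f in window]

    def cost(x):                                      # Eq. (15): window residuals
        res = []
        for f, m in zip(window, pairs):
            T_f = pose_from_xy_yaw(x) @ f.rel_pose    # propagate pose inside window
            res.extend(residual_fn(T_f, m))
        return np.asarray(res)

    return pose_from_xy_yaw(least_squares(cost, x0).x)
        </preformat>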
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <sec id="sec-4-0">
        <title>4.1. Setup</title>
        <p>The KAIST Urban Dataset [20] is a publicly available dataset of urban road scenarios; we conduct
experiments on the urban26 sequence, which has a total length of 3.98 km. The images acquired by the
left camera together with the ground-truth poses are used for mapping. The semantic mapping process
is implemented offline, while localization runs in real time.</p>
        <p>Figure 5: KAIST Complex Urban dataset. (a) The data-collection vehicle; (b) the route of the
experimental data in Google Maps.</p>
      </sec>
      <sec id="sec-4-1">
        <title>4.2. Semantic Mapping</title>
        <p>
          Extracting 2D semantic features from the raw images is the first step of mapping. DeepLabv3+ and
Sem-LSD are used to extract road semantics and roadside pole semantics, respectively. Due to the
lack of annotated data, DeepLabv3+ is first pre-trained on Cityscapes and then fine-tuned on the SeRM
dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to obtain an accurate semantic inference model, built on the
MMSegmentation [21] platform. The labels trained with DeepLabv3+ are slowdowns, arrows,
crosswalks, road lines, stop lines, numbers and texts. Considering accuracy and other issues, only
slowdowns, crosswalks, road lines and stop lines are used for mapping. For roadside pole features, a
model annotated and trained on the KAIST Urban dataset is used to detect poles.
        </p>
        <p>Considering the perspective effect, the distant ground is more distorted in the image, so we
preserve semantic features within an ROI from 2 to 20 m in front of the vehicle center. During
parameterization, since roads are generally straight where slowdowns, crosswalks and stop lines
appear, the parameter $\theta$ of these three feature types is set to 0° in this paper for
convenience of reconstruction. For broken lines, which are numerous and common at curves, this
parameter cannot be neglected and has some influence on localization, so the angle is estimated
during initialization. In the temporal association, we expand the width of the target box by 20
pixels before tracking.</p>
        <p>The final semantic map is shown in Figure 7; the map built for a road of length 3.98 km is only
35.2 kB. Feature parameterization thus yields a lightweight map while retaining five semantic
categories: slowdowns, crosswalks, stop lines, broken lines and poles, numbering 36, 23, 21, 506,
and 153 instances, respectively.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Localization Verification</title>
        <p>
          Two matching configurations are used to verify the accuracy and usability of the maps: one using
ground features only, i.e., slowdowns and broken lines (match-G: Ground); the other using
multidimensional features, i.e., ground semantic features plus poles (match-GP: Ground and Pole).
The accuracy of the map is evaluated by the error between the matching-optimized pose and the
ground truth, focusing on the localization errors in x, y and yaw angle. To verify the efficiency
and accuracy of our map, we also compare it with the semantic point cloud ICP localization method
used in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which matches the semantic point cloud map against the local point cloud generated by IPM
using ICP, and combines the result with odometry to achieve 6-DOF localization. For convenience of
verification, and based on the fact that roll and pitch change much less than yaw when driving in
road scenes, we evaluate accuracy by the error between the matching results and the ground truth in
translation and yaw angle.
        </p>
        </p>
        <p>Figure 8 shows the distribution of match-GP localization errors along the trajectory. Over the
whole map, the localization error is about 0.3 m, while some positions show larger errors, usually
at intersections that lack road semantic features. Figure 9 shows the match-GP results on the
semantic map; the localization results stay close to the ground-truth trajectory with only small
jumps.</p>
        <p>
          Benefiting from multidimensional features, our map provides pole-feature constraints in addition
to ground semantics. As shown in Table 1, the positioning error of match-GP in x and y is about
0.3 m, and the yaw angle error is less than 0.5°. Compared with match-G, adding pole constraints
brings 5% and 12% improvements in x and y, and a small improvement in yaw. As shown in Figure 10,
the positioning frequency increases greatly while the positioning accuracy improves and concentrates
in a lower error region; the number of positioning points in match-GP is 3.4 times that of match-G,
which yields smoother positioning results. In summary, we parameterize the multidimensional features
of the environment to build a lightweight map. We also conducted experiments with point cloud map
localization using the method in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Although that method can achieve ten-centimeter-level localization accuracy for autonomous
vehicle services, our approach remarkably relaxes the demand for network bandwidth, storage, and
computing resources. Analyzing the real-time performance, our method avoids the many
time-consuming point cloud operations of [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and is therefore faster and less resource-dependent, while still achieving decimeter-level
positioning, which satisfies the localization accuracy requirements of autonomous vehicle services
(e.g., Robo-taxi, unmanned delivery vehicles) in urban road scenes.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we propose a parameterized multidimensional semantic mapping method that relies
on a monocular camera to achieve lightweight mapping in urban road environments. Decimeter-level
localization accuracy is achieved using multidimensional features, while kilobyte-level storage per
kilometer makes the map applicable to AV tasks with limited hardware resources. In future work, we
hope to make better use of the map by applying other map semantics and boundary consistency in
localization.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Zhang,</surname>
          </string-name>
          <article-title>High definition map for automated driving: overview and analysis</article-title>
          ,
          <source>J. Navig. 73.2</source>
          (
          <year>2019</year>
          )
          <fpage>324</fpage>
          --
          <lpage>341</lpage>
          . doi:
          <volume>10</volume>
          .1017/s0373463319000638.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Onboard sensors-based selflocalization for autonomous vehicle with hierarchical map</article-title>
          ,
          <source>IEEE Trans. Cybern</source>
          . (
          <year>2022</year>
          )
          <volume>1</volume>
          ̢
          <fpage>14</fpage>
          . doi:
          <volume>10</volume>
          .1109/tcyb.
          <year>2022</year>
          .
          <volume>3155724</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Schonberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pollefeys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Geiger</surname>
          </string-name>
          , T. Sattler,
          <article-title>Semantic visual localization, in: 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</article-title>
          , IEEE,
          <year>2018</year>
          . doi:
          <volume>10</volume>
          .1109/cvpr.
          <year>2018</year>
          .
          <volume>00721</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Crowdsourced road semantics mapping based on pixel-wise confidence level</article-title>
          ,
          <source>Automot. Innov. 5</source>
          .
          <issue>1</issue>
          (
          <year>2022</year>
          )
          <volume>43</volume>
          ̢
          <fpage>56</fpage>
          . doi:
          <volume>10</volume>
          .1007/s42154-021- 00173-x.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>A light-weight semantic map for visual localization towards autonomous driving</article-title>
          ,
          <source>in: 2021 IEEE international conference on robotics and automation (ICRA)</source>
          , IEEE,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .1109/icra48506.
          <year>2021</year>
          .
          <volume>9561663</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hyun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>A lane-level road marking map using a monocular camera</article-title>
          ,
          <source>IEEE/CAA J. Autom. SIn. 9</source>
          .
          <issue>1</issue>
          (
          <year>2022</year>
          )
          <volume>187</volume>
          ̢
          <fpage>204</fpage>
          . doi:
          <volume>10</volume>
          .1109/jas.
          <year>2021</year>
          .
          <volume>1004293</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Gao,
          <article-title>BSP-MonoLoc: basic semantic primitives based monocular localization on roads</article-title>
          ,
          <source>in: 2021 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source>
          , IEEE,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .1109/iros51168.
          <year>2021</year>
          .
          <volume>9636321</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          , TMϢ Loc:
          <article-title>tightly-coupled monocular map matching for high precision vehicle localization</article-title>
          ,
          <source>IEEE Trans. Intell. Transp. Syst</source>
          . (
          <year>2022</year>
          )
          <volume>1</volume>
          ̢
          <fpage>14</fpage>
          . doi:
          <volume>10</volume>
          .1109/tits.
          <year>2022</year>
          .
          <volume>3176914</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <surname>H. Zhang,</surname>
          </string-name>
          <article-title>LTSR: long-term semantic relocalization based on HD map for autonomous vehicles</article-title>
          ,
          <source>in: 2022 IEEE international conference on robotics and automation (ICRA)</source>
          , IEEE,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .1109/icra46639.
          <year>2022</year>
          .
          <volume>9811855</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , E. Cheng,
          <article-title>Coarse-to-fine semantic localization with HD map for autonomous driving in structural scenes</article-title>
          ,
          <source>in: 2021 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source>
          , IEEE,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .1109/iros51168.
          <year>2021</year>
          .
          <volume>9635923</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Visual mapping and localization system based on compact instance-level road markings with spatial uncertainty</article-title>
          ,
          <source>IEEE Robot. Autom. Lett. 7</source>
          .
          <issue>4</issue>
          (
          <year>2022</year>
          )
          <volume>10802</volume>
          ̢
          <fpage>10809</fpage>
          . doi:
          <volume>10</volume>
          .1109/lra.
          <year>2022</year>
          .
          <volume>3196470</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] W. Cheng, S. Yang,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Road mapping and localization using sparse semantic visual features</article-title>
          ,
          <source>IEEE Robot. Autom. Lett. 6</source>
          .
          <issue>4</issue>
          (
          <year>2021</year>
          )
          <volume>8118</volume>
          ̢
          <fpage>8125</fpage>
          . doi:
          <volume>10</volume>
          .1109/lra.
          <year>2021</year>
          .
          <volume>3068948</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Roadside HD map object reconstruction using monocular camera</article-title>
          ,
          <source>IEEE Robot. Autom. Lett</source>
          . (
          <year>2022</year>
          )
          <article-title>1 ̢ 8</article-title>
          . doi:
          <volume>10</volume>
          .1109/lra.
          <year>2022</year>
          .
          <volume>3185367</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Encoder-Decoder with atrous separable convolution for semantic image segmentation</article-title>
          ,
          <source>in: Computer vision ̢ ECCV</source>
          <year>2018</year>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , p.
          <volume>833</volume>
          ̢
          <fpage>851</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -01234-2_
          <fpage>49</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>