<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Object Detection in Roadside Settings: Extending Data and Enhancing Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sondos Mohamed</string-name>
          <email>sondoswa.mohamed@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Object Detection, Monocular Models, Roadside Cameras.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>Via Ospedale, 72, 09124 Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Understanding three-dimensional objects is crucial in domains like urban autonomous driving, roadside monitoring, and augmented and virtual reality. Traditionally, this task required expensive LiDAR sensors or stereo RGB imaging because monocular image-only methods lack depth information. Recent advances in monocular models based on deep learning have improved this situation, yet real-world challenges persist. For instance, variations in camera properties and object complexity constrain existing monocular 3D object detection. In my PhD research, I focus on monocular 3D object detection from images collected by roadside cameras. Firstly, my objective is to curate diverse datasets that encompass a wide array of scenarios and camera configurations. Secondly, I strive to train and assess detection models, surmounting existing limitations. Thirdly, my goal is to refine these models, fostering adaptability and robustness, so that they generalize across diverse scenes and scenarios. This work advances monocular 3D object detection in domains like roadside monitoring.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Object detection methods in single images can take two forms: 2D methods [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12</xref>
        ] or 3D methods [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref16 ref17 ref18 ref19">13, 14, 15, 16, 17, 18, 19</xref>
        ]. 3D object detection
approaches offer several advantages. By providing a better understanding of the scene and enabling the
detection of occluded objects, they can enhance the accuracy and reliability of object detection
in complex environments. Moreover, they are better suited to describing object pose and shape.
      </p>
      <p>
        However, the lack of depth information in 2D images makes it challenging to precisely
estimate the size and location of objects. 3D object detection has applications in both indoor
and outdoor contexts. In outdoor scenarios, recent advancements in autonomous driving have
shown promising results [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17">14, 15, 16, 17</xref>
        ]. Furthermore, the adoption of an increasing number of
datasets [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27">20, 21, 22, 23, 24, 25, 26, 27</xref>
        ] has further improved the effectiveness of this technology.
      </p>
      <p>
        Nevertheless, upon closer examination of outdoor datasets, it becomes evident that most of
them are tailored for autonomous driving, with only a limited number focusing on roadside
scenarios [
        <xref ref-type="bibr" rid="ref28 ref29 ref30 ref31 ref32">28, 29, 30, 31, 32</xref>
        ]. A more in-depth analysis of existing monocular models reveals
that the majority are trained and tested using a single dataset. Recent efforts have attempted
to address these limitations. For instance, a recent work [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] has made significant progress by
integrating indoor and outdoor datasets into a large, standardized dataset and training models to
cover this diversity. However, this work does not encompass roadside
datasets, and its pretrained models face challenges when tested on roadside images. Most
roadside cameras were originally installed as CCTV cameras, and many have been in place
for decades. Some of them lack crucial information, such as focal length,
camera coordinate systems, and other configuration details. With this in mind, this PhD thesis
aims to generate a diverse dataset that includes different scenes, focal lengths, and resolutions.
Subsequently, we aim to develop models that can accommodate these diverse requirements. Our
ultimate goal is to determine whether our model generalizes
at test time, achieving satisfactory performance on previously unseen scenes and
datasets without retraining. This endeavor underscores the need for comprehensive, versatile
datasets and models in the field of 3D object detection, especially for real-world scenarios.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Research Plan</title>
      <p>This research plan outlines my three-year PhD trajectory aimed at advancing
monocular 3D object detection for roadside monitoring. It progresses
from building a foundational understanding of the subject to practical application and integration
into real-world scenarios. The plan is organized into three years, as follows.</p>
      <p>Year 1: Literature Review and Foundation. During the first year of my doctoral program, I
established a solid foundation for my research by conducting an extensive literature review.
This review covered 2D and 3D object detection methodologies, as well as
object tracking and person re-identification techniques, with a specific focus on their application
in roadside monitoring using CCTV cameras. My primary objectives during this foundational
year were to gain a deep understanding of the existing research landscape and to familiarize
myself with fundamental concepts and advanced techniques in computer vision for CCTV
camera-based monitoring. These efforts provided the groundwork for the subsequent steps.</p>
      <p>Year 2: Data Generation and Model Reproduction. In the second year of my doctoral
journey, I moved from theory to practical implementation. This phase had two main
objectives: data generation and model adaptation. I used simulators to create synthetic datasets
customized for roadside monitoring, allowing for initial model testing. Simultaneously, I worked
on adapting and fine-tuning existing 3D object detection models, originally developed for
autonomous driving, to suit the specific needs of roadside monitoring. Key achievements included
the creation of synthetic datasets with controlled variations in parameters like resolution and
focal length, as well as the adaptation of state-of-the-art monocular 3D object detection models
to the generated dataset, aligning them with the requirements of roadside monitoring.</p>
      <p>Year 3: Real-World Data Creation, Model Improvement, and Integration. In the third
year of my research, the focus shifts to real-world data. This
phase involves several key activities: curating datasets from roadside
cameras, rigorously evaluating the previously developed models to assess their generalizability,
and improving model efficiency, accuracy, and robustness. The year will culminate in
the integration and thorough evaluation of these
models within an existing interface customized for use by local municipalities in Sardinia. The
expected outcomes include curated datasets
representing real-world scenarios, extensive experiments testing the models’ adaptability,
efficiency improvements informed by real-world data, and the integration
and evaluation of the refined models in a practical operational setting.</p>
      <p>This structured and
progressive research plan seeks to bridge the gap between theoretical knowledge and practical
application in the domain of 3D object detection for roadside monitoring. It strives to make a
comprehensive contribution to the field of computer vision, particularly within
the context of enhancing roadside safety and security.</p>
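      <p>As a minimal sketch of how controlled variations in resolution and focal length relate to each other, the following Python snippet derives pinhole intrinsics from an image size and a horizontal field of view, which is how simulated cameras are commonly parameterized. The function names, the example resolutions, and the assumptions of square pixels and a centered principal point are illustrative and are not taken from the MonoRoadCam generation pipeline.</p>
      <preformat>
```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def focal_length_px(image_width: int, fov_deg: float) -> float:
    """Pinhole focal length in pixels from the horizontal field of view."""
    return image_width / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


def intrinsics(image_width: int, image_height: int, fov_deg: float) -> Matrix:
    """3x3 intrinsic matrix, assuming square pixels and a centered principal point."""
    f = focal_length_px(image_width, fov_deg)
    return [
        [f, 0.0, image_width / 2.0],
        [0.0, f, image_height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def project(K: Matrix, point_xyz: Tuple[float, float, float]) -> Tuple[float, float]:
    """Project a 3D point in camera coordinates (z forward) to pixel coordinates."""
    x, y, z = point_xyz
    if not z > 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = K[0][0] * x / z + K[0][2]
    v = K[1][1] * y / z + K[1][2]
    return (u, v)


if __name__ == "__main__":
    # The same 3D point seen by three hypothetical camera configurations.
    point = (2.0, 1.5, 20.0)
    for width, height, fov in [(1242, 375, 90.0), (1920, 1080, 90.0), (1920, 1080, 60.0)]:
        K = intrinsics(width, height, fov)
        print(width, height, fov, round(K[0][0], 1), project(K, point))
```
      </preformat>
      <p>Under these assumptions, narrowing the field of view at a fixed resolution increases the focal length in pixels, so the same object appears larger in the image. This is one reason a model trained on a single camera configuration may misjudge depth on another.</p>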
    </sec>
    <sec id="sec-4">
      <title>3. Current Results</title>
      <p>
        The main result so far is the first version of “MonoRoadCam” [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], a synthetic dataset serving two
purposes: facilitating the adaptation of 3D object detection methods for roadside cameras and
evaluating existing methods from the autonomous driving domain. MonoRoadCam was created
using the CARLA simulation environment [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], ensuring data closely resembling real-world
scenarios and adhering to the KITTI format [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Our contribution is threefold:
• Synthetic Dataset Generation: MonoRoadCam is designed for monocular 3D object
detection, featuring 7,481 development images and 7,518 test images, all annotated with
object type, size, location, and orientation, and simulated under three weather conditions.
• Model Reproduction and Evaluation: Our research verifies the reproducibility of
state-of-the-art monocular 3D object detection methods (M3D-RPN [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Kinematic [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], SMOKE
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Monodle [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]), originally designed for autonomous driving, in the roadside context.
• Comparative Study: We conducted an extensive comparative study between 3D object
detection datasets captured by roadside and frontal cameras, revealing the potential
and limitations of applying autonomous driving solutions directly, without further training, to
monocular roadside camera images.
      </p>
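      <p>To make the annotation fields listed above concrete, the following Python sketch parses one line of a KITTI-format object label into type, size, location, and orientation. The field order follows the public KITTI object development kit. The class and function names are hypothetical, and the example label line contains made-up values, not data from MonoRoadCam.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class KittiObject:
    obj_type: str                             # e.g. "Car", "Pedestrian"
    truncated: float                          # 0.0 (fully visible) to 1.0
    occluded: int                             # occlusion state code
    alpha: float                              # observation angle (radians)
    bbox: Tuple[float, float, float, float]   # left, top, right, bottom (pixels)
    dimensions: Tuple[float, float, float]    # height, width, length (meters)
    location: Tuple[float, float, float]      # x, y, z in camera coordinates (meters)
    rotation_y: float                         # yaw around the camera Y axis (radians)


def parse_kitti_label_line(line: str) -> KittiObject:
    """Parse one label line: 15 fields, plus an optional detection score."""
    fields = line.split()
    if len(fields) not in (15, 16):
        raise ValueError(f"expected 15 or 16 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:15]]
    return KittiObject(
        obj_type=fields[0],
        truncated=values[0],
        occluded=int(values[1]),
        alpha=values[2],
        bbox=(values[3], values[4], values[5], values[6]),
        dimensions=(values[7], values[8], values[9]),
        location=(values[10], values[11], values[12]),
        rotation_y=values[13],
    )


if __name__ == "__main__":
    # Illustrative label line with made-up values.
    example = "Car 0.00 0 -1.58 587.0 173.3 614.1 200.1 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
    print(parse_kitti_label_line(example))
```
      </preformat>
      <p>Note that the standard KITTI label stores only a single yaw angle per object, which is sufficient for frontal driving scenes on flat roads but may be restrictive for elevated, tilted roadside cameras.</p>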
      <p>
        Quantitatively, when tested on the original KITTI dataset [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], the reproduced models show
a slight decrease in performance compared to the evaluation metric scores reported in their
respective papers, with SMOKE exhibiting a notable drop. Comparing results across datasets,
the performance measured on our synthetic dataset was higher than the performance measured
on the original KITTI dataset. Finally, qualitatively, our evaluation shows the potential of
MonoRoadCam as a valuable resource for advancing monocular 3D object detection.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4. Open Challenges and Expected Benefits</title>
      <p>Based on these open challenges, I aim to answer the following three key questions:
• How can I effectively bridge the gap between existing 3D object detection
methods developed for autonomous driving and their applicability to the unique
challenges of roadside monitoring, especially when working with monocular
images from diverse cameras? This question lies at the heart of my research. The
transition from well-established autonomous driving scenarios to roadside monitoring
introduces a set of distinctive challenges. By addressing this question, I aim to develop
solutions that not only function effectively but also integrate into real-world indoor and
outdoor environments that require object detection from monocular cameras.
• Which strategies can be employed to systematically diversify roadside datasets
by incorporating variations in parameters such as resolution, focal length, and
full object orientation in outdoor scenarios (instead of yaw only)? How
will these controlled variations impact the adaptability and performance of
current 3D object detection models? The quality of data is paramount in training robust
computer vision models. However, the dynamic nature of roadside scenarios demands
datasets that are both diverse and reflective of real-world conditions. By exploring
controlled variations in critical parameters such as resolution and focal length, I aim
not only to enrich my data but also to understand how these variations influence the
adaptability and overall performance of my 3D object detection models.
• In the context of my research, how can I effectively assess the generalizability of
models, ensuring that they perform satisfactorily on previously unseen
scenes and datasets without requiring extensive retraining, and what are the
best practices for achieving this? The ability of computer vision models to generalize
across different scenarios is pivotal for their practical utility. Assessing and ensuring
the generalizability of my models is a central aspect of my research objectives. I aim
to develop models that not only excel in controlled environments but also demonstrate
robustness and reliability when applied to unpredictable roadside situations.</p>
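      <p>The difference between yaw-only and full object orientation can be illustrated with a short Python sketch that computes the eight corners of a 3D bounding box in camera coordinates. It assumes the KITTI camera convention (x right, y down, z forward, with the box location at the bottom center). The Euler composition order (yaw, then pitch, then roll) and all function names are assumptions made for illustration, not a convention defined in this work.</p>
      <preformat>
```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vec3 = Tuple[float, float, float]


def rot_x(a: float) -> Matrix:
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a: float) -> Matrix:
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a: float) -> Matrix:
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_yaw_only(yaw: float) -> Matrix:
    """Rotation about the camera Y axis only, as in KITTI-style labels."""
    return rot_y(yaw)


def rotation_full(yaw: float, pitch: float, roll: float) -> Matrix:
    """Full orientation; the Y-X-Z composition order is an assumption."""
    return matmul(rot_y(yaw), matmul(rot_x(pitch), rot_z(roll)))


def box_corners(dimensions: Vec3, location: Vec3, R: Matrix) -> List[Vec3]:
    """Eight box corners; the first four lie on the bottom face."""
    h, w, l = dimensions
    xs = [l / 2, l / 2, -l / 2, -l / 2, l / 2, l / 2, -l / 2, -l / 2]
    ys = [0.0, 0.0, 0.0, 0.0, -h, -h, -h, -h]
    zs = [w / 2, -w / 2, -w / 2, w / 2, w / 2, -w / 2, -w / 2, w / 2]
    corners = []
    for x, y, z in zip(xs, ys, zs):
        rotated = [R[i][0] * x + R[i][1] * y + R[i][2] * z for i in range(3)]
        corners.append((rotated[0] + location[0],
                        rotated[1] + location[1],
                        rotated[2] + location[2]))
    return corners


if __name__ == "__main__":
    dims, loc = (1.5, 1.6, 4.0), (0.0, 1.7, 20.0)
    flat = box_corners(dims, loc, rotation_yaw_only(0.3))
    tilted = box_corners(dims, loc, rotation_full(0.3, 0.1, 0.0))
    print([round(c[1], 3) for c in flat[:4]])
    print([round(c[1], 3) for c in tilted[:4]])
```
      </preformat>
      <p>With yaw only, the bottom face of the box always stays parallel to the camera's x-z plane. Adding pitch and roll lets the box follow a ground plane that is tilted relative to the camera, which is the typical situation for a camera mounted on a pole and looking down at the road.</p>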
      <p>By addressing these questions, my research aims to provide a foundational framework for
other researchers striving to develop a zero-shot model that overcomes current challenges and
limitations in 3D object detection for roadside monitoring in next-generation smart cities.</p>
      <p>Acknowledgements. The author extends gratitude to Prof. Salvatore Carta and Dr. Mirko
Marras for their supervision. Special thanks to Marco Sau for his support in part of this project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>
          ,
          <source>in: NIPS</source>
          <year>2015</year>
          ,
          <year>2015</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cholakkal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Anwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <article-title>D2det: Towards high quality object detection and instance segmentation</article-title>
          ,
          <source>in: CVPR</source>
          <year>2020</year>
          , Computer Vision Foundation / IEEE,
          <year>2020</year>
          , pp.
          <fpage>11482</fpage>
          -
          <lpage>11491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gkioxari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Mask R-CNN</article-title>
          ,
          <source>CoRR</source>
          abs/1703.06870 (
          <year>2017</year>
          ). arXiv:1703.06870.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Scale-aware trident networks for object detection</article-title>
          ,
          <source>in: ICCV</source>
          <year>2019</year>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>6053</fpage>
          -
          <lpage>6062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: CVPR</source>
          <year>2016</year>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>SSD: single shot multibox detector</article-title>
          ,
          <source>in: ECCV</source>
          <year>2016</year>
          , volume
          <volume>9905</volume>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Cornernet: Detecting objects as paired keypoints</article-title>
          ,
          <source>in: ECCV</source>
          <year>2018</year>
          , volume
          <volume>11218</volume>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>765</fpage>
          -
          <lpage>781</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Krähenbühl</surname>
          </string-name>
          ,
          <article-title>Objects as points</article-title>
          ,
          <source>CoRR</source>
          abs/1904.07850 (
          <year>2019</year>
          ). arXiv:1904.07850.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: ICCV</source>
          <year>2017</year>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>2999</fpage>
          -
          <lpage>3007</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Atzori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fenu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marras</surname>
          </string-name>
          ,
          <article-title>Explaining bias in deep face recognition via image characteristics</article-title>
          ,
          <source>in: IJCB</source>
          <year>2022</year>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Atzori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fenu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marras</surname>
          </string-name>
          ,
          <article-title>Demographic bias in low-resolution deep face recognition in the wild</article-title>
          ,
          <source>IEEE J. Sel. Top. Signal Process</source>
          .
          <volume>17</volume>
          (
          <year>2023</year>
          )
          <fpage>599</fpage>
          -
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fenu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lafhouli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marras</surname>
          </string-name>
          ,
          <article-title>Exploring algorithmic fairness in deep speaker verification</article-title>
          ,
          <source>in: ICCSA</source>
          <year>2020</year>
          , volume
          <volume>12252</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brazil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>M3D-RPN: monocular 3d region proposal network for object detection</article-title>
          ,
          <source>in: ICCV</source>
          <year>2019</year>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>9286</fpage>
          -
          <lpage>9295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brazil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pons-Moll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>Kinematic 3d object detection in monocular video</article-title>
          ,
          <source>in: ECCV</source>
          <year>2020</year>
          , volume
          <volume>12368</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2020</year>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tóth</surname>
          </string-name>
          ,
          <article-title>SMOKE: single-stage monocular 3d object detection via keypoint estimation</article-title>
          ,
          <source>in: CVPR</source>
          <year>2020</year>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>4289</fpage>
          -
          <lpage>4298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <article-title>Delving into localization errors for monocular 3d object detection</article-title>
          ,
          <source>in: CVPR</source>
          <year>2021</year>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>4721</fpage>
          -
          <lpage>4730</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>FCOS3D: fully convolutional one-stage monocular 3d object detection</article-title>
          ,
          <source>in: ICCV</source>
          <year>2021</year>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>913</fpage>
          -
          <lpage>922</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Probabilistic and geometric depth: Detecting objects in perspective</article-title>
          ,
          <source>in: Conference on Robot Learning</source>
          , volume
          <volume>164</volume>
          ,
          <publisher-name>PMLR</publisher-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>1475</fpage>
          -
          <lpage>1485</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brazil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Straub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravi</surname>
          </string-name>
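          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gkioxari</surname>
          </string-name>
          ,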
          <article-title>Omni3d: A large benchmark and model for 3d object detection in the wild</article-title>
          ,
          <source>CoRR</source>
          abs/2207.10660 (
          <year>2022</year>
          ). arXiv:2207.10660.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Geiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <article-title>Are we ready for autonomous driving? the KITTI vision benchmark suite</article-title>
          ,
          <source>in: CVPR</source>
          <year>2012</year>
          , IEEE,
          <year>2012</year>
          , pp.
          <fpage>3354</fpage>
          -
          <lpage>3361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>The H3D dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes</article-title>
          ,
          <source>in: ICRA</source>
          <year>2019</year>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>9552</fpage>
          -
          <lpage>9557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
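          <string-name>
            <given-names>Q.-H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sevestre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Pahwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>A*3D dataset: Towards autonomous driving in challenging environments</article-title>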
          ,
          <source>in: ICRA</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Caesar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bankiti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Liong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Beijbom</surname>
          </string-name>
          ,
          <article-title>nuScenes: A multimodal dataset for autonomous driving</article-title>
          ,
          <source>CoRR</source>
          abs/1903.11027 (
          <year>2019</year>
          ). arXiv:1903.11027.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sangkloy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Carr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lucey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <article-title>Argoverse: 3d tracking and forecasting with rich maps</article-title>
          ,
          <source>in: CVPR</source>
          <year>2019</year>
          , Computer Vision Foundation / IEEE,
          <year>2019</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>One million scenes for autonomous driving: ONCE dataset</article-title>
          ,
          <source>in: NIPS</source>
          <year>2021</year>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>The ApolloScape dataset for autonomous driving</article-title>
          ,
          <source>in: CVPR</source>
          <year>2018</year>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>954</fpage>
          -
          <lpage>960</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kretzschmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dotiwalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chouard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patnaik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
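          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ngiam</surname>
          </string-name>
          ,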
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Timofeev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krivokon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <article-title>Scalability in perception for autonomous driving: Waymo open dataset</article-title>
          ,
          <source>CoRR</source>
          abs/1912.04838 (
          <year>2019</year>
          ). arXiv:1912.04838.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Meissner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Seeliger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dietmayer</surname>
          </string-name>
          ,
          <article-title>The Ko-PER intersection laserscanner and video dataset</article-title>
          ,
          <source>in: ITSC</source>
          <year>2014</year>
          ,
          <year>2014</year>
          , pp.
          <fpage>1900</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Rope3d: The roadside perception dataset for autonomous driving and monocular 3d object detection task</article-title>
          ,
          <source>in: CVPR</source>
          <year>2022</year>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>21309</fpage>
          -
          <lpage>21318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>BAAI-VANJEE roadside dataset: Towards the connected automated vehicle highway technologies in challenging environments of china</article-title>
          ,
          <source>CoRR</source>
          abs/2105.14370 (
          <year>2021</year>
          ). arXiv:2105.14370.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sochor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Špaňhel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herout</surname>
          </string-name>
          ,
          <article-title>BoxCars: Improving fine-grained recognition of vehicles using 3-D bounding boxes in traffic surveillance</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          PP (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <article-title>DAIR-V2X: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection</article-title>
          ,
          <source>CoRR</source>
          abs/2204.05575 (
          <year>2022</year>
          ). arXiv:2204.05575.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Podda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saia</surname>
          </string-name>
          ,
          <article-title>Can existing 3d monocular object detection methods work in roadside contexts? A reproducibility study</article-title>
          ,
          <source>in: AIxIA</source>
          <year>2023</year>
          , volume
          <volume>14318</volume>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Codevilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>CARLA: An open urban driving simulator</article-title>
          ,
          <source>in: CoRL</source>
          <year>2017</year>
          , volume
          <volume>78</volume>
          ,
          <publisher-name>PMLR</publisher-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . URL: http://proceedings.mlr.press/v78/dosovitskiy17a.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>