Multi-agent Approach to Predict the Trajectory of Road Infrastructure Agents Using a Convolutional Neural Network

Andrey Azarchenkov 1 and Maksim Lyubimov 1
1 Bryansk State Technical University, 7, 50-let Oktyabrya bul., Bryansk, 241035, Russia

Abstract
The problem of creating a fully autonomous vehicle is one of the most urgent in the field of artificial intelligence. Many companies claim to sell such cars for operation under certain conditions. The task of interacting with other road users is to detect them, determine their physical properties, and predict their future states. The result of this prediction is the trajectory of road users' movement over a given period of time in the near future. Based on such trajectories, the planning system determines the behavior of an autonomous vehicle. This paper demonstrates a multi-agent method for determining the trajectories of road users by means of a road map of the surrounding area, using convolutional neural networks. In addition, the input of the neural network receives an agent state vector containing additional information about the object. A number of experiments are conducted for the selected neural architecture in order to assess the effect of its modifications on the prediction result. The results are estimated using metrics that show the spatial deviation of the predicted trajectory. The method is trained on the nuscenes dataset, extended with data obtained from the LGSVL simulator.

Keywords
Trajectory prediction, autonomous-driving car, convolutional neural network, nuscenes dataset, LGSVL simulator.

1. Introduction
Driving a vehicle is a complex and many-sided task, since it requires real-time analysis of the surroundings and making a decision to perform a particular maneuver. In addition to its complexity, human errors cause serious consequences for both the driver and others. A potential way to minimize driving risks in the future is the use of autonomous vehicles.
The main idea of creating an unmanned vehicle is to develop a software package that can analyze the surroundings of the car and make driving decisions depending on the result of the analysis. Nowadays the developers of such systems distinguish 6 levels of autonomy of unmanned vehicles [15], where the zero level corresponds to vehicles without automation, and the fifth to fully autonomous cars. With the increase in the level of autonomy, the question of predicting the positions of the surrounding vehicles becomes more acute. The actions of other road users are often variable and difficult to predict. To solve this problem, software systems for driving a car in one form or another include a subsystem for predicting the states of objects [16].

GraphiCon 2021: 31st International Conference on Computer Graphics and Vision, September 27-30, 2021, Nizhny Novgorod, Russia
EMAIL: azarchenkovaa@gmail.com (A. Azarchenkov); max32@inbox.ru (M. Lyubimov)
ORCID: 0000-0003-4570-4442 (A. Azarchenkov); 0000-0003-0702-3662 (M. Lyubimov)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related works
The problem of predicting trajectories can be solved using various types of information. The most popular group of methods is based on working with sequences of positions of objects in space. Most often, such methods are used to predict the trajectories of pedestrians. Among them, the ETH [1] and UCY [6] datasets can be distinguished. The ETH dataset contains 2 scenes taken from a bird's-eye view and a total of 750 different motion paths. In this case, the distance between the positions in space is determined with a step of 0.4 seconds. The UCY dataset contains 3 scenes, also shot from a bird's-eye view. In total, it contains 900 trajectories with a position change step of 0.4 seconds.
Usually, these two datasets are used together, and often one of the five scenes is selected for testing, while methods using recurrent layers are usually applied [2, 5]. Prediction methods differ from each other in the way they provide information about the future positions of objects. Most methods, including the works discussed above, generate one or more object trajectories. However, other forms of prediction are also possible. For example, in [18] the result of the neural model is the area where the object can be, and the color indicates the time after which it will be there (Figure 1). To obtain such a grid map, the surrounding space is divided into cells of a fixed size, and the probability of finding the object in each of these cells is indicated. Joint prediction of object trajectories is also possible [19]. In this case, the approach takes into account not only the positions of other objects, but also their intentions.

Figure 1: An example of generated grid maps

This paper describes a method for solving the problem in the field of autonomous vehicles. The peculiarity of the task is that the prediction must be made for road users (agents) of different types. At the same time, it is important that the positions of all surrounding objects are taken into account when making the prediction. This approach is called multi-agent. Another feature of this field is that today the common practice for unmanned cars is the use of pre-prepared maps, which are referred to as HD maps [14]. Most often, they are used for the movement of the unmanned car itself, but since these maps follow public roads, their use makes it possible to predict the behavior of agents more accurately. In [13], the prediction of car behavior was based on the use of a map of the surrounding area. For the prediction system, such maps are very often presented in a bitmap format.
When a prediction has to be made, an image is created for the agent, which indicates its position on the map, as well as the positions of other agents. In this case, objects of different types are highlighted in different colors. In addition, for each object its previous states are plotted at fixed points in time. Thus, using the previous states, the speed of the object can be estimated. Such an image is used, for example, in the CoverNet neural network [4] and is called an input representation. The advantage of this approach is the ability to use convolutional layers, which are well suited to extracting key features from an image. However, there is also an approach in which the input representation is given as vectors [3]. The authors used a graph neural network whose input contained map elements and the trajectories of agents as vectors. The vectors formed polylines; each group was initially processed by the neural network separately, and then a fully connected graph was used, based on the assumption that the features are related to each other.

3. Method
The main idea of the prediction method is to generate an image containing the object under study, other objects, and a map of the area. Such an image is referred to as an input representation. Since all agents are added to the input representation, the approach can be considered multi-agent. Figure 2 shows an example of such an image. This input representation is fed to the input of the convolutional neural network.

Figure 2: An example of input representation

The RepVGG architecture is used as the main convolutional network (backbone) [7]. A distinctive feature of this architecture is its simplicity and speed. This convolutional network has different structures at training time and at inference time (Figure 3). At training time, the network has a residual structure, which is used primarily in ResNet models [8].
The idea of such a structure is to increase the accuracy of deep neural networks, since without a residual mechanism the error of the network grows as its depth increases. The network contains convolutional blocks, the ReLU activation function, batch normalization, and layer concatenation. After training, the neural model can be converted into a simple network without branching. The inference variant of the network contains only 3x3 convolutional blocks and the ReLU activation function. These layers are well optimized in the Nvidia cuDNN [9] and Intel MKL [10] libraries, which makes it possible to get the maximum performance out of the model. This aspect is relevant, since the task of predicting movement trajectories is often performed for a large number of objects at the same time. In addition, after converting the model, the video memory it consumes is reduced, which is also important for embedded systems.

After the main convolutional network, there is a block for processing the extracted features. Two methods were tested. In the first case, the feature matrix is flattened into a one-dimensional vector, and the result is fed to linear layers that output the trajectories and a confidence score for each of them. The second method performs weighted subsampling: the features are first multiplied by a matrix whose values are selected during training, and then a subsampling (pooling) operation is performed.

In addition to the input representation, the input of the neural model also receives a vector containing the speed, acceleration, and class of the object for which the prediction is made. Using such a vector, the neural network explicitly obtains the physical characteristics of the object. This vector is combined with the image features obtained after convolution (Figure 4).

4. Data preparation
The nuscenes dataset was selected for training the neural model [11]. It consists of 1000 scenes, each lasting 20 seconds.
Each scene is a set of data flows from sensors installed on the car, as well as an annotation for them. Cameras, lidars, radars, GPS, and IMU sensors are used. The entire dataset contains 1,400,000 camera images and 390,000 lidar sweeps. The dataset is characterized by high-quality annotation, which is very important for predicting trajectories. The annotation contains a description of all the objects, as well as maps of the surrounding area. To work with the dataset, the developers provide a library that gives access to objects and recorded data.

In addition to this dataset, our own data were collected on the basis of the open software package for driving a car, Apollo [17], as well as the LGSVL simulator [12]. The main idea of creating our own dataset is to use the output of the surrounding-object detection system to generate annotations for input representations. The LGSVL simulator can send messages in a format similar to the Apollo scene system. The Apollo software package is based on the Cyber framework, which makes it possible to form and transmit messages between separate independent modules. To build our own dataset, a program was written that reads such messages and creates input representations based on them. In this case, the future states of the detected object were used as the annotation. Since the correct determination of the object's position was important for this task, only objects detected by means of lidar were used.

A comparison of the nuscenes training sample and our own sample showed that the average speed of all moving cars in our own sample is 6 m/s, while in the nuscenes dataset it is 2 m/s. This comparison indicates the need to collect one's own training sample when using the neural model in practice. For further training, it was decided to combine the collected samples, since the developed dataset contains new patterns of vehicle behavior.
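The way such input representations are assembled can be illustrated with a minimal sketch. The grid size, resolution, color scheme, and agent format below are illustrative assumptions, not the exact parameters used in this work, and the HD-map layer is omitted for brevity; only the target agent, other agents, and their fading past states are drawn.

```python
import numpy as np

GRID = 500   # image size in pixels (assumption)
RES = 0.1    # meters per pixel (assumption)

def world_to_pixel(x, y, ego_x, ego_y):
    """Convert world coordinates to pixel coordinates centered on the target agent."""
    px = int(GRID / 2 + (x - ego_x) / RES)
    py = int(GRID / 2 - (y - ego_y) / RES)
    return px, py

def rasterize(target, agents, history=4):
    """Draw the target agent, other agents, and their past states into an RGB image.

    Each agent is a dict with 'x', 'y', 'type' and a list of past positions;
    older states are drawn dimmer so the network can infer velocity from the image.
    """
    img = np.zeros((GRID, GRID, 3), dtype=np.uint8)
    colors = {"car": (0, 0, 255), "pedestrian": (255, 0, 0), "target": (0, 255, 0)}
    for agent in agents + [target]:
        color = colors["target"] if agent is target else colors.get(agent["type"], (128, 128, 128))
        states = agent["past"][-history:] + [(agent["x"], agent["y"])]
        for i, (x, y) in enumerate(states):
            fade = (i + 1) / len(states)  # older states are dimmer
            px, py = world_to_pixel(x, y, target["x"], target["y"])
            if 0 <= px < GRID and 0 <= py < GRID:
                img[py, px] = tuple(int(c * fade) for c in color)
    return img

target = {"x": 10.0, "y": 5.0, "type": "car", "past": [(9.0, 5.0), (9.5, 5.0)]}
other = {"x": 14.0, "y": 5.0, "type": "pedestrian", "past": [(14.0, 4.5)]}
img = rasterize(target, [other])
print(img.shape)  # (500, 500, 3)
```

In a real pipeline, agents would be drawn as oriented rectangles rather than single pixels, and the map layers (lanes, crosswalks) would be rendered underneath; the sketch only shows the principle of encoding positions and history into image channels.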
As part of this work, we used vertical reflection of the input representation as augmentation. In addition to reflecting the input representation itself, the true trajectory of the object is also mirrored. The tests performed (Table 3) showed that this augmentation has a positive effect on the result.

Figure 3: Structure of the neural model during training (left) and inference (right)

Figure 4: Configuration of the input layers of the neural model

5. Experiments
To solve the prediction problem, a number of experiments were conducted with different configurations of the neural model and training parameters. To verify the results, the neural model was validated on a pre-prepared sample containing all moving cars, pedestrians, and cyclists from the nuscenes test set, resulting in 53,154 input representations. The following metrics were used to evaluate the quality of the neural model:

• average displacement error (ADE) is the average distance, in meters, between corresponding points of the ground-truth and predicted trajectories (1):

ADE = (1/12) Σ_{i=1}^{12} √((X_i − x_i)² + (Y_i − y_i)²)   (1)

where X, Y are the ground-truth coordinates and x, y are the predicted coordinates;

• final displacement error (FDE) is the distance, in meters, between the two final points (2):

FDE = √((X_12 − x_12)² + (Y_12 − y_12)²)   (2)

where X, Y are the ground-truth coordinates and x, y are the predicted coordinates.

All experiments were conducted sequentially, and the best candidate was selected at each stage. The network was trained to predict several possible trajectories, and the metrics were calculated using all predictions simultaneously. Figure 5 shows an example of such a prediction. Based on the results of testing the available architectures (Table 1), the A2 model was chosen for further consideration, since the inference time of the models increases from A0 to B2. Since time is an important factor for this task, we preferred a faster architecture over a more accurate one.
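The ADE and FDE metrics defined in (1) and (2) can be sketched as follows. The `best_of_k` helper is an assumption about how a multi-trajectory output is scored: taking the candidate closest to the ground truth is a common convention consistent with the improvement seen as the number of trajectories grows, but the paper only states that metrics were calculated using all predictions simultaneously.

```python
import numpy as np

def ade(gt, pred):
    """Average displacement error: mean Euclidean distance over all
    trajectory points, in meters. gt and pred are arrays of shape (12, 2)."""
    return float(np.mean(np.linalg.norm(gt - pred, axis=1)))

def fde(gt, pred):
    """Final displacement error: Euclidean distance between the last points."""
    return float(np.linalg.norm(gt[-1] - pred[-1]))

def best_of_k(gt, preds):
    """Score a model that outputs K candidate trajectories by the candidate
    with the lowest ADE (an assumed convention, see lead-in)."""
    errors = [ade(gt, p) for p in preds]
    best = int(np.argmin(errors))
    return errors[best], fde(gt, preds[best])

# A straight-line ground truth and a prediction offset by 1 m everywhere.
gt = np.stack([np.arange(12, dtype=float), np.zeros(12)], axis=1)
pred = gt + np.array([0.0, 1.0])
print(ade(gt, pred), fde(gt, pred))  # 1.0 1.0
```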
Table 2 shows a comparison of metrics for the model trained with and without augmentation. Since vertical augmentation had a positive effect on the result, we decided to conduct further experiments with it. The additional information vector (Table 3) also had a positive effect on the result. The final test was conducted to identify the optimal number of trajectories (Table 4). According to the results, an increase in the number of trajectories improves the metrics. However, when the algorithm is actually used in a prediction system, excessive trajectories can cause a lot of uncertainty for the planning system. Thus, a decision was made to stop at 9 trajectories, since the further improvement in the metrics does not compensate for the increasing uncertainty in practice. Figure 5 shows the result of the network operation. For clarity, the figure shows only the trajectory with the highest probability.

Table 1
Comparison of test set metrics for different neural network architectures

Architecture    ADE     FDE     Inference time (ms)
RepVgg-A0       1.153   2.537   1.0021
RepVgg-A1       1.131   2.485   1.2499
RepVgg-A2       1.112   2.429   2.2184
RepVgg-B0       1.102   2.392   2.9513
RepVgg-B1       1.097   2.365   3.9749
RepVgg-B2       1.057   2.299   5.7855

Table 2
Comparison of test set metrics for different augmentation options

Augmentation                        ADE     FDE
RepVgg-A2 without augmentation      1.112   2.429
RepVgg-A2 with augmentation         1.088   2.365

Table 3
Comparison of test set metrics with and without additional data

Using additional data               ADE     FDE
RepVgg-A2 without additional data   1.088   2.365
RepVgg-A2 with additional data      1.07    2.324

Table 4
Comparison of test set metrics for different numbers of predicted trajectories

Number of predicted trajectories          ADE     FDE
RepVgg-A2 with 1 predicted trajectory     1.07    2.324
RepVgg-A2 with 3 predicted trajectories   0.664   1.355
RepVgg-A2 with 5 predicted trajectories   0.53    1.013
RepVgg-A2 with 7 predicted trajectories   0.5     0.926
RepVgg-A2 with 9 predicted trajectories   0.467   0.829
RepVgg-A2 with 11 predicted trajectories  0.438   0.744
RepVgg-A2 with 13 predicted trajectories  0.417   0.688

Figure 5: An example of movement trajectory prediction

6. Conclusion
The paper considers the problem of predicting the trajectories of road network agents. To solve this problem, we used a multi-agent approach with an input representation to make a prediction. A convolutional neural network suitable for this task in terms of speed and accuracy was designed, the nuscenes training sample was analyzed, and our own data were collected to expand it. The resulting solution was tested taking into account various possible modifications and network configurations. On the basis of the tests, it was found that the applied approach to predicting trajectories depends less significantly on the depth of the neural model than classical problems that use convolutional neural networks, such as object detection. In addition, an important observation is that the resulting neural network is able to determine the type of object implicitly and make a prediction in accordance with the behavior patterns characteristic of this type. The neural model also successfully takes other agents into account when making a prediction. The resulting metrics suggest that the described approach can be successfully applied in real systems for driving an unmanned car.

7. References
[1] S. Pellegrini, A. Ess, K. Schindler, L. van Gool, You'll Never Walk Alone: Modeling Social Behavior for Multi-Target Tracking, in: 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 261-268. doi:10.1109/ICCV.2009.5459260.
[2] Y. Ma, et al., TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents, arXiv:1811.02146, 2019. doi:10.1609/aaai.v33i01.33016120.
[3] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, C. Schmid, VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11525-11533. doi:10.1109/CVPR42600.2020.01154.
[4] E. Grigore, F. Boulton, O. Beijbom, E. Wolff, CoverNet: Multimodal Behavior Prediction using Trajectory Sets, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 14062-14071. doi:10.1109/CVPR42600.2020.01408.
[5] Z. Wenjing, Trajectory Prediction with Recurrent Neural Networks for Predictive Resource Allocation, in: 14th IEEE International Conference on Signal Processing (ICSP), 2018, pp. 1-8.
[6] A. Lerner, Y. Chrysanthou, D. Lischinski, Crowds by Example, Computer Graphics Forum (2007), pp. 655-664.
[7] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun, RepVGG: Making VGG-style ConvNets Great Again, arXiv preprint arXiv:2101.03697, 2021.
[8] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.
[9] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: Efficient Primitives for Deep Learning, arXiv preprint arXiv:1410.0759, 2014.
[10] Intel oneAPI Math Kernel Library, 2020. URL: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html.
[11] Nuscenes dataset, 2020. URL: https://www.nuscenes.org.
[12] SVL Simulator, 2021. URL: https://www.svlsimulator.com/.
[13] A. Azarchenkov, M. Lyubimov, Algorithm for Predicting the Trajectory of Road Users to Automate Control of an Autonomous Vehicle, CEUR Workshop Proceedings 2744 (2020).
[14] S. Casas, A. Sadat, R. Urtasun, MP3: A Unified Model to Map, Perceive, Predict and Plan, arXiv:2101.06806, 2021.
[15] Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles, SAE J3016, 2016. URL: https://www.sae.org/standards/content/j3016_201806/.
[16] The Autoware Foundation, 2020. URL: https://www.autoware.org/.
[17] Apollo autonomous driving, 2020. URL: https://apollo.auto/.
[18] A. Jain, S. Casas, R. Liao, Y. Xiong, S. Feng, S. Segal, R. Urtasun, Discrete Residual Flow for Probabilistic Pedestrian Behavior Prediction, arXiv:1910.08041, 2019.
[19] A. Cui, A. Sadat, S. Casas, R. Liao, R. Urtasun, LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving, arXiv:2101.06547, 2021.