=Paper=
{{Paper
|id=Vol-2744/paper46
|storemode=property
|title=Big LIDAR Data in Digital Earth: Ways Out of Dead End
|pdfUrl=https://ceur-ws.org/Vol-2744/paper46.pdf
|volume=Vol-2744
|authors=Ilya Rylskiy
}}
==Big LIDAR Data in Digital Earth: Ways Out of Dead End==
Big LIDAR Data in Digital Earth: Ways Out of Dead End* Ilya Rylskiy [0000-0002-7675-4621] Lomonosov Moscow State University, 119991, Leninskie Gory, b. 1, Moscow, Russian Federation rilskiy@mail.ru Abstract. During past 25 years, laser scanning has evolved from an exper- imental method into a fully autonomous family of Earth remote sensing methods. Now this group of methods provides the most accurate and de- tailed spatial data sets, while the cost of data is constantly falling, the num- ber of measuring instruments (laser scanners) is constantly growing. The volumes of data that will be obtained during the surveys in the coming dec- ades will allow the creation of the first sub-global coverage of the planet. However, the flip side of high accuracy and detail is the need to store fan- tastically large volumes of three-dimensional data without loss of accuracy. At the same time, the ability to work with the specified data in both 2D and 3D mode should be improved. Standard storage methods (file method, geo- databases, archiving, etc) solve the problem only partially. At the same time, there are some other alternative methods that can remove current re- strictions and lead to the emergence of more flexible and functional spatial data infrastructures. One of the most flexible and promising ways of laser data storage and processing are quadtree and octree-based approaches. Of course, these approaches are more complicated than typical file data struc- tures, that are commonly used for LIDAR data storage, but they allow users to solve some typical negative features of point datasets (processing speed, non-topological spatial structure, limited precision, etc.). Keywords: LIDAR, Big Data, Digital Earth, Octree. 1 Introduction Laser scanning, or LIDAR, is one of the youngest types of surveying. A laser shooting device (laser scanner) is a laser range finder that performs a single or group of simultaneous range measurement with simultaneous measurement of beam deflec- tion angles (vertical and horizontal plane). If each of the measured reflections has Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). * Publication is supported by RFBR grant № 19-07-00844. 2 I. Rylskiy accurate time stamps, the data can be synchronized with high-precision inertial navi- gation systems (INS) and GNSS tools for positioning in global coordinate systems. In addition to coordinate information, a laser scanner can register the amplitude of the reflected signal, the albedo of the surface reflecting the signal, as well as register more than one reflected signal, classify the signals according to the shape of the re- flected pulse, and much more. The method is rapidly developing since the late 1990s. The number of laser scan- ning systems produced during this time around the world is (according to various estimates) several thousand units, with the vast majority of these systems released in recent years. Each of these systems is characterized by the highest data performance. However, most of these systems (due to the extremely high cost) are constantly in operation. All of these LIDARs continuously produce a huge amount of data. However, the task of combining these enormous datasets into services like Google Earth has not yet been resolved even at a conceptual level. Storing petabyte-sized data arrays while preserv- ing the possibility of quick access requires different approaches compared to those accepted now. 2 Defining problems 2.1 Modern types of LIDAR systems By mid-2020, several main types of laser scanning systems can be distinguished. Roughly all systems can be divided into two large classes, operating on the principles of measuring the time of flight of the signal (TOF-LIDAR), and measuring the phase of the incoming radiation wave generated earlier by the device itself at frequencies substantially lower than the radiation frequency of the scanner itself (phase LIDAR). Phase-range lasers generate a modulated light signal. When measuring distance, the phase shift of the emitted and received signals is measured. With significant range, the scanning frequency and accuracy are reduced. High performance is achieved by reducing the range to 120, 50 and even 15 meters - while measuring ac- curacy increases. A number of algorithms exist and have been tested for operation in conditions of ranges exceeding limits mentioned above. At the same time, the accura- cy suffers: if at short distances it can be sub-millimeter, then at a distance of 150 m achieving accuracy of 20-30 mm is very problematic. Lasers that measure the time of flight of a signal theoretically have no reasonable range limitations at all - it is determined only by the power of the laser pulse and the accuracy of beam focusing. However, they have limitations on accuracy. In 2020, the most advanced instrument samples have a range determination accuracy of about 3 mm (over a range of ranges from 5 to 500 m) and up to 25 mm at ranges of up to 6,000 m and more [1]. Each of these LIDAR types also can be divided into subclasses - scanners for stat- ic and for kinematic measurements. Systems for static work are usually called terres- trial laser scanners, systems for kinematic measurements (in motion) are usually also divided into sub-classes: airborne laser scanners (for UAVs or manned carriers) and Big LIDAR Data in Digital Earth: Ways Out of Dead End 3 mobile laser scanners (for working with cars, trains, special carriers or being trans- ported by a person). Terrestrial systems may not have accurate time stamps and may not be able to be automatically accurately positioned on the surface of the Earth. Kinematic systems, in principle, cannot exist without this feature, and are always positioned first in global coordinate systems (for example, WGS84), and only then are they subject to adjust- ment procedures. 2.2 Performance of modern LIDAR systems The performance of all modern kinematic scanning systems is extremely high. The most advanced airborne scanning systems - for example, the Riegl Q1560ii - have a scanning speed of 4,000,000 points per second [2]. The most advanced mobile sys- tems - for example, Riegl VMX RAIL - up to 3,000,000 points per second [3]. Sys- tems of other companies (Leica, Optech, a number of others) have worse characteris- tics, but their flagship systems also reach 1,000,000 points per second. Static scanning systems also allow surveying at speeds of over 1,000,000 meas- urements per second (Riegl VZ2000i). At the same time, as a rule, all systems are also equipped in parallel with cameras synchronized with laser scanners. This makes it possible (with a high-quality calibration of all devices) to assign to each scan point also the RGB brightness attributes of a given surface area. It allows to make laser point cloud colorized. Thus, all the described systems are capable of creating spatial data of uncompro- mising accuracy with a speed not previously encountered. It is easy to calculate that in 5 hours of full-time continuous operation, a system with a performance of 2 million points will produce about 36 billion points. In fact, it is about 50 billion, since one impulse usually gives more than one reflection in the presence of vegetation at the territory. If we briefly estimate the data storage capacity of 60 bytes per point, this gives about 3 terabytes of laser scanning data only. And here we did not take into account that aerial images are also usually being associated with laser scanning. In terms of 1 km2, the amount of data also looks quite impressive. Now it is not uncommon to ensure accuracy of 10 points per 1 m2 and detail of images at 5 cm / pixel. This requires 600 bytes for laser scanning data and approximately 200 bytes for storing the finished orthomosaic in JPG format, for a total of 800 bytes per 1 m2. Or 0.8 gigabytes to store data per 1 km2. To store data on an area of a medium-sized region of the Russian Federation with an area of 100,000 km2, 80 terabytes will al- ready be required. Up to 14 petabytes of space will be required to store data of the above detail on the territory of the Russian Federation (Fig.1). Taking into account the above factors, it can be noted that the task of storing and processing laser scanning data in large volumes in the near future will become more and more difficult. Solving this problem will require more and more specific ap- proaches, more typical for working with other types of Big Data [4]. 4 I. Rylskiy Fig. 1. 3D chart, illustrating the relative volume of spatial data per 1 km2, produced by satellite sensors, airborne photography and LIDAR. 3 Current and perspective methods of LIDAR data storage and processing 3.1 File data storage - inconvenient reality Despite the growing difficulties in use, the most common method of storing in- formation is to use a file storage system with spatial segmentation. During the imple- mentation of each project, the user creates a system of spatial polygons, each of which corresponds to a certain file with the spatial information stored in it (points of laser reflections, orthomosaics, etc.). These can be regular rectangles (tablets), trapeziums of nomenclature sheets in state patterns, pieces of buffer zones for a linearly extended object, overlapping polygons of complex shape, and other geometric objects. The connection between vector objects (as a rule, this is GIS data with a layered storage system) and physical files on a disk is usually established using a hyperlink or as a text string with a way to data file location. The disadvantages of this method are quite obvious. First of all, it is needed to store materials in one or several projections without the ability to switch quickly from one projection to another. Projects with such an approach usually cannot be simply combined among themselves, either because of different coordinate systems or differ- ent types of “slicing” and just different technical parameters of the survey. It is also not easy to select the optimal size of the “storage element” - a too small storage area requires constant work with a large number of pieces, too large does not allow loading all the data into RAM or reduces the workstation’s productivity. However, the most unpleasant aspect of such approach is the form of storing laser scanning data in the form of dots itself [5]. Almost all modern data storage formats - Big LIDAR Data in Digital Earth: Ways Out of Dead End 5 PLY, LAS, FBI, BIN and the like - use a data structure in the form of sequential rec- ords for each laser point, each of which contains a time stamp, coordinates, color, amplitude and a number of other parameters. The laser scan data file is thus a gigantic table. Despite the fact that in this type of record the data is spatial, they are not topologi- cal. Even an elementary query like “select points within a radius of X from point Y” will require complete processing of the entire data array in the current data segment. If the point is closer than at a distance X from the border of the current segment, then processing can be stopped and a new segment will need to be loaded (often manual- ly). To avoid such situations, segments are often made overlapping [6], which does not facilitate the operation of algorithms and requires even more disk space. This is a critical issue with a file-based segmented approach to data storage. As the size of the segment decreases, its severity increases. With very dense data (up to 10,000 points per 1 m2 and above), the size of the segment can decrease to 100 and even 50 meters - with a larger amount, RAM overflow will occur. This is a possible situation when implementing large mobile scanning projects. 3.2 Archiving and spatial roughening Partially, the problem of a small data segment, as well as the problem of reducing the total amount of stored data, can be solved by combining a decrease in the spatial accuracy of the data and dynamic archiving of information. The final accuracy of a significant part of airborne laser scanning systems rarely exceeds 20 mm, and mobile scanning systems - 3 mm. However, in most cases, data storage uses a large number of decimal places, up to tenths of a millimeter. Of course, with the above-described real accuracy, this is an overly detailed record of coordi- nates. If we estimate the real accuracy and the necessary data density, we user can make certain optimization. For example, to limit the detail of recording coordinates to 1 cm. In addition, filtering is possible. In this example the points at a distance of 0.1-1 cm from each other are considered as one point (if the parameters of surveying do not give reason to expect such density of data). Due to this, it is possible to compress the volume of data by 20-40%. A similar approach is implemented in Bentley Point Tools, and partially can be implemented in TerraSolid and some other software prod- ucts as well. The coordinates of the points change quite slowly - for each subsequent point, most of the digits in the coordinates and timestamps coincide with the previous and subsequent points. This feature creates good opportunities for implementing data archiving algorithms. In combination with the previously described approach, such methods allow you to compress the original data cloud 2.2-3.5 times (sometimes even more) in comparison with the original data volume. A similar approach is used when storing data in LAZ format. Naturally, this causes a decrease in processing speed, but allows to reduce the load on storage systems. And this methods does not solve the problem of non-topological data. 6 I. Rylskiy 3.3 Geodatabase One of the most powerful features of modern GIS packages (for example, ArcGIS) is the storage of spatial data in the form of a geodatabase. This approach solves the problem of segmentation and reprojection of materials, including the laser scanning data (also as non-topological points). The problem in this case is extremely slow operation with laser data. The reason is that each point is treated as a separate GIS object, with its inherent attributive characteristics, and processed with all care. The price for this is a very large storage capacity and a slow speed. In practice, work- ing with ArcGIS with large amounts of laser data is extremely inconvenient. There are currently no fundamental ways to accelerate ArcGIS work with geodatabases. 3.4 Quadtree and octree The quadtree (4-tree, Q-tree) is a data structure in computer science, in which each node has exactly 4 descendants [7]. Quadrant trees can be used to recursively split two-dimensional space into 4 regions. Each area can be a square, a rectangle, or have an arbitrary shape. The term quadtree was introduced in 1974. At its core, the quadtree allows you to split the space into adaptive cells, depending on the distribution of information. Similarly to a quad tree for two-dimensional space, for three-dimensional space, it is possible to build a similar structure - an octree (di- vides three-dimensional space into 8 octants). The ideology of storing and working with data in the form of a quadtree is used in the well-known Google Earth application for hierarchical ordering and addressing of data elements - small raster images that form the final mosaics of satellite images when working in this application. A partial approach to similar work with data is used in the RiProcess (Riegl) soft- ware, and is also used to accelerate the visualization of point clouds in Bentley Point tools, TerraStereo and Riegl Riscan Pro. However, these software environments are unsuitable for storing large infrastructures of spatial by a number of reasons. So, in Riprocess it is very difficult to export a part of the data by a spatial query - for exam- ple, exporting data within the specified polygon. There is no way to unlimitedly build a data array without rebuilding a quad tree across the entire data volume. And, most importantly, quad trees in these software products are precisely two-dimensional, and not three-dimensional structures. Nevertheless, this software package has already implemented an approach to calculating statistical characteristics for each cell. 4 Proposed approaches to solving the existing problems of storing laser scanning data 4.1 Common issues Taking into account the precious issues, we consider it necessary to note that none of the software environments (including modern GIS packages) offers a reasonable solution for storing and working with large amounts of laser scan data in point form. Big LIDAR Data in Digital Earth: Ways Out of Dead End 7 Previously, such problems took place and were successfully solved in geoinfor- matics by gradually unifying the forms and formats of data and their storage algo- rithms (vector graphics, storage in geographic coordinates, layered representation of data, open formats and algorithms). This process took place actively from 1980 to 2010 and successfully lead us to appearance of a large number of GIS packages that make it easy to manage data of any size and spatial coverage, while providing the ability to export data from one program to another without loss of information. A similar step must be taken in LIDAR data storage. Requirements for storing Big LIDAR Data based on 3D points can be formulated as follows: - the possibility of unlimited increase of data coverage up to global; - the ability to quickly visualize (in any form) spatial data at any level of datail (up to the consideration of individual points and work with them); - the possibility of merging various LIDAR data sources without reprocessing whole volume of previously stored information ; - the possibility of spatial queries and export of individual pieces of LIDAR data; - well-developed system of visualization of materials in 2D and 3D form. Obviously, we can conclude that it is the form of data storage that determines the possibilities for achievement and processing. Indeed, if you look at the history of formats and technologies, it is the flexibility of the format that determines the accept- ability of a particular data processing technology. For example, JPG and MP3 for- mats can be noted. The first opened the way for an explosive increase in the number of stored images and the possibility of transmitting graphics over the Internet. The second revolutionized the field of portable players and online audio streaming. A similar, but less resounding success is in the DXF and GEOTIFF formats. Creation of the suitable and appropriate data format leads to the rapid develop- ment of technologies. The format in this respect is somewhat similar to a cartridge for a firearm. It is the cartridge that is the cornerstone of the rifle complex. Weapons are designed according to cartridge specifications. In the field of spatial data, the situation is similar. 4.2 Estimated specific features of LIDAR data format suitable for Big Data concept. So, by our opinion, the best way to store Big LIDAR Data for the above purposes is to use an octree. The extent of the dataset is a sphere with the dimensions of the Earth. Working with data stored in geographic coordinates (degrees of latitude and longitude) always encounters the problem of translation into any projection and gen- erates difficulties associated with distortions of projections. Therefore, the most con- venient - from the point of view of calculations and the subsequent transition to other coordinate systems - is the Cartesian coordinate system (the origin is in the center of the Earth, coordinates in meters). In this case, the initial space is divided into the ini- tial 8 cubes with a side length of about 8388.608 km (zero level of detail), and then at each level the dimensions of the cube are reduced by 2 times. At the 23rd level, the length of the edge of the cube will reach 1 meter, at the 30th – 1.25 cm, at the 33rd – 1.56 mm. 8 I. Rylskiy In the format, it is necessary to establish a certain spatial limit, which can not be exceeded while decreasing the cube size [8]. With a more detailed approximation in 2D mode, the cube is displayed as a pixel and is no longer detailed, when approaching in 3D mode, points inside the cube are displayed (all). When moving away from the minimum possible cube in 3D mode, it is displayed as a single point, with further hierarchical coarsening as it moves away [9]. For a Digital Earth has global coverage, a reasonable cube size can be 0.25-1.0 m, for local datasets it can be 10 cm or even less. We call a cube of a space of similar size an elementary cube. 4.3 Statistics and metadata For each cube, the statistics of 3D points can be calculated (in advance, when cre- ating the file). This can be the “center of gravity” of the cube (the average position of all its points), the calculation of the average coordinates of the points and their stand- ard deviations for each axis, the averaged color, the amplitude, reflectivity, and the coordinates of the averaging plane (the plane, the sum of the squares of the distances to which from each of the points in the cube is minimal). For each cube, its geograph- ical coordinates are calculated in advance, as well as the coordinates in a widespread projection (UTM). Similarly, the ellipsoidal and orthometric heights of the center of gravity of the cube (adjusted for the geoid model) and the height difference within the cube are calculated. While visualizing such a data structure at any of its sections in 2D (top view) or 3D mode (isometric view) will look like an array of points, where in each pixel of the screen one cube is displayed by one point. The color of the point is determined by the selected display method (color scale) and can display the height, the spread of heights within a given pixel, the amplitude of the reflected signal, and so on (Fig.2). Fig. 2. Riegl RiProcess 2D quadtree data storage. Left – 2D view, quad cells (1 m elementary cells) can be seen. Statistic parameter, used for visualization – height differens of all points inside the cell. Right – the same region visualized in 3D mode by switching to points, stored inside quadtree cells. Colored using attribute «Аmplitude» In addition to spatial properties, metadata is stored for the cube. To store the date and time of receiving the points (if there are multiple shots in the cube, it is displayed in a certain reserved color, and if they are absent, in accordance with the selected Big LIDAR Data in Digital Earth: Ways Out of Dead End 9 color scale, which does not include the reserved color). The time storage detalisation for the cube should be one day; for individual points, the exact GPS time of point measurement is stored, allowing their additional processing of laser reflection points. It may also be provided for storing textual information about the data source, their accuracy and other similar characteristics, which can be stored in reserved data fields. 4.4 Approaches for saving and merging 3D BIG LIDAR DATA volumes on the drive Storing data in a format whose ideology is described above will certainly run into the difficulties of storing infinitely large amounts of data in one file. Technically, this is possible. Nevertheless, in Google Earth the similar task (storage of the huge image cache) is solved by storing a large number of small files (pieces of rasters), and when replenishing the system, it is simple to add more detailed pieces of rasters to the fold- er to existing rasters. Similar approach can be implemented in this case. Of course, storing data in the form of “one elementary cube - one file” is unreasonable. But it is quite possible to introduce storage of larger cubes (100-1000 m in size) in the form of separate files, and already inside them to save data of elementary cubes. We call such a larger cube a “group cube”. In this case, when adding a new dataset to the old one, it will be necessary to simply rewrite some new group cubes to a new location, and re-calculate some of them by updating the statistics and metadata of each updated group cube. However, the rest of the data will not have to be recalculated. Exporting of the data can be done in the similar way. 5 Conclusion Creating a unified data format based on algorithms for working with the octree will allow solving a whole group of BigData problems in the field of laser scanning for Digital Earth: the need to structure large segments of data, quick access to arbi- trary sections of data, and also minimize the cost of merging and combining overlap- ping sections of simultaneous data. An important feature of the data structure in the form of an octree when storing la- ser points is the ability to create topological data structures that, in addition to statisti- cal and spatial information, contain information about neighboring data segments. Undoubtedly, such an organization of information is significantly more complicat- ed than the currently accepted methods of storage in open formats (LAS, BIN, FBI, XYZ). However, without solving this problem, further development and mass imple- mentation of the use and analysis of laser scanning data will be extremely difficult. 10 I. Rylskiy References 1. RIEGL VZ6000 homepage, http://www.riegl.com/nc/products/terrestrial- scanning/produktdetail/product/scanner/33/ Last accessed 10 Jul 2020 2. RIEGL 1560ii homepage, http://www.riegl.com/nc/products/airborne- scanning/produktdetail/product/scanner/68/. Last accessed 13 Jul 2020 3. RIEGL VMX RAIL homepage, http://www.riegl.com/nc/products/mobile- scanning/produktdetail/product/scanner/67/ Last accessed 19 May 2020 4. Giuliani, G., Chatenoux B., De Bona A.: Building an Earth Observation Data Cube: les- sons learned from the Swiss Data Cube (SDC) on generating Analysis Ready Data (ARD). Big Earth Data 1-2 (1), 100–117 (2017) 5. Chen Q.: Airborne lidar data processing and information extraction. Photogrammetric En- gineering & Remote Sensing 2 (73), 109–112 (2007) 6. Rylskiy I.A., Markova O.I., Eremchenko E.N., Panin A.N.: Building object-based virtual models based upon terrain laser scanning and uav data. International scientific review 6 (71), 75–83 (2020) 7. Octree homepage, https://en.wikipedia.org/wiki/Octree Last accessed 16 Jan 2020 8. Anh-Vu V., Truong-Hong L., Laefer D., Bertolotto M.: Octree-based region growing for point cloud segmentation. ISPRS Journal of Photogrammetry and Remote Sensing 104, 88–100 (2015) 9. Wang M., Tseng Y: Incremental segmentation of lidar point clouds with an oc- tree‐structured voxel space. The Photogrammetric record 26 (133), 32–57 (2011)