<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PointNet with Spin Images</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Department of Software and Computer Science Education</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>96</lpage>
      <abstract>
<p>Machine learning on 3D point clouds is challenging due to the absence of a natural ordering of the points. PointNet is a neural network architecture capable of processing such unordered point sets directly, which has achieved promising results on classification and segmentation tasks. We explore methods of utilizing point neighborhood features within PointNet and their impact on classification performance. We propose neural models that operate on point clouds accompanied by point features. The results of our experiments suggest that traditional spin image representations of point neighborhoods can improve the classification effectiveness of PointNet on datasets comprised of objects that are not aligned into canonical orientation. Furthermore, we introduce a feature-based alternative to the spatial transformer, which is a sub-network of PointNet responsible for aligning misaligned objects into canonical orientation. Additional experiments demonstrate that the alternative might be competitive with the spatial transformer on challenging datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Machine analysis of 3D geometrical data is becoming an important area of
research because of the increasing demand from applications such as autonomous
driving. Thanks to the advances in the development of depth sensors, large
amounts of such data are publicly available, which makes development and
employment of data-oriented algorithms more accessible.</p>
      <p>Convolutional neural networks (CNNs) have established state-of-the-art
results in computer vision tasks such as image classification, but their application
to tasks involving 3D data remains a problem. CNNs rely on regular grid
representations that are very memory demanding and computationally expensive to
process in 3D. CNNs have already been utilized on voxel data, but even with
optimizations like hierarchical octrees, this solution is limited to grids of resolution
256³ and will probably be very difficult to scale to finer resolutions.</p>
      <p>Point cloud representation is an appealing alternative to voxel representation for
several reasons. Data sparsity is naturally reflected in the point cloud
representation, which is typically much more concise than the voxel representation.
There is no trade-off between precision and memory demands like in the case
of the voxel representation, and a point cloud can capture an arbitrary level of detail.
Point clouds are also close to the raw measurements of sensors such as LiDAR or RGBD
cameras. Automatic machine analysis of the point cloud representation is, however,
challenging, mainly because the points of a point cloud have no ordering, so any
permutation of the points represents the same point cloud.</p>
      <p>
        PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a neural network architecture designed to process point cloud
representations directly. It obtains a hidden representation of each input by
independently processing each point with a Multi-Layer Perceptron (MLP). Those
representations are then aggregated by maximum pooling to obtain a
permutation-invariant representation. PointNet provided a considerable boost in the
computational efficiency of 3D object classification while keeping up in terms of
classification performance with other state-of-the-art approaches. Furthermore,
the model is also straightforwardly applicable to other useful tasks involving 3D
data, such as the task of point cloud segmentation. PointNet effectively
samples the 3D domain via so-called point functions. But, unlike e.g. voxelization,
it works in an efficient and data-dependent way. Unfortunately, when used on
objects appearing in an arbitrary orientation, the effectiveness of sampling the
3D domain seems limited, as the number of locations in which the points can be
located is greatly increased.
      </p>
      <p>
        The authors utilize spatial transformer network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in order to deal with this
issue, but aligning point clouds to a canonical orientation is a difficult task, which
would itself require recognition of object classes in some cases. Furthermore, the
spatial transformer itself relies on a PointNet within PointNet, so the alignment
capabilities of the spatial transformer share the limitations of PointNet.
      </p>
      <p>
        It seems intuitive that additional local information extracted from point
neighborhoods could be beneficial for classification, especially in the case when
point clouds are not aligned to a canonical orientation. A successor of PointNet
called PointNet++ [
called PointNet++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was introduced in order to add the capability of utilizing
local neighborhood features to PointNet by applying a small PointNet on
point neighborhoods and repeating the process on gradually higher-dimensional
point clouds.
      </p>
      <p>In this paper, we follow the direction of PointNet++ towards adding local
point features into PointNet. We focus on tasks in which the input objects
are not aligned into canonical position. We develop models based on rotation-invariant
point features and PointNet. Several experiments were conducted in
order to compare our models with the PointNet baselines. Our model achieves
comparable classification performance on datasets with manually aligned objects
and noticeably better performance on datasets in which objects are oriented
arbitrarily. We also propose a simple feature-based heuristic for point cloud
alignment in the form of a neural network layer, and we empirically show that
our heuristic can be more effective than the spatial transformer in certain cases.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        There are several ways of applying machine learning to 3D point clouds that are
currently actively researched. One common way is to transform the point cloud
representation to voxel grid representation, which can be processed by 3D CNNs
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Scaling these methods to the classification of complex objects which
require a fine level of detail to be distinguished is, nevertheless, difficult due to the
inherent trade-off between manageable memory demands and admissible loss of
information.
      </p>
      <p>
        Sequences of images obtained by rendering the point cloud representation
from different view-points are another grid-based representation, which can be
processed naturally by 2D CNNs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These approaches have
established state-of-the-art results on classification benchmarks. A limitation of these
methods is the difficulty of extending them to different tasks such as point cloud
segmentation. Point cloud representations are also in principle capable of
capturing more complex data than surfaces, and such data would be difficult to
render into images without potentially losing important information.
      </p>
      <p>
        Point clouds can also be processed directly by several recent models.
PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] applies a neural network on every input coordinate of the point cloud
independently and extracts a permutation invariant representation by applying
global pooling. Spatial transformer is applied on the input coordinates to deal
with variance of input orientations and is also applied on the hidden
representations. From the reported results, it is, however, not clear how the model would
perform on datasets with objects of highly varying pose. PointNet++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] utilizes
small PointNet networks on point neighborhoods across several scales in order to
introduce local point features into the original architecture. Such local features are
powerful, since they are learned from the point cloud data directly, but they are not
invariant under rotations, which might cause a decrease in classification
performance on unaligned data. Kd-Net [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] allows convolution-like processing of point
clouds by building a balanced kd-tree and then following a bottom-up traversal
of the tree, applying a learned affine transformation and a non-linearity to the
features contained in the child nodes of each parent node. Kd-Net is also not
invariant under rotations and could also potentially benefit from rotation-invariant
local features.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>
        Our work extends PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by using local point features in a way that is
similar to PointNet++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We focus on features which are rotation invariant,
and we investigate whether such features have a positive impact on classification of
unaligned data. In this section, we describe the feature extraction techniques and our
method for aligning point clouds. Section 5 describes the exact models derived
from the methods of this section.
      </p>
      <sec id="sec-3-1">
        <title>Spin Images</title>
        <p>
          There is a plethora of descriptors invariant under rigid transformations which
could be incorporated into PointNet. In this work, we opt for spin images [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
primarily because the spin image representation can be straightforwardly
processed by empirically successful CNNs. We leave the investigation of other
descriptors for future work. We briefly summarize the spin images technique here.
[Figure 1: spin image coordinates. A point x in the neighborhood of p, which has normal n, is described by its coordinates (α, β).]
        </p>
        <p>
          In order to extract a spin image of a neighborhood around a point p ∈ ℝ³,
knowledge of a normal vector n ∈ ℝ³ associated with p is required. For
classification of point clouds that represent surfaces, this is not a very restrictive
assumption, since normal vectors can be estimated from the eigenvalue
decomposition of local covariance matrices [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
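<p>As a concrete illustration of this estimation step, the following sketch (our own illustrative code, not the referenced implementation; the neighborhood size k is an assumption) computes a normal for each point as the eigenvector of the local covariance matrix with the smallest eigenvalue:</p>
<preformat>
```python
import numpy as np

def estimate_normals(points, k=16):
    """Estimate unit normals as the eigenvector of the local covariance
    matrix with the smallest eigenvalue."""
    n = len(points)
    normals = np.empty_like(points)
    # pairwise squared distances (fine for small clouds; use a k-d tree at scale)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    for i in range(n):
        nb = points[knn[i]]
        cov = np.cov(nb.T)
        w, v = np.linalg.eigh(cov)  # eigenvalues in ascending order
        normals[i] = v[:, 0]        # direction of least variance
    return normals
```
</preformat>
<p>Note that the sign of each normal is ambiguous; consistent orientation would require additional propagation as in the cited reconstruction method.</p>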
        <p>Given the input points of a neighborhood around p, every input point x is
projected to the new coordinates (α, β) indicated in Figure 1 and accumulated into
a two-dimensional histogram. If the points carry additional information in the
form of a vector, such as a color, the vector can also be accumulated into the bins,
for example by addition. Spin images have appealing properties. Their
descriptiveness is easily adjusted by changing the histogram resolution. They can also
be made local or global point cloud descriptors by changing the size of the
point neighborhood.</p>
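<p>A minimal sketch of the accumulation just described (our own illustrative code; the exact bin layout and support are assumptions, not the paper's implementation):</p>
<preformat>
```python
import numpy as np

def spin_image(p, n, points, size=32, radius=1.0):
    """Accumulate neighbors of p into a (size x size) spin-image histogram.
    alpha: distance from the axis through p along n; beta: height along n."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                    # signed height along the normal
    alpha = np.sqrt(np.maximum((d * d).sum(1) - beta ** 2, 0.0))
    mask = (alpha <= radius) & (np.abs(beta) <= radius)
    img, _, _ = np.histogram2d(alpha[mask], beta[mask],
                               bins=size,
                               range=[[0, radius], [-radius, radius]])
    return img
```
</preformat>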
      </sec>
      <sec id="sec-3-2">
        <title>Spin Coordinates</title>
        <p>The spin image coordinate transformation also provides a straightforward way to
make PointNet++ features invariant under rotation, simply by applying the
transformation to local point clouds before they are processed by the local PointNets of
PointNet++. This is essentially equivalent to forcing the point functions learned
by PointNet to be axially symmetric around the local normal vector. We will
refer to these features as the spin coordinates.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Orientation Alignment Layer</title>
        <p>For a given object, in the form of a point cloud and corresponding point features,
if we were able to select points accompanied by distinctive features,
then objects of the same class could be approximately aligned in a coordinate
system based on these points. Based on this idea, we designed a
simple heuristic algorithm, which we call the orientation alignment layer (Algorithm
1). The algorithm is easily extensible to the problem of pose alignment, but
we will only consider orientation alignment for simplicity (see Section 6.1 for clarification of
pose and orientation). Algorithm 1 rotates an input point cloud so that points
with selected features are positioned in the direction of canonically chosen
orthogonal vectors.</p>
        <p>Algorithm 1 Orientation Alignment Layer(X)
Input: X = (x_i)_{i=1}^n where x_i ∈ ℝ^{3+d}  ▷ sequence of coordinates and features
Output: sequence of n rotated points from X</p>
        <p>Let (c_i)_{i=1}^n, c_i = (x_{i,1}, x_{i,2}, x_{i,3}), x_i ∈ X  ▷ sequence of coordinates
Let (f_i)_{i=1}^n, f_i = (x_{i,4}, x_{i,5}, …, x_{i,3+d}), x_i ∈ X  ▷ sequence of features
i, j ← Feature Selection Heuristic((f_i)_{i=1}^n, 2)  ▷ Algorithm 2
Let x, y ∈ ℝ³ be two orthogonal unit vectors chosen canonically
R₁ ← the rotation matrix such that R₁ c_i/‖c_i‖ = x
v ← R₁ c_j
v ← v − (v·x)x
R₂ ← the rotation matrix such that R₂ v/‖v‖ = y
return (R₂R₁c_i, f_i)_{i=1}^n</p>
        <p>Features that are common within a class but uncommon within an individual
point cloud could be good candidates for the selection. Selection of features that
are frequent within a class provides a consistent orientation of objects within the
same class. Furthermore, selection of features that are unique within a point
cloud provides robustness when multiple good candidates are present in the
point cloud. These rather abstract qualities are, however, not straightforward to
define and compute quantitatively.</p>
        <p>We have chosen a simple heuristic approach to the feature selection, described
by Algorithm 2. We do not have a satisfactory justification of the heuristic, but it
seems intuitive that selecting the features with maximal entries could provide at
least somewhat consistent selection. Besides, when the heuristic is applied within
hierarchical PointNets, which apply max pooling of local features, we assume that
the maximal features are likely to be important for classification.</p>
        <p>Algorithm 2 Feature Selection Heuristic(F, k)
Input: F = (f_i)_{i=1}^n where f_i ∈ ℝ^d  ▷ sequence of features
k = number of features to be selected
Output: k integer indices of selected feature vectors</p>
        <p>F′ ← (f′_i)_{i=1}^n, f′_i = max f_i  ▷ maximum entries of features
return indices of the k largest elements of F′</p>
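<p>To make Algorithms 1 and 2 concrete, here is an illustrative NumPy sketch (our own code; the choice of canonical axes and the Rodrigues construction of the rotation matrices are our assumptions, as the algorithm only requires that such matrices exist):</p>
<preformat>
```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def rotation_to(a, b):
    """Rotation matrix R with R a = b for unit vectors a, b (Rodrigues' formula)."""
    v, c = np.cross(a, b), float(a @ b)
    if np.isclose(c, -1.0):  # antiparallel: rotate by pi about any orthogonal axis
        u = np.cross(a, np.eye(3)[np.argmin(np.abs(a))])
        u /= np.linalg.norm(u)
        return 2.0 * np.outer(u, u) - np.eye(3)
    K = skew(v)
    return np.eye(3) + K + K @ K / (1.0 + c)

def feature_selection(F, k=2):
    """Algorithm 2: indices of the k feature vectors with the largest maximum entry."""
    return np.argsort(F.max(axis=1))[::-1][:k]

def orientation_alignment(C, F):
    """Algorithm 1: rotate coordinates C so the two selected points line up
    with the canonical axes x and y."""
    i, j = feature_selection(F, 2)
    x, y = np.eye(3)[0], np.eye(3)[1]
    R1 = rotation_to(C[i] / np.linalg.norm(C[i]), x)
    v = R1 @ C[j]
    v = v - (v @ x) * x                     # project out the x component
    R2 = rotation_to(v / np.linalg.norm(v), y)
    return C @ (R2 @ R1).T, F
```
</preformat>
<p>After alignment, the first selected point lies on the x axis and the second lies in the xy half-plane with positive y, regardless of the input orientation.</p>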
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Datasets</title>
      <p>Our experiments were based on datasets which are described in this section.</p>
      <sec id="sec-4-1">
        <title>ModelNet</title>
        <p>
          The Princeton ModelNet dataset [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] has two variants: ModelNet10 which
contains 4899 objects of 10 categories and ModelNet40 which contains 12311 objects
of 40 categories. We use point clouds consisting of 1024 points extracted from
the original CAD models by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In the case of ModelNet10, the individual
objects are manually aligned (each object has the identical pose). The objects are
centered and scaled so that each object fits into the unit ball. We use the original
train/test splits consisting of 3991/908 objects from ModelNet10 and 9843/2468
objects from ModelNet40. We further split the train partitions for the purpose
of validation.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Augmented ModelNet10</title>
        <p>We prepared a challenging modification of the ModelNet10 dataset by replacing
each original object with two modified copies. Every object is subjected to a random
rotation of angle up to π. The objects are translated by a vector of random
direction and of random length drawn from the uniform distribution on [0, 0.25]. Additionally,
up to 3 cubes of random size and orientation are inserted into each point cloud,
so that they never intersect the original objects. The inserted cubes were
represented by 50 points, and their maximum size was 0.5 × 0.5 × 0.5.</p>
      </sec>
      <sec id="sec-4-3">
        <title>SHREC17</title>
        <p>SHREC17 is a subset of the ShapeNet dataset consisting of 51,162 triangle meshes of
objects. We use the provided 70%/10%/20% training/validation/test split for the
experiments. There are two variants of SHREC17: normal and perturbed.
Here, we use the perturbed dataset, where the objects are subjected to random
rotations. Point clouds are sampled from the provided triangle meshes by
sampling the triangles with probability proportional to their area and then sampling
the triangle surfaces uniformly, so that the obtained point clouds are consistent
with the ModelNet point clouds. We use 1024 sampled points for the classification
and additional feature engineering as required. For the methods that require
normal vectors, we calculate the normal vectors from the meshes rather than
from the sampled point clouds. It should be noted that there are both
inward-pointing and outward-pointing normal vectors in every mesh, which most likely
hinders the performance of some of the methods relying on the normal vectors
to some extent.</p>
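<p>The sampling procedure described above can be sketched as follows (our own illustrative code, not the authors' preprocessing script):</p>
<preformat>
```python
import numpy as np

def sample_mesh(vertices, faces, n=1024, rng=None):
    """Sample n points uniformly from a triangle mesh: pick triangles with
    probability proportional to area, then sample barycentric coordinates."""
    rng = rng or np.random.default_rng()
    tri = vertices[faces]                      # (m, 3, 3) triangle corners
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1                           # reflect to stay inside the triangle
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    t = tri[idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
```
</preformat>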
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Models</title>
      <p>In this section, we provide a description of every model that will be evaluated
in the next section. The model architectures were selected so that their sizes
would be roughly comparable in terms of the number of learnable parameters.
We did not fine-tune hyperparameters in this work, as we were mainly interested
in observing major differences between models, and we did not intend to achieve the
best performance.</p>
      <p>1 PointNet: A small PointNet model. The shared MLP part of the model
before the maximum pooling is formed of fully connected layers with 64,
64, 64, and 256 neurons. The MLP part of the model after the maximum pooling
consists of dropout with probability 0.2 and two fully connected layers with 512
and 128 neurons.</p>
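<p>The structure of Model 1 can be sketched as an untrained forward pass (our own NumPy illustration; dropout is omitted and the final classification layer is our assumption — the point is the shared per-point MLP followed by max pooling):</p>
<preformat>
```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

class TinyPointNet:
    """Forward-pass sketch of Model 1: a shared per-point MLP (64-64-64-256),
    max pooling over points, then an MLP head (512-128-classes).
    Weights are random here; a real model would train them."""
    def __init__(self, n_classes, rng=None):
        rng = rng or np.random.default_rng(0)
        sizes = [(3, 64), (64, 64), (64, 64), (64, 256)]    # shared MLP
        head = [(256, 512), (512, 128), (128, n_classes)]   # after pooling
        self.shared = [rng.normal(scale=0.1, size=s) for s in sizes]
        self.head = [rng.normal(scale=0.1, size=s) for s in head]

    def __call__(self, points):                 # points: (n, 3)
        h = points
        for W in self.shared:
            h = relu(h @ W)                     # applied independently per point
        g = h.max(axis=0)                       # permutation-invariant global feature
        for W in self.head[:-1]:
            g = relu(g @ W)
        return g @ self.head[-1]                # class logits
```
</preformat>
<p>Because the pooling is a maximum over points, the output is identical for any permutation of the input point set.</p>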
      <p>2 PointNetST: The same model as the previous PointNet model, but a
spatial transformer parametrized by linear or affine transformations is additionally
inserted as the first layer for the appropriate tasks, that is: the linear transformer
for the tasks where the input objects are possibly rotated but not translated, and
the affine transformer for the rest. The spatial transformer itself is a PointNet
consisting of layers with 32, 32, and 128 neurons before the maximum function
and then a single layer of 128 neurons.</p>
      <p>
        3 PointNetSTL: Re-implementation of the original PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with two
differences: we only use the first spatial transformer, and we do not utilize
batch-normalization layers.
      </p>
      <p>4 Spin Images: This model makes predictions based on spin images only and
does not utilize the point coordinates. Spin images are of size 32 × 32.
Thirty-two spin images of radius 1 are utilized. The 32 points are selected by the
farthest point sampling algorithm. The model consists of 3D convolutional layers with
32, 64 and 128 filters of size 1 × 3 × 3 followed by 1 × 2 × 2, 1 × 2 × 2 and
32 × 1 × 1 maximum pooling, respectively, followed by dropout with probability
0.2 and fully connected layers with 512 and 256 neurons.</p>
      <p>5 Hierarchical Spin Images: This model utilizes both the representations
obtained from the spin images and the coordinates. Spin images are of size 16 × 8.
Thirty-two spin images of radius 0.6 are utilized. Spin images are processed by
the same 3D CNN as in the previous model, then the representation is concatenated
with the point coordinates and fed into PointNet (Model 1).</p>
      <p>6 Hierarchical PointNet: Thirty-two point neighborhoods, each consisting
of 32 nearest points, are utilized. Each neighborhood is processed by a small
PointNet consisting of layers of 32, 32, and 32 neurons followed by max pooling
and 64 neurons. The extracted embedding is concatenated with point
coordinates and fed into PointNet (Model 1).</p>
      <p>7 Hierarchical PointNet Spin Coordinates: The same model as the
Hierarchical PointNet, but the local coordinates are rst transformed using the spin
image coordinate transformation.</p>
      <sec id="sec-5-1">
        <title>8 Hierarchical PointNet Orientation Alignment</title>
        <p>This is the same as the previous model, except that the orientation alignment layer (see Algorithm 1) is
additionally inserted after the concatenation of embeddings and coordinates.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>In this section, we describe the experiments that were carried out in order to
empirically compare the suggested models and features. In order to evaluate the
performance of the models, we measured classification accuracy on the official test sets
given in the respective benchmark tasks. We further split the original training
sets into two parts for training and validation. We evaluate the performance of the
models after each epoch of training on the validation partition of the data. The model with
the best performance on the validation data across all epochs is taken as the result of
the training. The categories are not balanced in terms of their frequencies, so
the data are split in a stratified manner, meaning that the frequencies of the categories
are the same in the training and validation parts. We use the Adam optimization
algorithm with parameters (α = 0.001, β₁ = 0.9, β₂ = 0.99) and batch size
128. The training strategy is adjusted to take the imbalanced categories into account by
filling batches in a way such that categories are uniformly distributed in each
batch. We apply L2-regularization of the network weights with λ = 0.0001.</p>
      <sec id="sec-6-1">
        <title>Robustness to Rotations</title>
        <p>Let us informally define notions about object orientation in order to clarify the
descriptions of the experiments in this section. We assume that every object has a
unique reference pose which is given by semantics. The reference pose is described
by a canonical coordinate system. Pose of an observed object is then the
coordinate system (taken w.r.t. the canonical system) in which the object is in its
reference pose. When we refer to orientation of an object, we mean the pose
of the object without translation element, i.e. the coordinate systems are zero
centered. We will also use the notion of orientation vector, by which we mean
a vector parallel to one canonically chosen axis of the orientation coordinate
system.</p>
        <p>With the following experiment, we tested robustness of PointNet against
rotations of point cloud objects. We augmented the ModelNet10 dataset by
rotating the objects from dataset randomly. Two rotated samples of each object
were placed into the augmented dataset instead of each original object. We
then performed 10-fold cross-validation on the augmented dataset to evaluate
classi cation accuracy. The models PointNet and PointNetST (see Model 1 and
2) were subject to the experiment.</p>
        <p>The rotation matrices used for rotating the objects were sampled in such a
way that the orientation vectors of all objects were distributed uniformly on a
cap of the unit sphere with the apex at the original orientation vector. The
tested maximal angles of the rotations were π/4, π/2, 3π/4, and π.</p>
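<p>Sampling orientation vectors uniformly on such a cap can be done by drawing the cosine of the polar angle uniformly (a sketch of one possible implementation, our own code):</p>
<preformat>
```python
import numpy as np

def sample_cap(theta_max, size=1, rng=None):
    """Unit vectors distributed uniformly on the spherical cap of half-angle
    theta_max around the z axis (apex at (0, 0, 1))."""
    rng = rng or np.random.default_rng()
    # the area element is d(cos theta) d(phi), so cos(theta) is uniform on the cap
    cos_t = rng.uniform(np.cos(theta_max), 1.0, size)
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    phi = rng.uniform(0.0, 2.0 * np.pi, size)
    return np.stack([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t], axis=1)
```
</preformat>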
        <p>Table 1 reveals that PointNet without the spatial transformer is quite robust
to rotations. Higher accuracy could be achieved with more augmentation
and further regularization techniques. Nevertheless, the decrease in accuracy is
noticeable. The spatial transformer clearly helps for rotations of small angles,
but does not seem to help for rotations of large angles, where the accuracy
is nearly the same for the PointNet and PointNetST models.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Entropy of Orientation Distributions</title>
        <p>The spatial transformer was designed to decrease the variance of the orientations of the point
cloud objects present in the data. It seems highly probable that an increase in
accuracy is correlated with a decrease in the variance of orientations, but if we wanted
to compare other mechanisms for decreasing orientation variance, it might be
better not to rely solely on accuracy, which might also be affected by other factors.
With access to the original orientation of each object, we can directly measure
how the variance of the orientations is affected by the transformations produced by
the spatial transformer or other techniques.</p>
        <p>For simplicity, we have only considered orientation vectors in this experiment,
even though an orientation vector v is not sufficient to fully describe the orientation
of an object in 3D, since the rotation component around v is left unspecified.
The unit orientation vectors of the objects can be viewed as samples from a
distribution X on the two-dimensional unit sphere S embedded in ℝ³, which we
will refer to as the orientation distribution. The differential entropy H(X)
of the distribution X with a probability density function f whose support is S
is defined as:</p>
        <p>
H(X) = − ∫_{x∈S} f(x) log(f(x)) dx   (1)
is a measure (not in the mathematical sense) of uncertainty of the distribution.
The lower the entropy of orientation distribution within a dataset is, the more
aligned the dataset is. By comparing entropy of the orientation distribution of
the augmented input data and the data transformed by the spatial transformer,
we can observe whether the spatial transformer performs alignment of the objects
or not. We do not have access to the probability density function directly for
computation of the entropy, but we can estimate the entropy from samples. We
chose the Kozachenko-Leonenko entropy estimator [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for the purpose because
it relies on pairwise distances of the samples, which can be computed trivially,
whereas other approaches to the problem, e.g. those relying on density estimation,
are not so straightforwardly applicable to spherical distributions.
        </p>
        <p>Let X = (x₁, x₂, …, x_n), x_i ∈ ℝ^d, be the samples from the distribution
subject to the entropy estimation. Let (d_i)_{i=1}^n be the distances of the samples
x_i to their k-th nearest neighbors; then the Kozachenko-Leonenko estimate can
be written as:</p>
        <p>Ĥ(X) = ψ(n) − ψ(k) + log(c) + (d/n) ∑_{i=1}^n log(d_i)   (2)
where ψ is the digamma function, and c is the volume of the unit ball, dependent
on the norm used to calculate the distances. In the case of X being a distribution
on the unit sphere, the distance is defined by the angle between samples, and
c = 2π(1 − cos 1) is the surface area of the spherical cap with the unit angle
between the apex and the edge.</p>
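<p>The estimator of Eq. (2), specialized to the sphere, can be sketched as follows (our own code; the digamma implementation is a standard recurrence-plus-asymptotic-series evaluation, included only to keep the sketch dependency-free):</p>
<preformat>
```python
import numpy as np

def digamma(x):
    """Digamma via recurrence plus an asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def kl_entropy_sphere(X, k=3):
    """Kozachenko-Leonenko entropy estimate for unit vectors on the 2-sphere,
    using the angle as the distance (Eq. 2 with d = 2, c = 2*pi*(1 - cos 1))."""
    n = len(X)
    G = np.clip(X @ X.T, -1.0, 1.0)
    A = np.arccos(G)                           # pairwise angles
    np.fill_diagonal(A, np.inf)
    dk = np.sort(A, axis=1)[:, k - 1]          # distance to the k-th nearest neighbor
    c = 2.0 * np.pi * (1.0 - np.cos(1.0))      # area of the unit-angle spherical cap
    d = 2.0                                    # intrinsic dimension of the sphere
    return digamma(n) - digamma(k) + np.log(c) + (d / n) * np.log(dk).sum()
```
</preformat>
<p>For a uniform orientation distribution the estimate should approach log(4π), the entropy of the uniform density on the sphere, which gives a simple sanity check.</p>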
        <p>We repeated the experiment from Section 6.1, and we estimated the differential
entropy of the orientation distributions of the data before and after
application of the transformations generated by the spatial transformer. Since the spatial
transformer is not restricted to produce only orthogonal transformations, we
normalized the transformed orientation vectors in order to obtain a spherical
distribution.</p>
        <p>Results of the entropy experiment are given in Table 2. By comparing these results
with the experiment from the previous section, summarized in Table 1, we can see that there
is a relation between accuracy and the differential entropy of the orientation distribution:
the spatial transformer was most helpful in the cases of maximum
angle up to π/2, which was also the case in this experiment. The experiment
suggests that the spatial transformer probably helps in certain cases, but it is likely
not a universal remedy for the problem of pose alignment. Careful adjustment of
the spatial transformer hyper-parameters might be needed in order to enjoy its
benefits. The orientation alignment layer performed better than the spatial transformer
on fully uniform rotations, with a more significant entropy reduction. A disadvantage
of the orientation alignment layer is that it performs in a way that is independent
of the input orientation distribution entropy by the nature of Algorithm 1, so
the entropy after transformation is the same for all tests.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Benchmarks</title>
        <p>On the datasets which are mostly aligned (the ModelNet datasets), the PointNet models
1-3 perform well, and the additional local features were not helpful for classification.
On the other two datasets, which are perturbed by rotations and, in the case of
Augmented ModelNet10, additionally by translations, Model 4 was superior
to the others, probably because of its invariance under rigid transformations.
Model 5 seems stable in the sense that it is never substantially worse than the
best model in each task, so it seems that a combination of rotation-invariant
features with absolute point coordinates is a promising direction.</p>
        <p>The performance of Model 8, which utilizes the orientation alignment layer,
was inferior in most cases, because the orientation alignment layer is harmful
when the data are well aligned. However, we see that in the SHREC17 task,
the presence of orientation alignment is beneficial compared to Models 6 and 7,
which indicates that a reduction of the orientation distribution entropy was achieved.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We have empirically demonstrated that PointNet can benefit from point
neighborhood features on classification tasks where objects represented by point clouds
may appear in arbitrary orientation. Spin images seem to be promising
candidates for point neighborhood features. The experiments also suggest that the spatial
transformer technique, employed by PointNet in order to deal with the problem
of object orientation alignment, may be difficult to utilize properly depending
on the data. We have also proposed a simple experiment to measure the quality of
alignment achieved by the spatial transformer in an interpretable manner on tasks where
the orientation of objects is known in advance. We have also introduced a simple
heuristic algorithm as an alternative to the spatial transformer, which we call the
orientation alignment layer. Further experiments suggest that the orientation
alignment layer might be able to achieve better quality of orientation alignment than
the spatial transformer on difficult data.</p>
      <p>In future work, we would like to design a better feature selection method
for our orientation alignment layer in order to make it more robust. We believe
that the spatial transformer could also benefit from local point features, and we
would like to investigate this idea. It would also be possible to combine the spatial
transformer and the orientation alignment layer into a single model. Finally, we intend
to compare spin images with other rotation-invariant features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Hoppe</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>DeRose</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Duchamp</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>McDonald</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Stuetzle</surname>, <given-names>W.</given-names></string-name>:
          <article-title>Surface reconstruction from unorganized points</article-title>.
          <source>SIGGRAPH Comput. Graph.</source>
          <volume>26</volume> (<year>1992</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Jaderberg</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Simonyan</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Zisserman</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Kavukcuoglu</surname>, <given-names>K.</given-names></string-name>:
          <article-title>Spatial transformer networks</article-title>.
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>28</volume>, pp. <fpage>2017</fpage>–<lpage>2025</lpage> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Johnson</surname>, <given-names>A.E.</given-names></string-name>,
          <string-name><surname>Hebert</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Using spin images for efficient object recognition in cluttered 3d scenes</article-title>.
          <source>In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)</source>,
          pp. <fpage>433</fpage>–<lpage>449</lpage> (<year>1999</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kanezaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsushita</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nishida</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints</article-title>
          .
          <source>In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Klokov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lempitsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Escape from cells: Deep kd-networks for the recognition of 3d point cloud models</article-title>
          .
          <source>In: 2017 IEEE International Conference on Computer Vision</source>
          (ICCV) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Kozachenko</surname>, <given-names>L.F.</given-names></string-name>,
          <string-name><surname>Leonenko</surname>, <given-names>N.N.</given-names></string-name>:
          <article-title>Sample estimate of the entropy of a random vector</article-title>.
          <source>Probl. Peredachi Inf.</source>
          <volume>23</volume>, pp. <fpage>9</fpage>–<lpage>16</lpage> (<year>1987</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Maturana</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Voxnet: A 3d convolutional neural network for real-time object recognition</article-title>
          .
          <source>In: IEEE/RSJ International Conference on Intelligent Robots and Systems</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Qi</surname>, <given-names>C.R.</given-names></string-name>,
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Mo</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Guibas</surname>, <given-names>L.J.</given-names></string-name>:
          <article-title>PointNet: Deep learning on point sets for 3d classification and segmentation</article-title>.
          <source>In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE</source> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Qi</surname>, <given-names>C.R.</given-names></string-name>,
          <string-name><surname>Yi</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Guibas</surname>, <given-names>L.J.</given-names></string-name>:
          <article-title>PointNet++: Deep hierarchical feature learning on point sets in a metric space</article-title>.
          <source>Neural Information Processing Systems (NIPS)</source> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Qi</surname>, <given-names>C.R.</given-names></string-name>,
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Nießner</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Dai</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Yan</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Guibas</surname>, <given-names>L.</given-names></string-name>:
          <article-title>Volumetric and multiview cnns for object classification on 3d data</article-title>.
          <source>In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE</source> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Maji</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Kalogerakis</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Learned-Miller</surname>, <given-names>E.G.</given-names></string-name>:
          <article-title>Multi-view convolutional neural networks for 3d shape recognition</article-title>.
          <source>In: Proc. ICCV</source> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Wu</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Song</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Khosla</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Yu</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Zhang</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Tang</surname>, <given-names>X.</given-names></string-name>,
          <string-name><surname>Xiao</surname>, <given-names>J.</given-names></string-name>:
          <article-title>3d shapenets: A deep representation for volumetric shapes</article-title>.
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR</source> (<year>2015</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>