<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PointNet with Spin Images</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Department of Software and Computer Science Education</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>96</lpage>
      <abstract>
<p>Machine learning on 3D point clouds is challenging due to the absence of a natural ordering of the points. PointNet is a neural network architecture capable of processing such unordered point sets directly, which has achieved promising results on classification and segmentation tasks. We explore methods of utilizing point neighborhood features within PointNet and their impact on classification performance. We propose neural models that operate on point clouds accompanied by point features. The results of our experiments suggest that traditional spin image representations of point neighborhoods can improve the classification effectiveness of PointNet on datasets comprised of objects that are not aligned into canonical orientation. Furthermore, we introduce a feature-based alternative to the spatial transformer, which is a sub-network of PointNet responsible for aligning misaligned objects into canonical orientation. Additional experiments demonstrate that the alternative might be competitive with the spatial transformer on challenging datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Machine analysis of 3D geometrical data is becoming an important area of
research because of the increasing demand from applications such as autonomous
driving. Thanks to the advances in the development of depth sensors, large
amounts of such data are publicly available, which makes development and
employment of data-oriented algorithms more accessible.</p>
      <p>Convolutional neural networks (CNNs) have established state-of-the-art
results in computer vision tasks such as image classification, but their application
to tasks involving 3D data remains a problem. CNNs rely on regular grid
representations that are very memory demanding and computationally expensive to
process in 3D. CNNs have already been utilized on voxel data, but even with
optimizations like hierarchical octrees, this solution is limited to grids of resolution
256³ and will probably be very difficult to scale to finer resolutions.</p>
      <p>Point cloud representation is an appealing alternative to voxel representation for
several reasons. Data sparsity is naturally reflected in the point cloud
representation, which is typically much more concise than the voxel representation.
There is no trade-off between precision and memory demands like in the case
of the voxel representation, and a point cloud can capture an arbitrary level of detail.
Point clouds are also close to the raw measurements of sensors such as LiDAR or RGBD
cameras. Automatic machine analysis of the point cloud representation is, however,
challenging, mainly because the points of a point cloud have no ordering, so any
permutation of the points represents the same point cloud.</p>
      <p>
        PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a neural network architecture designed to process point cloud
representations directly. It obtains a hidden representation of each input by
independently processing each point with a Multi-Layer Perceptron (MLP). Those
representations are then aggregated by maximum pooling to obtain a
permutation-invariant representation. PointNet provided a considerable boost in the
computational efficiency of 3D object classification while keeping up in terms of
classification performance with other state-of-the-art approaches. Furthermore,
the model is also straightforwardly applicable to other useful tasks involving 3D
data, such as the task of point cloud segmentation. PointNet effectively
samples the 3D domain via so-called point functions. But, unlike e.g. voxelization,
it works in an efficient and data-dependent way. Unfortunately, when used on
objects appearing in an arbitrary orientation, the effectiveness of sampling the
3D domain seems limited, as the number of locations in which the points can be
located is greatly increased.
      </p>
      <p>
        The authors utilize spatial transformer network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in order to deal with this
issue, but aligning point clouds to a canonical orientation is a difficult task, which
would itself require recognition of object classes in some cases. Furthermore, the
spatial transformer itself relies on a PointNet within PointNet, so the alignment
capabilities of the spatial transformer share the limitations of PointNet.
      </p>
      <p>
        It seems intuitive that additional local information extracted from point
neighborhoods could be beneficial for classification, especially in the case when
point clouds are not aligned to a canonical orientation. A successor of PointNet
called PointNet++ [
called PointNet++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was introduced in order to add the capability of utilizing
local neighborhood features to PointNet by applying a small PointNet on
point neighborhoods and repeating the process on gradually higher-dimensional
point clouds.
      </p>
      <p>In this paper, we follow the direction of PointNet++ towards adding local
point features into PointNet. We focus on tasks in which the input objects
are not aligned into canonical position. We develop models based on rotation-invariant
point features and PointNet. Several experiments were conducted in
order to compare our models with the PointNet baselines. Our model achieves
comparable classification performance on datasets with manually aligned objects
and noticeably better performance on datasets in which objects are oriented
arbitrarily. We also propose a simple feature-based heuristic for point cloud
alignment in the form of a neural network layer, and we empirically show that
our heuristic can be more effective than the spatial transformer in certain cases.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        There are several ways of applying machine learning to 3D point clouds that are
currently actively researched. One common way is to transform the point cloud
representation to voxel grid representation, which can be processed by 3D CNNs
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Scaling these methods to the classification of complex objects which
require a fine level of detail to be distinguished is, nevertheless, difficult due to the
inherent trade-off between manageable memory demands and admissible loss of
information.
      </p>
      <p>
        Sequences of images obtained by rendering the point cloud representation
from different view-points are another grid-based representation, which can be
processed naturally by 2D CNNs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These approaches have
established state-of-the-art results on classification benchmarks. A limitation of these
methods is the difficulty of extending them to different tasks such as point cloud
segmentation. Point cloud representations are also in principle capable of
capturing more complex data than surfaces, and such data would be difficult to
render into images without potentially losing important information.
      </p>
      <p>
        Point clouds can also be processed directly by several recent models.
PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] applies a neural network on every input coordinate of the point cloud
independently and extracts a permutation invariant representation by applying
global pooling. Spatial transformer is applied on the input coordinates to deal
with variance of input orientations and is also applied on the hidden
representations. From the reported results, it is, however, not clear how the model would
perform on datasets with objects of highly varying pose. PointNet++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] utilizes
small PointNet networks on point neighborhoods across several scales in order to
introduce local point features into the original architecture. Such local features are
powerful, since they are learned from the point cloud data directly, but they are not
invariant under rotations, which might cause a decrease in classification
performance on unaligned data. Kd-Net [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] allows convolution-like processing of point
clouds by building a balanced kd-tree and then following a bottom-up traversal
of the tree, applying a learned affine transformation and a non-linearity to the
features contained in the child nodes of each parent node. Kd-Net is also not
invariant under rotations and could also potentially benefit from rotation-invariant
local features.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>
        Our work extends PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by using local point features in a way that is
similar to PointNet++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We focus on features which are rotation invariant,
and we investigate whether such features have a positive impact on classification of
unaligned data. In this section, we describe the feature extraction techniques and our
method for aligning point clouds. Section 5 describes the exact models derived
from the methods of this section.
      </p>
      <sec id="sec-3-1">
        <title>Spin Images</title>
        <p>
          There is a plethora of descriptors invariant under rigid transformations which
could be incorporated into PointNet. In this work, we opt for spin images [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
primarily because the spin image representation can be straightforwardly
processed by empirically successful CNNs. We leave the investigation of other
descriptors for future work. We briefly summarize the spin images technique here.
[Figure 1: spin image coordinates. A point x in the neighborhood of p, which has normal n, is described by its coordinates (α, β).]
        </p>
        <p>
          In order to extract a spin image of a neighborhood around a point p ∈ ℝ³,
knowledge of a normal vector n ∈ ℝ³ associated with p is required. For
classification of point clouds that represent surfaces, this is not a very restrictive
assumption, since normal vectors can be estimated from the eigenvalue
decomposition of local covariance matrices [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
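<p>As a concrete illustration of this estimation step, the following sketch (our own illustrative code, not the referenced implementation; the neighborhood size k is an assumption) computes a normal for each point as the eigenvector of the local covariance matrix with the smallest eigenvalue:</p>
<preformat>
```python
import numpy as np

def estimate_normals(points, k=16):
    """Estimate unit normals as the eigenvector of the local covariance
    matrix with the smallest eigenvalue."""
    n = len(points)
    normals = np.empty_like(points)
    # pairwise squared distances (fine for small clouds; use a k-d tree at scale)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    for i in range(n):
        nb = points[knn[i]]
        cov = np.cov(nb.T)
        w, v = np.linalg.eigh(cov)  # eigenvalues in ascending order
        normals[i] = v[:, 0]        # direction of least variance
    return normals
```
</preformat>
<p>Note that the sign of each normal is ambiguous; consistent orientation would require additional propagation as in the cited reconstruction method.</p>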
        <p>Given the input points of a neighborhood around p, every input point x is
projected to the new coordinates (α, β) indicated in Figure 1 and accumulated into
a two-dimensional histogram. If the points carry additional information in the
form of a vector, such as a color, the vector can also be accumulated into the bins,
for example by addition. Spin images have appealing properties. Their
descriptiveness is easily adjusted by changing the histogram resolution. They can also
be made local or global point cloud descriptors by changing the size of the
point neighborhood.</p>
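<p>A minimal sketch of the accumulation just described (our own illustrative code; the exact bin layout and support are assumptions, not the paper's implementation):</p>
<preformat>
```python
import numpy as np

def spin_image(p, n, points, size=32, radius=1.0):
    """Accumulate neighbors of p into a (size x size) spin-image histogram.
    alpha: distance from the axis through p along n; beta: height along n."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                    # signed height along the normal
    alpha = np.sqrt(np.maximum((d * d).sum(1) - beta ** 2, 0.0))
    mask = (alpha <= radius) & (np.abs(beta) <= radius)
    img, _, _ = np.histogram2d(alpha[mask], beta[mask],
                               bins=size,
                               range=[[0, radius], [-radius, radius]])
    return img
```
</preformat>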
      </sec>
      <sec id="sec-3-2">
        <title>Spin Coordinates</title>
        <p>The spin image coordinate transformation also provides a straightforward way to
make PointNet++ features invariant under rotation, simply by applying the
transformation to local point clouds before they are processed by the local PointNets of
PointNet++. This is essentially equivalent to forcing the point functions learned
by PointNet to be axially symmetric around the local normal vector. We will
refer to these features as the spin coordinates.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Orientation Alignment Layer</title>
        <p>For a given object, in the form of a point cloud and corresponding point features,
if we were able to select points accompanied by distinctive features,
then objects of the same class could be approximately aligned in a coordinate
system based on these points. Based on this idea, we designed a
simple heuristic algorithm, which we call the orientation alignment layer (Algorithm
1). The algorithm is easily extensible to the problem of pose alignment, but
we will only consider orientation alignment for simplicity (see Section 6.1 for clarification of
pose and orientation). Algorithm 1 rotates an input point cloud so that points
with selected features are positioned in the direction of canonically chosen
orthogonal vectors.</p>
        <p>Algorithm 1 Orientation Alignment Layer(X)
Input: X = (x_i)_{i=1}^n where x_i ∈ ℝ^{3+d}  ▷ sequence of coordinates and features
Output: sequence of n rotated points from X</p>
        <p>Let (c_i)_{i=1}^n, c_i = (x_{i,1}, x_{i,2}, x_{i,3}), x_i ∈ X  ▷ sequence of coordinates
Let (f_i)_{i=1}^n, f_i = (x_{i,4}, x_{i,5}, …, x_{i,3+d}), x_i ∈ X  ▷ sequence of features
i, j ← Feature Selection Heuristic((f_i)_{i=1}^n, 2)  ▷ Algorithm 2
Let x, y ∈ ℝ³ be two orthogonal unit vectors chosen canonically
R₁ ← the rotation matrix such that R₁ c_i/‖c_i‖ = x
v ← R₁ c_j
v ← v − (v·x)x
R₂ ← the rotation matrix such that R₂ v/‖v‖ = y
return (R₂R₁c_i, f_i)_{i=1}^n</p>
        <p>Features that are common within a class but uncommon within an individual
point cloud could be good candidates for the selection. Selection of features that
are frequent within a class provides a consistent orientation of objects within the
same class. Furthermore, selection of features that are unique within a point
cloud provides robustness when multiple good candidates are present in the
point cloud. These rather abstract qualities are, however, not straightforward to
define and compute quantitatively.</p>
        <p>We have chosen a simple heuristic approach to the feature selection, described
by Algorithm 2. We do not have a satisfactory justification of the heuristic, but it
seems intuitive that selecting the features with maximal entries could provide at
least somewhat consistent selection. Besides, when the heuristic is applied within
hierarchical PointNets, which apply max pooling of local features, we assume that
the maximal features are likely to be important for classification.</p>
        <p>Algorithm 2 Feature Selection Heuristic(F, k)
Input: F = (f_i)_{i=1}^n where f_i ∈ ℝ^d  ▷ sequence of features
k = number of features to be selected
Output: k integer indices of selected feature vectors</p>
        <p>F′ ← (f′_i)_{i=1}^n, f′_i = max f_i  ▷ maximum entries of features
return indices of the k largest elements of F′</p>
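<p>To make Algorithms 1 and 2 concrete, here is an illustrative NumPy sketch (our own code; the choice of canonical axes and the Rodrigues construction of the rotation matrices are our assumptions, as the algorithm only requires that such matrices exist):</p>
<preformat>
```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def rotation_to(a, b):
    """Rotation matrix R with R a = b for unit vectors a, b (Rodrigues' formula)."""
    v, c = np.cross(a, b), float(a @ b)
    if np.isclose(c, -1.0):  # antiparallel: rotate by pi about any orthogonal axis
        u = np.cross(a, np.eye(3)[np.argmin(np.abs(a))])
        u /= np.linalg.norm(u)
        return 2.0 * np.outer(u, u) - np.eye(3)
    K = skew(v)
    return np.eye(3) + K + K @ K / (1.0 + c)

def feature_selection(F, k=2):
    """Algorithm 2: indices of the k feature vectors with the largest maximum entry."""
    return np.argsort(F.max(axis=1))[::-1][:k]

def orientation_alignment(C, F):
    """Algorithm 1: rotate coordinates C so the two selected points line up
    with the canonical axes x and y."""
    i, j = feature_selection(F, 2)
    x, y = np.eye(3)[0], np.eye(3)[1]
    R1 = rotation_to(C[i] / np.linalg.norm(C[i]), x)
    v = R1 @ C[j]
    v = v - (v @ x) * x                     # project out the x component
    R2 = rotation_to(v / np.linalg.norm(v), y)
    return C @ (R2 @ R1).T, F
```
</preformat>
<p>After alignment, the first selected point lies on the x axis and the second lies in the xy half-plane with positive y, regardless of the input orientation.</p>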
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Datasets</title>
      <p>Our experiments were based on datasets which are described in this section.</p>
      <sec id="sec-4-1">
        <title>ModelNet</title>
        <p>
          The Princeton ModelNet dataset [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] has two variants: ModelNet10 which
contains 4899 objects of 10 categories and ModelNet40 which contains 12311 objects
of 40 categories. We use point clouds consisting of 1024 points extracted from
the original CAD models by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In the case of ModelNet10, the individual
objects are manually aligned (each object has the identical pose). The objects are
centered and scaled so that each object fits into the unit ball. We use the original
train/test splits consisting of 3991/908 objects from ModelNet10 and 9843/2468
objects from ModelNet40. We further split the train partitions for the purpose
of validation.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Augmented ModelNet10</title>
        <p>We prepared a challenging modification of the ModelNet10 dataset by replacing
each original object with two modified copies. Every object is subjected to a random
rotation of angle up to π. The objects are translated by a vector of random
direction and of random length drawn from the uniform distribution on [0, 0.25]. Additionally,
up to 3 cubes of random size and orientation are inserted into each point cloud,
so that they never intersect the original objects. The inserted cubes were
represented by 50 points, and their maximum size was 0.5 × 0.5 × 0.5.</p>
      </sec>
      <sec id="sec-4-3">
        <title>SHREC17</title>
        <p>SHREC17 is a subset of the ShapeNet dataset consisting of 51,162 triangle meshes of
objects. We use the provided 70%/10%/20% training/validation/test split for the
experiments. There are two variants of SHREC17: normal and perturbed.
Here, we use the perturbed dataset, where the objects are subjected to random
rotations. Point clouds are sampled from the provided triangle meshes by
sampling the triangles with probability proportional to their area and then sampling
the triangle surfaces uniformly, so that the obtained point clouds are consistent
with the ModelNet point clouds. We use 1024 sampled points for the classification
and additional feature engineering as required. For the methods that require
normal vectors, we calculate the normal vectors from the meshes rather than
from the sampled point clouds. It should be noted that there are both
inward-pointing and outward-pointing normal vectors in every mesh, which most likely
hinders the performance of some of the methods relying on the normal vectors
to some extent.</p>
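<p>The sampling procedure described above can be sketched as follows (our own illustrative code, not the authors' preprocessing script):</p>
<preformat>
```python
import numpy as np

def sample_mesh(vertices, faces, n=1024, rng=None):
    """Sample n points uniformly from a triangle mesh: pick triangles with
    probability proportional to area, then sample barycentric coordinates."""
    rng = rng or np.random.default_rng()
    tri = vertices[faces]                      # (m, 3, 3) triangle corners
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1                           # reflect to stay inside the triangle
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    t = tri[idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
```
</preformat>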
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Models</title>
      <p>In this section, we provide a description of every model that will be evaluated
in the next section. The model architectures were selected so that their sizes
would be roughly comparable in terms of the number of learnable parameters.
We did not fine-tune hyperparameters in this work, as we were mainly interested
in observing major differences between models, and we did not intend to achieve the
best performance.</p>
      <p>1 PointNet: A small PointNet model. The shared MLP part of the model
before the maximum pooling is formed of fully connected layers with 64,
64, 64, and 256 neurons. The MLP part of the model after the maximum pooling
consists of dropout with probability 0.2 and two fully connected layers with 512
and 128 neurons.</p>
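<p>The structure of Model 1 can be sketched as an untrained forward pass (our own NumPy illustration; dropout is omitted and the final classification layer is our assumption — the point is the shared per-point MLP followed by max pooling):</p>
<preformat>
```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

class TinyPointNet:
    """Forward-pass sketch of Model 1: a shared per-point MLP (64-64-64-256),
    max pooling over points, then an MLP head (512-128-classes).
    Weights are random here; a real model would train them."""
    def __init__(self, n_classes, rng=None):
        rng = rng or np.random.default_rng(0)
        sizes = [(3, 64), (64, 64), (64, 64), (64, 256)]    # shared MLP
        head = [(256, 512), (512, 128), (128, n_classes)]   # after pooling
        self.shared = [rng.normal(scale=0.1, size=s) for s in sizes]
        self.head = [rng.normal(scale=0.1, size=s) for s in head]

    def __call__(self, points):                 # points: (n, 3)
        h = points
        for W in self.shared:
            h = relu(h @ W)                     # applied independently per point
        g = h.max(axis=0)                       # permutation-invariant global feature
        for W in self.head[:-1]:
            g = relu(g @ W)
        return g @ self.head[-1]                # class logits
```
</preformat>
<p>Because the pooling is a maximum over points, the output is identical for any permutation of the input point set.</p>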
      <p>2 PointNetST: The same model as the previous PointNet model, but a
spatial transformer parametrized by linear or affine transformations is additionally
inserted as the first layer for the appropriate tasks, that is: the linear transformer
for the tasks where the input objects are possibly rotated but not translated, and
the affine transformer for the rest. The spatial transformer itself is a PointNet
consisting of layers with 32, 32, and 128 neurons before the maximum function
and then a single layer of 128 neurons.</p>
      <p>
        3 PointNetSTL: Re-implementation of the original PointNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with two
differences: we only use the first spatial transformer, and we do not utilize
batch-normalization layers.
      </p>
      <p>4 Spin Images: This model makes predictions based on spin images only and
does not utilize the point coordinates. Spin images are of size 32 × 32.
Thirty-two spin images of radius 1 are utilized. The 32 points are selected by the
farthest point sampling algorithm. The model consists of 3D convolutional layers with
32, 64 and 128 filters of size 1 × 3 × 3 followed by 1 × 2 × 2, 1 × 2 × 2 and
32 × 1 × 1 maximum pooling, respectively, followed by dropout with probability
0.2 and fully connected layers with 512 and 256 neurons.</p>
      <p>5 Hierarchical Spin Images: This model utilizes both the representations
obtained from the spin images and the coordinates. Spin images are of size 16 × 8.
Thirty-two spin images of radius 0.6 are utilized. Spin images are processed by
the same 3D CNN as in the previous model, then the representation is concatenated
with the point coordinates and fed into PointNet (Model 1).</p>
      <p>6 Hierarchical PointNet: Thirty-two point neighborhoods, each consisting
of 32 nearest points, are utilized. Each neighborhood is processed by a small
PointNet consisting of layers of 32, 32, and 32 neurons followed by max pooling
and 64 neurons. The extracted embedding is concatenated with point
coordinates and fed into PointNet (Model 1).</p>
      <p>7 Hierarchical PointNet Spin Coordinates: The same model as the
Hierarchical PointNet, but the local coordinates are rst transformed using the spin
image coordinate transformation.</p>
      <sec id="sec-5-1">
        <title>8 Hierarchical PointNet Orientation Alignment</title>
        <p>This is the same as the previous model, except that the orientation alignment layer (see Algorithm 1) is
additionally inserted after the concatenation of embeddings and coordinates.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>In this section, we describe the experiments that were carried out in order to
empirically compare the suggested models and features. In order to evaluate the
performance of the models, we measured classification accuracy on the official test sets
given in the respective benchmark tasks. We further split the original training
sets into two parts for training and validation. We evaluate the performance of the
models after each epoch of training on the validation partition of the data. The model with
the best performance on the validation data across all epochs is taken as the result of
the training. The categories are not balanced in terms of their frequencies, so
the data are split in a stratified manner, meaning that the frequencies of the categories
are the same in the training and validation parts. We use the Adam optimization
algorithm with parameters (α = 0.001, β₁ = 0.9, β₂ = 0.99) and batch size
128. The training strategy is adjusted to take the imbalanced categories into account by
filling batches in a way such that categories are uniformly distributed in each
batch. We apply L2-regularization of the network weights with λ = 0.0001.</p>
      <sec id="sec-6-1">
        <title>Robustness to Rotations</title>
        <p>Let us informally define notions about object orientation in order to clarify the
descriptions of the experiments in this section. We assume that every object has a
unique reference pose which is given by semantics. The reference pose is described
by a canonical coordinate system. Pose of an observed object is then the
coordinate system (taken w.r.t. the canonical system) in which the object is in its
reference pose. When we refer to orientation of an object, we mean the pose
of the object without translation element, i.e. the coordinate systems are zero
centered. We will also use the notion of orientation vector, by which we mean
a vector parallel to one canonically chosen axis of the orientation coordinate
system.</p>
        <p>With the following experiment, we tested robustness of PointNet against
rotations of point cloud objects. We augmented the ModelNet10 dataset by
rotating the objects from dataset randomly. Two rotated samples of each object
were placed into the augmented dataset instead of each original object. We
then performed 10-fold cross-validation on the augmented dataset to evaluate
classi cation accuracy. The models PointNet and PointNetST (see Model 1 and
2) were subject to the experiment.</p>
        <p>The rotation matrices used for rotating the objects were sampled in such a
way that the orientation vectors of all objects were distributed uniformly on a
cap of the unit sphere with the apex at the original orientation vector. The
tested maximal angles of the rotations were π/4, π/2, 3π/4, and π.</p>
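<p>Sampling orientation vectors uniformly on such a cap can be done by drawing the cosine of the polar angle uniformly (a sketch of one possible implementation, our own code):</p>
<preformat>
```python
import numpy as np

def sample_cap(theta_max, size=1, rng=None):
    """Unit vectors distributed uniformly on the spherical cap of half-angle
    theta_max around the z axis (apex at (0, 0, 1))."""
    rng = rng or np.random.default_rng()
    # the area element is d(cos theta) d(phi), so cos(theta) is uniform on the cap
    cos_t = rng.uniform(np.cos(theta_max), 1.0, size)
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    phi = rng.uniform(0.0, 2.0 * np.pi, size)
    return np.stack([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t], axis=1)
```
</preformat>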
        <p>Table 1 reveals that PointNet without the spatial transformer is quite robust
to rotations. Higher accuracy could be achieved with more augmentation
and further regularization techniques. Nevertheless, the decrease in accuracy is
noticeable. The spatial transformer clearly helps for rotations of small angles,
but does not seem to help for rotations of large angles, where the accuracy
is nearly the same for the PointNet and PointNetST models.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Entropy of Orientation Distributions</title>
        <p>The spatial transformer was designed to decrease the variance of the orientations of the point
cloud objects present in the data. It seems highly probable that an increase in
accuracy is correlated with a decrease in the variance of orientations, but if we wanted
to compare other mechanisms for decreasing orientation variance, it might be
better not to rely solely on accuracy, which might also be affected by other factors.
With access to the original orientation of each object, we can directly measure
how the variance of the orientations is affected by the transformations produced by
the spatial transformer or other techniques.</p>
        <p>For simplicity, we have only considered orientation vectors in this experiment,
even though an orientation vector v is not sufficient to fully describe the orientation
of an object in 3D, since the rotation component around v is left unspecified.
The unit orientation vectors of the objects can be viewed as samples from a
distribution X on the two-dimensional unit sphere S embedded in ℝ³, which we
will refer to as the orientation distribution. The differential entropy H(X)
of the distribution X with a probability density function f whose support is S
is defined as:</p>
        <p>
H(X) = − ∫_{x∈S} f(x) log(f(x)) dx   (1)
is a measure (not in the mathematical sense) of uncertainty of the distribution.
The lower the entropy of orientation distribution within a dataset is, the more
aligned the dataset is. By comparing entropy of the orientation distribution of
the augmented input data and the data transformed by the spatial transformer,
we can observe whether the spatial transformer performs alignment of the objects
or not. We do not have access to the probability density function directly for
computation of the entropy, but we can estimate the entropy from samples. We
chose the Kozachenko-Leonenko entropy estimator [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for the purpose because
it relies on pairwise distances of the samples, which can be computed trivially,
whereas other approaches to the problem, e.g. those relying on density estimation,
are not so straightforwardly applicable to spherical distributions.
        </p>
        <p>Let X = (x₁, x₂, …, x_n), x_i ∈ ℝ^d, be the samples from the distribution
subject to the entropy estimation. Let (d_i)_{i=1}^n be the distances of the samples
x_i to their k-th nearest neighbors; then the Kozachenko-Leonenko estimate can
be written as:</p>
        <p>Ĥ(X) = ψ(n) − ψ(k) + log(c) + (d/n) ∑_{i=1}^n log(d_i)   (2)
where ψ is the digamma function, and c is the volume of the unit ball, dependent
on the norm used to calculate the distances. In the case of X being a distribution
on the unit sphere, the distance is defined by the angle between samples, and
c = 2π(1 − cos 1) is the surface area of the spherical cap with the unit angle
between the apex and the edge.</p>
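<p>The estimator of Eq. (2), specialized to the sphere, can be sketched as follows (our own code; the digamma implementation is a standard recurrence-plus-asymptotic-series evaluation, included only to keep the sketch dependency-free):</p>
<preformat>
```python
import numpy as np

def digamma(x):
    """Digamma via recurrence plus an asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def kl_entropy_sphere(X, k=3):
    """Kozachenko-Leonenko entropy estimate for unit vectors on the 2-sphere,
    using the angle as the distance (Eq. 2 with d = 2, c = 2*pi*(1 - cos 1))."""
    n = len(X)
    G = np.clip(X @ X.T, -1.0, 1.0)
    A = np.arccos(G)                           # pairwise angles
    np.fill_diagonal(A, np.inf)
    dk = np.sort(A, axis=1)[:, k - 1]          # distance to the k-th nearest neighbor
    c = 2.0 * np.pi * (1.0 - np.cos(1.0))      # area of the unit-angle spherical cap
    d = 2.0                                    # intrinsic dimension of the sphere
    return digamma(n) - digamma(k) + np.log(c) + (d / n) * np.log(dk).sum()
```
</preformat>
<p>For a uniform orientation distribution the estimate should approach log(4π), the entropy of the uniform density on the sphere, which gives a simple sanity check.</p>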
        <p>We repeated the experiment from Section 6.1, and we estimated the differential
entropy of the orientation distributions of the data before and after
application of the transformations generated by the spatial transformer. Since the spatial
transformer is not restricted to produce only orthogonal transformations, we
normalized the transformed orientation vectors in order to obtain a spherical
distribution.</p>
        <p>Results of the entropy experiment are given in Table 2. By comparing these results
with the experiment from the previous section, summarized in Table 1, we can see that there
is a relation between accuracy and the differential entropy of the orientation distribution:
the spatial transformer was most helpful in the cases of maximum
angle up to π/2, which was also the case in this experiment. The experiment
suggests that the spatial transformer probably helps in certain cases, but it is likely
not a universal remedy for the problem of pose alignment. Careful adjustment of
the spatial transformer hyper-parameters might be needed in order to enjoy its
benefits. The orientation alignment layer performed better than the spatial transformer
on fully uniform rotations, with a more significant entropy reduction. A disadvantage
of the orientation alignment layer is that it performs in a way that is independent
of the input orientation distribution entropy by the nature of Algorithm 1, so
the entropy after transformation is the same for all tests.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Benchmarks</title>
        <p>On the datasets which are mostly aligned (the ModelNet datasets), the PointNet models
1-3 perform well, and the additional local features were not helpful for classification.
On the other two datasets, which are perturbed by rotations and, in the case of
Augmented ModelNet10, additionally by translations, Model 4 was superior
to the others, probably because of its invariance under rigid transformations.
Model 5 seems stable in the sense that it is never substantially worse than the
best model in each task, so it seems that a combination of rotation-invariant
features with absolute point coordinates is a promising direction.</p>
        <p>The performance of Model 8, which utilizes the orientation alignment layer,
was inferior in most cases, because the orientation alignment layer is harmful
when the data are well aligned. However, we see that in the SHREC17 task,
the presence of orientation alignment is beneficial compared to Models 6 and 7,
which indicates that a reduction of the orientation distribution entropy was achieved.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We have empirically demonstrated that PointNet can benefit from point
neighborhood features on classification tasks where objects represented by point clouds
may appear in arbitrary orientation. Spin images seem to be promising
candidates for point neighborhood features. The experiments also suggest that the spatial
transformer technique, employed by PointNet in order to deal with the problem
of object orientation alignment, may be difficult to utilize properly depending
on the data. We have also proposed a simple experiment to measure the quality of
alignment achieved by the spatial transformer in an interpretable manner on tasks where
the orientation of objects is known in advance. We have also introduced a simple
heuristic algorithm as an alternative to the spatial transformer, which we call the
orientation alignment layer. Further experiments suggest that the orientation
alignment layer might be able to achieve better quality of orientation alignment than
the spatial transformer on difficult data.</p>
      <p>In future work, we would like to design a better feature selection method
for our orientation alignment layer in order to make it more robust. We believe
that the spatial transformer could also benefit from local point features, and we
would like to investigate this idea. It would also be possible to combine the spatial
transformer and the orientation alignment layer into a single model. Finally, we intend
to compare spin images with other rotation-invariant features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Hoppe</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>DeRose</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Duchamp</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>McDonald</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Stuetzle</surname>, <given-names>W.</given-names></string-name>:
          <article-title>Surface reconstruction from unorganized points</article-title>.
          <source>SIGGRAPH Comput. Graph.</source>
          <volume>26</volume> (<year>1992</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Jaderberg</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Simonyan</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Zisserman</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Kavukcuoglu</surname>, <given-names>K.</given-names></string-name>:
          <article-title>Spatial transformer networks</article-title>.
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>28</volume>, pp. <fpage>2017</fpage>–<lpage>2025</lpage> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Johnson</surname>, <given-names>A.E.</given-names></string-name>,
          <string-name><surname>Hebert</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Using spin images for efficient object recognition in cluttered 3d scenes</article-title>.
          <source>In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)</source>,
          pp. <fpage>433</fpage>–<lpage>449</lpage> (<year>1999</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kanezaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsushita</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nishida</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints</article-title>
          .
          <source>In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Klokov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lempitsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Escape from cells: Deep kd-networks for the recognition of 3d point cloud models</article-title>
          .
          <source>In: 2017 IEEE International Conference on Computer Vision</source>
          (ICCV) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Kozachenko</surname>, <given-names>L.F.</given-names></string-name>,
          <string-name><surname>Leonenko</surname>, <given-names>N.N.</given-names></string-name>:
          <article-title>Sample estimate of the entropy of a random vector</article-title>.
          <source>Probl. Peredachi Inf.</source>
          <volume>23</volume>, pp. <fpage>9</fpage>–<lpage>16</lpage> (<year>1987</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Maturana</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Voxnet: A 3d convolutional neural network for real-time object recognition</article-title>
          .
          <source>In: IEEE/RSJ International Conference on Intelligent Robots and Systems</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Qi</surname>, <given-names>C.R.</given-names></string-name>,
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Mo</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Guibas</surname>, <given-names>L.J.</given-names></string-name>:
          <article-title>PointNet: Deep learning on point sets for 3d classification and segmentation</article-title>.
          <source>In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE</source> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Qi</surname>, <given-names>C.R.</given-names></string-name>,
          <string-name><surname>Yi</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Guibas</surname>, <given-names>L.J.</given-names></string-name>:
          <article-title>PointNet++: Deep hierarchical feature learning on point sets in a metric space</article-title>.
          <source>Neural Information Processing Systems (NIPS)</source> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Qi</surname>, <given-names>C.R.</given-names></string-name>,
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Nießner</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Dai</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Yan</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Guibas</surname>, <given-names>L.</given-names></string-name>:
          <article-title>Volumetric and multiview cnns for object classification on 3d data</article-title>.
          <source>In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE</source> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Su</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Maji</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Kalogerakis</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Learned-Miller</surname>, <given-names>E.G.</given-names></string-name>:
          <article-title>Multi-view convolutional neural networks for 3d shape recognition</article-title>.
          <source>In: Proc. ICCV</source> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Wu</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Song</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Khosla</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Yu</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Zhang</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Tang</surname>, <given-names>X.</given-names></string-name>,
          <string-name><surname>Xiao</surname>, <given-names>J.</given-names></string-name>:
          <article-title>3d shapenets: A deep representation for volumetric shapes</article-title>.
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR</source> (<year>2015</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>