=Paper=
{{Paper
|id=Vol-2282/EXAG_115
|storemode=property
|title=Towards 3D Neural Style Transfer
|pdfUrl=https://ceur-ws.org/Vol-2282/EXAG_115.pdf
|volume=Vol-2282
|authors=Jo Mazeika,Jim Whitehead
|dblpUrl=https://dblp.org/rec/conf/aiide/MazeikaW18
}}
==Towards 3D Neural Style Transfer==
Jo Mazeika, Jim Whitehead
Computer Science Department, UC Santa Cruz, Santa Cruz, CA 95064 USA
{jmazeika, ejw}@soe.ucsc.edu

===Abstract===

Neural Style Transfer was first unveiled by (Gatys, Ecker, and Bethge 2015), and since then has produced fantastic results working with 2D images. One logical extension of this field would be to move from 2D images into 3D models, and be able to transfer a notion of 3D style from one model to another. Here, we provide steps towards both understanding what style transfer in a 3D setting would look like, as well as demonstrating our own attempts towards one possible implementation.

===Introduction===

(Gatys, Ecker, and Bethge 2015) introduced a technique for transferring the style of one image onto another, exploiting properties of convolutional networks to extract the information required. Since then, the technique has been refined and expanded upon, with impressive results. A survey paper of the field (Jing et al. 2017), containing references up through March 2018, has over one hundred different papers listed. Because of this interest, it is natural to wonder where this technique could be applied to next. Here, we describe our attempts to apply the techniques underlying style transfer to the domain of 3D point cloud models.

At its core, 2D style transfer works by optimizing a noise vector to minimize a function describing its distance from both the style and content images at different layers within a neural network. In the original paper, the authors utilize VGG-19, a network trained to classify images in the ImageNet data set, and compare the floating-point activations that the images produce inside it. Once the network is chosen, the system's designer picks a particular set of layers of the network, and gets the values of the input images as well as the output image at each of the different layers. From there, the distances between those values are computed and turned into a single loss value, which is used to compute the gradient by which we transform the noise image, and we continue the process. Once a specified number of iterations is completed, the noise vector (which conveniently is chosen to be an N x M x 3 vector) can be interpreted as an N x M bitmap and rendered into an image using standard image libraries.
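To make this loop concrete, the sketch below shows one optimization step in TensorFlow. It is an illustration only, not the code of (Gatys, Ecker, and Bethge 2015) or of the system described here; the feature_model callable, the layer names, and the loss weights are placeholder assumptions.

```python
import tensorflow as tf

def gram_matrix(acts):
    # Flatten spatial positions and correlate channels: (..., H, W, C) -> (C, C).
    flat = tf.reshape(acts, (-1, acts.shape[-1]))
    return tf.matmul(flat, flat, transpose_a=True)

def style_transfer_step(feature_model, output_image, content_feats, style_grams,
                        content_layers, style_layers,
                        alpha=1.0, beta=1e-3, step_size=0.01):
    """One gradient step on the output image (a tf.Variable).

    feature_model is assumed to map an image tensor to a dict of layer
    activations (e.g. a wrapper around VGG-19); content_feats and style_grams
    are precomputed targets for the content and style inputs.
    """
    with tf.GradientTape() as tape:
        feats = feature_model(output_image)
        content_loss = tf.add_n([
            tf.reduce_mean(tf.square(feats[name] - content_feats[name]))
            for name in content_layers])
        style_loss = tf.add_n([
            tf.reduce_mean(tf.square(gram_matrix(feats[name]) - style_grams[name]))
            for name in style_layers])
        loss = alpha * content_loss + beta * style_loss
    grad = tape.gradient(loss, output_image)
    output_image.assign_sub(step_size * grad)  # plain gradient descent for clarity
    return loss
```

Repeating a step like this for a fixed number of iterations and then reading the optimized variable back out as a bitmap reproduces the procedure described above.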
To implement this for the 3D case, we first explored what style transfer in this domain would mean, given that 3D models carry a lot of information in their positional data. We chose an existing classification network for point clouds—an analog of the network chosen for the 2D case—and explored the different layers to identify which would be the best to use for style transfer. We additionally show the results of our experimentation with different 3D models.

3D style transfer would be useful as a design and ideation tool — by creating models that embody various styles, and a generic model, a designer could use a style transfer system to create versions of the original object in the different styles. These, while not perfect, would give said designer a springboard to create finalized versions of these 3D models. Additionally, having a system like this would also allow for interesting manipulation of scanned 3D spaces — since laser scanners produce point cloud models, being able to perform style transfer on the results could allow for interesting bespoke spaces.

In this paper, we provide the following contributions:
• An analysis of how to represent style in a 3D space
• An implementation of a style transfer system using a network designed for 3D models
• An analysis of our results and what future work will be required to make a system like this a reality.

===Related Work===

====Deep Learning and Point Clouds====

Deep learning systems operate natively over large arrays of numbers, which makes point clouds a natural way of encoding 3D models for neural processing — unlike polygonal models, point clouds exist as a simple list of 3D points in space. Additionally, several common techniques for analyzing real world 3D spaces map those spaces into point clouds, making them an important target to understand and process.

In our work here, we focused on examining PointNet (Qi et al. 2016), a network designed primarily to classify point clouds into several different object classes. Since its publication, it has seen a number of follow-up works, including one extension by the original authors (Qi et al. 2017) as well as others that extended it beyond preselected 3D models (Zhou and Tuzel 2017). PointNet is not the only system family for neural analysis of point clouds, however. It utilizes and benchmarks itself against the ModelNet dataset (Engelmann et al. 2017) of 40 different classes of point clouds for systems to distinguish between. ModelNet's website (http://modelnet.cs.princeton.edu) lists over thirty different systems and their relative accuracy rankings on this benchmark dataset.

====3D style systems====

While neural style transfer has not been implemented in this sense, there are a number of other systems that attempt to transfer the style of one 3D object onto another. For instance, (Zheng, Cohen-Or, and Mitra 2013; Lun et al. 2016) focus on handling style by breaking each object apart into individual components and reassembling them based on their structural similarity. In contrast, (Ribeiro et al. 2003) looks at style as a conceptual blend, using an external knowledge base instead of looking purely at the structural similarity to build the comparisons. (Hu et al. 2017) focuses on identifying decorative elements that convey stylistic features across an array of different objects.
(Ma et al. 2014) takes an analogy approach to style, starting from a set of example models (one initial model, with one structural variation and one stylistic variation) and then extrapolating those variations into a new model. In contrast, (Kalogerakis et al. 2012) learns a probabilistic model of an object class, allowing it to generate different models that exist within that space, and in that way defines a style of model.

===3D styles===

When we look at the results from (Gatys, Ecker, and Bethge 2015), we can quickly understand what aspects of an image the algorithm understands as being its style. These aspects include the color palette, the length and shape of the strokes used in painting, as well as some parts of the content - an image generated from van Gogh's Starry Night retains its bright stars in various portions of the sky. The other image - the content image - provides most of the underlying structure and direction of the image itself; the geometry of that image is preserved and we see that image as "stylized" in the style of the other.

However, for 3D models, the question of how to modify one model to match the style of another is non-trivial, especially given that it is unclear what a 3D model is comprised of in general. A 3D model's geometry could be constructed from a number of triangles or from a point cloud, to pick two examples. Then, some models feature textures while others don't. With all of these differences in how 3D models are encoded, it is unclear how we would transfer style between two sufficiently different models.

Furthermore, even if we have two similarly structured models, we need to figure out how to transfer styles between the two. In this paper, our system was implemented to follow (Gatys, Ecker, and Bethge 2015) fairly directly; however, this is only one way that the style of a 3D model could be interpreted.
Given the nature of 3D models, making adjustments to their structure - moving a few things here or there, adjusting the size of a small part of the model, etc. - can have a large impact on how the model is interpreted by an observer. Additionally, 3D models often have critical, functional parts that impact how they are perceived—an airplane without wings could be nearly impossible to identify as an airplane without some other strong 'airplane-like' features.

Given that our models were point clouds with no texture information, the only changes we could attempt to make were modifications to the positions of the different points in 3D space. This limited our possible options for stylistic features to consider during the transfer process; however, we came up with several different visions of what 3D style transfer could look like.

First, we have the conceptually simplest version, which we call exemplar style transfer. In this, the target object is modified to look like the style object, but purely on a cosmetic level, without interfering with its functionality. For instance, an umbrella could be blended with a sword by transforming the handle of the umbrella into the hilt of the sword, or a glass mug could be molded to look like any number of existing objects. In this way, the target object resembles the style, but maintains all of its own properties.

Secondly, we have the converse of the above, which we call functional part transfer. Here, instead of transferring the cosmetic features of the style object, we instead transfer the functional parts of said object. For instance, we could take a skateboard model and add strings and frets from a guitar along its body as our way of blending the two objects. Again, the target object still maintains most of its own identity, but takes on features of the style object to produce the blend.

Next, we have mix-in style transfer, where the two models are joined together, creating a model that features pieces of both attached together. For instance, we could have the blend of an airplane and an apple where the airplane has a stem coming out of it, or the apple has wings and the tail of the airplane.

Another possibility is part-blend style transfer, where the parts of one model are made out of similar parts of the other. In this way, we focus on transferring the local aspects of style, while keeping the broader model's structure fundamentally the same. One example of this would be Lego models — while the broader structure of the model is kept, the local features of the model must now conform to Lego bricks.

Our final concept for style transfer would be to create an abstract style definition and apply that directly. This is probably the most complex approach to implement, as it requires the system to be able to translate the abstract style information into the transformations that the model should undergo; however, it does allow for the most control over what elements of style are actually transferred onto the new model. This is most similar to the work on Lego models done in (Mazeika and Whitehead 2017).

These are not meant to be conclusive; rather, they form a possibility space of how 3D style transfer could occur. For the purposes of aligning with the 2D case, we chose to focus our system on the mix-in style, using the neural networks to find the relationships between the models and their parts.

Figure 1: A style transfer system diagram. The center box represents the PointNet network, which is comprised of several intermediate stages (a more detailed diagram can be found in (Qi et al. 2016)). We select particular layers from the network to use to compute our loss function.

===Style Transfer Implementation===

For our system, we utilized PointNet, an existing network designed to classify 3D models into different categories. In the original style transfer paper, the authors chose VGG-19, a network for classifying 2D images, and so we chose an analogous network for our system. PointNet operates in two different modes — one that classifies point clouds into different classes, and one that learns to segment point clouds, labeling each point of a cloud with a domain-specific label (i.e. the wings versus the body of an airplane, or the wheels versus the handlebars of a motorcycle). PointNet's classification network is structurally similar to VGG-19, comprised of multiple convolutional layers, but it also includes two particular layers: one that learns a three-by-three matrix product (intended to account for rotations in the model) and one that is intended to learn a permutation function (as point cloud data is invariant to the input order).
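As a rough illustration of the classification branch described above, a PointNet-style network applies the same small MLP to every point and then collapses the point dimension with a symmetric max-pooling operation. The Keras sketch below omits the learned three-by-three input transform and uses illustrative layer widths; it is not the published PointNet architecture or the exact network trained for this work.

```python
from tensorflow import keras
from tensorflow.keras import layers

def pointnet_like_classifier(num_points=1024, num_classes=40):
    """A PointNet-flavored sketch: per-point shared MLP plus symmetric max pool.

    Conv1D with kernel size 1 applies the same dense transform to every point,
    and GlobalMaxPooling1D makes the aggregated feature independent of point order.
    """
    inputs = keras.Input(shape=(num_points, 3))
    x = layers.Conv1D(64, 1, activation="relu")(inputs)
    x = layers.Conv1D(128, 1, activation="relu")(x)
    x = layers.Conv1D(1024, 1, activation="relu")(x)   # per-point features
    x = layers.GlobalMaxPooling1D()(x)                 # order-invariant aggregation
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Intermediate activations from layers like these are what the loss functions described below compare.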
The segmentation network in PointNet uses the classification network as its basis, and adds an extra four layers for producing the output labels. This version of PointNet has a major drawback for our purposes: it must be trained on each individual class of model, rather than on all classes at once. While that means that we can't use a trained segmentation network for general style transfer, it makes sense that we could use it as a tool for modeling individual classes of objects.

Once we have a fully trained model, we then need to pick the set of layers to consider for computing the loss function. When a model is evaluated by the network, it builds more abstract representations of the model at each of the subsequent layers. In the original style transfer system, the authors considered earlier layers for the structure and later layers for the style of the various images. Here, we use a similar approach for picking the layers to use for our system, and show the results of exploring this space later.

====Loss Functions====

Once the layers have been chosen, it then falls to pick a loss function to evaluate how far the generated image is from the inputs on those particular layers. To do this, the 2D system uses the square-mean distance between the layers for the content image and the sum of the differences between the Gram matrices of the style image and the output.

For our 3D version, we investigated using these loss functions, but also other ones found in our exploration of metrics designed for point clouds. To this end, we included the Hausdorff distance and the Chamfer distance as metrics to consider for our loss functions. The Hausdorff distance is defined in our system as

d_H(X, Y) = \max\left\{ \max_{x \in X} \min_{y \in Y} d(x, y),\ \max_{y \in Y} \min_{x \in X} d(x, y) \right\}

where d(x, y) is the Euclidean distance between the vectors x and y. Similarly, we define the Chamfer distance in our system as

d_C(X, Y) = \sum_{x \in X} \min_{y \in Y} d(x, y) + \sum_{y \in Y} \min_{x \in X} d(x, y)

using the notation from above. In English, the Hausdorff distance looks at the minimum distance from each vector to any vector in the other set and returns the overall maximum of these values, while the Chamfer distance gives the sum of these minimal distances instead.

Both of these metrics look for outliers within the space — the Hausdorff distance is maximized when a single vector is far away from the ones in the other set, while Chamfer looks more at the average distance for all of the vectors in both sets. Importantly, they are also both differentiable, which is a strict requirement for our loss functions.
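A direct (if memory-hungry) way to compute both distances on small clouds is sketched below in TensorFlow, which keeps them differentiable as required; this is one straightforward reading of the definitions above, not the code used for the experiments.

```python
import tensorflow as tf

def pairwise_distances(X, Y):
    # Euclidean distance between every x in X (n, 3) and y in Y (m, 3) -> (n, m).
    diff = tf.expand_dims(X, 1) - tf.expand_dims(Y, 0)
    return tf.norm(diff, axis=-1)

def hausdorff_distance(X, Y):
    d = pairwise_distances(X, Y)
    return tf.maximum(tf.reduce_max(tf.reduce_min(d, axis=1)),   # max over x of min over y
                      tf.reduce_max(tf.reduce_min(d, axis=0)))   # max over y of min over x

def chamfer_distance(X, Y):
    d = pairwise_distances(X, Y)
    return (tf.reduce_sum(tf.reduce_min(d, axis=1)) +
            tf.reduce_sum(tf.reduce_min(d, axis=0)))
```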
Finally, one of the key components of any system that hopes to combine multiple components into a single value is weighting. Since different functions over different layers can produce values on wildly different orders of magnitude, we normalize the values both to balance the different components against each other and to provide some bias towards either the structure or the style.

====Implementation====

While PointNet is publicly published in Tensorflow (https://github.com/charlesq34/pointnet), we chose to implement our system in Keras instead, due to familiarity. We used an existing Keras implementation (https://github.com/garyloveavocado/pointnet-keras) as a reference, and we used the style transfer code provided in (Chollet 2017) as the starting point and reference for our implementation. Our reimplementation required us to train PointNet ourselves, and we did so using the ModelNet data set provided by (Wu et al. 2015). For the labeled segmentation data, we used the data set provided by (Yi et al. 2016). Our overall accuracy results for training were comparable with the original paper.

===Results===

====Exploration of Layers and Loss Functions====

As we began our exploration of style transfer for 3D models, we started by optimizing our input against a single layer to see how the different layers respond to different loss functions, hoping to identify layers that correspond to stylistic or structural features to optimize for. Our intuition comes from the 2D example, where we can extrapolate what the individual layers have learned by optimizing for them directly, as shown in (Rupprecht 2017).

We considered all of the convolutional layers of the classification model of PointNet and used a fixed random noise input with the Squared Sum, Gram Matrix, Hausdorff and Chamfer loss functions.

However, most of our layer-function pairs led to fuzzy noise-spheres, as seen in Figure 2. On the other hand, we did have a few pairs that led to interesting results. Some of the lower layers simply produced a twisted version of the original model (see Figure 3). Here, the model is vastly deformed and rotated upside down, which we took as a possible candidate to consider. And, finally, a few layer-function pairs simply reproduced the initial model instead, with some small variations due to the fuzziness of the optimization process.

Figure 2: The common result of optimizing on arbitrary layer-function pairs.

Figure 3: A model and its twisted counterpart (3rd convolutional layer using the Gram loss function).

This was disappointing to see; while we could reproduce the initial model in the very early layers, we had hoped to see the noise clouds showing abstract features of the particular model class. This may have resulted from our choice of network — since PointNet attempts to learn how models are structured, regardless of how the points are arranged and permuted, it could be the case that the abstract features are being represented in a way that is imperceptible to humans. In the 2D case, we're able to see patterns and variations between different images, as those appear as variations in color, but the 3D case solely considers the positions of the different points, meaning that there might be relations being learned that are not actually the ones we want to express.

====Classification Model Results====

With our results from the previous exploratory work, we attempted to blend models with one of the layer-loss pairs that simply reproduced the initial model; specifically the Chamfer loss on the third convolutional layer of the classification model. One of the key features of style transfer is balancing the different loss values against each other — most of the 2D systems feature a weighting system in which the different sides of the function (the similarity to the style and the similarity to the initial content) are balanced against each other to get the desired blend.

To this end, we took two models and blended them at different ratios between the different sides — we fixed the content weight at 1, and scaled the style weights through different powers of 2. We used the third convolutional layer with Chamfer loss for the content and Gram loss for the style as the basis for our loss function. While the Chamfer loss reproduced the original model at that layer, the Gram loss created a twisted version of the model, as seen in Figure 3. Here, we hypothesized that the twistedness was due to having a slightly more abstract understanding of the original, and that we would see this conveyed in the style transfer process.
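In sketch form, the blending objective just described combines a Chamfer content term and a Gram-matrix style term at a single PointNet layer, with the style weight swept over powers of 2. The code below reuses chamfer_distance and gram_matrix from the earlier sketches and assumes a hypothetical pointnet_layer callable that returns that layer's activations; it illustrates the shape of the loss rather than the exact implementation.

```python
import tensorflow as tf

def blend_loss(output_points, content_layer_feats, style_gram, pointnet_layer,
               style_weight, content_weight=1.0):
    # Activations of the chosen convolutional layer for the current output cloud.
    feats = pointnet_layer(output_points)
    # Content term: Chamfer distance (earlier sketch) to the content model's activations.
    content_term = chamfer_distance(feats, content_layer_feats)
    # Style term: distance between Gram matrices (earlier sketch) of output and style model.
    style_term = tf.reduce_mean(tf.square(gram_matrix(feats) - style_gram))
    return content_weight * content_term + style_weight * style_term

# Style weights swept through powers of 2 (2^-4 up to 2^15), content weight fixed at 1.
style_weights = [2.0 ** k for k in range(-4, 16)]
```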
The results of blending an airplane model and a model of a woman in a dress are shown in Figure 6. At the extreme values (2^15 and 2^-4), our system effectively optimizes for one model or the other, as the loss value is dominated by the output's distance from that model. In the middle, we see blends of the two models; however, this occurs merely on the level of the points' actual distances from each other — no abstract qualities are carried between the two models.

One of the clear qualities here is that the blends are contained in, effectively, the intersection in space between the two models. This suggests that one of the big issues here is orientation, since the airplane lies flat while the person is standing upright. Additionally, using other layers turns the output into a fuzzball, so this proved to be a dead end.

Figure 6: Two models (airplane and woman in a dress) blended at different ratios of style weight and content weight. Labeled values are the style weight compared to a content weight of 1.

====Segmentation Model Results====

Finally, we attempted to utilize the segmentation version of PointNet to transfer the underlying model class's style onto an unrelated model. To do so, we first trained the network to perform segmentation on airplanes. We then took a motorcycle model (seen in Figure 4) and assigned each one of its labels a particular label from the airplane label set (i.e., the body of the motorcycle corresponded with the body of the airplane; the wheels corresponded with the engines, etc.).

Figure 4: A ground-truth segmentation for a motorcycle model.

From here, we built our loss function to optimize for the original structure of the motorcycle against each point receiving its correct label under the classification system. The intent here was to create a new motorcycle such that each one of its components was interpreted as part of an airplane.
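A hedged sketch of that objective is given below: a structure term (Chamfer distance to the original motorcycle points, reusing the earlier sketch) traded off against a term that pushes each point toward its assigned airplane part label. The segmentation_model callable, the use of cross-entropy for the label term, and the weighting are assumptions for illustration, not the exact loss used in these experiments.

```python
import tensorflow as tf

def segmentation_transfer_loss(output_points, original_points, target_labels,
                               segmentation_model, style_weight, content_weight=1.0):
    # Structure term: keep the output close to the original motorcycle geometry
    # (chamfer_distance is the function from the earlier sketch).
    structure_term = chamfer_distance(output_points, original_points)
    # Label term: per-point probabilities over airplane part labels, scored against
    # the airplane label each motorcycle point was mapped to (body, engines, ...).
    probs = segmentation_model(output_points)          # (num_points, num_part_labels)
    label_term = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(target_labels, probs))
    return content_weight * structure_term + style_weight * label_term
```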
Unfortunately, again, the results were suboptimal. We again tested various weights of style and content, as seen in Figure 5, but optimizing for style merely leads to the model expanding and not, as hoped, being reshaped.

Figure 5: The results of segmentation style transfer on the motorcycle model with different weights.

===Discussion===

Since our experiments produced negative results, the question then becomes "why?" — what led to our results, and what can we learn from these experiments? Fundamentally, there are two high-level cases for these failures: either the system had some errors of design, or there are theoretical factors that prevent this approach from working at all.

Fundamentally, style transfer in the 2D domain relies on being able to detect the edges (for the content image) and the color palette (for the style image) of an image in a way that can be optimized for. Doing so requires the system to examine relationships between spatially related pixels, and then shift a noise vector until it grows closer and closer to these features. This works well for images, since the images are represented as a 2D spatial matrix (with a third dimension representing color values). On the other hand, the point clouds are merely a list of different points in space. While PointNet itself attempts to compensate for the input order (by learning a reordering function about halfway through the system), this may cause the issues with the loss function.

Additionally, one of the common stylistic features of the 2D style transfer images (independent of the style image itself) is a certain amount of fuzziness in the output—edges often have a certain amount of fuzziness to them, a result of the optimization process. But, because of the overall shape of the output and the color patches included, humans are still able to recognize the contents of the image without much issue. However, in our 3D case there is no color channel that we can rely on for context. As such, the only information we have to work with in the system are the locations of the different points. Because of this, the results are highly sensitive to points being moved around, which means that the fuzziness that results from the style transfer can lead to results that are impossible for humans to interpret correctly.

For neural style transfer to be possible for 3D point clouds, several adjustments would need to be made. First of all, a different neural network would likely need to be considered. One of the key aspects of PointNet is that the network tries to learn a permutation function that allows it to detect the models correctly regardless of the order their points are in. This factor may be part of the issue we have in producing visible results, and other networks on point clouds may feature clearer results.

Secondly, a different set of loss functions would need to be considered. While we tried to include loss functions relevant to point clouds in general, it might be the case that others exist that would work better with the particular network or with point clouds in general, and that other functions would have produced intelligible results. We chose our functions based on metrics that were used previously in style transfer and known functions for examining point clouds, and ran an exhaustive search over them.

Finally, in this work, we only considered one layer at a time — most of the existing work on style transfer looks at multiple layers at once and averages the loss between all of them. This provides more consistent results, and the generated images benefit from these multilayer views. However, in the 2D case, it is clear what sorts of features are captured by the different layers of VGG-19; our attempts to visualize the PointNet network were inconclusive. As we ran an exhaustive set of experiments over the layer-loss pairs, it is unlikely that we missed any layer that would drastically change our results, and reduplicating layers is the equivalent of doubling the value of the weight.

However, the deeper issue remains of what style even means for point clouds. When we look at style transfer for images, we see clearly what aspects of style are picked up by the system—colors, line curviness, patterns, etc.—and how those are applied to the content image. With 3D style, it is unclear what features a neural network would pick up on—would it key into the shapes of the different parts, the relative positions of things, the orientation of the model itself, or the patterns of points within the model itself?

===Conclusions and Future Work===

In this paper, we attempted to apply the neural style transfer techniques that have seen so much success in the domain of 2D images to the domain of 3D point clouds. Despite this, our system was unable to perform style transfer as is seen in the 2D style transfer systems. While negative results do not necessarily provide conclusive evidence, our exhaustive exploration provides a strong argument against this being possible.

However, there is room to explore style transfer within point cloud models. One idea would be to use a segmentation model (such as PointNet) to design a parts-based style transfer system, similar to (Lun et al. 2016). Additionally, exploring other neural networks and other ways of encoding style in a 3D space could provide interesting results as well.
===References===

Chollet, F. 2017. Deep Learning with Python. Manning Publications Co.

Engelmann, F.; Kontogianni, T.; Hermans, A.; and Leibe, B. 2017. Exploring spatial context for 3d semantic segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 716–724.

Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.

Hu, R.; Li, W.; Kaick, O. V.; Huang, H.; Averkiou, M.; Cohen-Or, D.; and Zhang, H. 2017. Co-locating style-defining elements on 3d shapes. ACM Transactions on Graphics (TOG) 36(3):33.

Jing, Y.; Yang, Y.; Feng, Z.; Ye, J.; Yu, Y.; and Song, M. 2017. Neural style transfer: A review. arXiv preprint arXiv:1705.04058.

Kalogerakis, E.; Chaudhuri, S.; Koller, D.; and Koltun, V. 2012. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG) 31(4):55.

Lun, Z.; Kalogerakis, E.; Wang, R.; and Sheffer, A. 2016. Functionality preserving shape style transfer. ACM Transactions on Graphics (TOG) 35(6):209.

Ma, C.; Huang, H.; Sheffer, A.; Kalogerakis, E.; and Wang, R. 2014. Analogy-driven 3d style transfer. In Computer Graphics Forum, volume 33, 175–184. Wiley Online Library.

Mazeika, J., and Whitehead, J. 2017. Solving for bespoke game assets: Applying style to 3d generative artifacts.

Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2016. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593.

Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413.

Ribeiro, P.; Pereira, F. C.; Marques, B. F.; Leitão, B.; Cardoso, A.; Polo, I.; and de Marrocos, P. 2003. A model for creativity in creature generation. In GAME-ON, 175.

Rupprecht, P. 2017. Understanding style transfer. https://ptrrupprecht.wordpress.com/2017/12/05/understanding-style-transfer/.

Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–1920.

Yi, L.; Kim, V. G.; Ceylan, D.; Shen, I.-C.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; and Guibas, L. 2016. A scalable active framework for region annotation in 3d shape collections. SIGGRAPH Asia.

Zheng, Y.; Cohen-Or, D.; and Mitra, N. J. 2013. Smart variations: Functional substructures for part compatibility. In Computer Graphics Forum, volume 32, 195–204. Wiley Online Library.

Zhou, Y., and Tuzel, O. 2017. Voxelnet: End-to-end learning for point cloud based 3d object detection. arXiv preprint arXiv:1711.06396.