<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Adversarially-Guided 3D Shape Deformation via Differentiable Rendering and 2D Supervision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Gevasio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katarzyna Nieszporek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence, Czestochowa University of Technology</institution>
          ,
          <addr-line>Czestochowa</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>67</fpage>
      <lpage>74</lpage>
      <abstract>
<p>Recovering 3D geometry from 2D observations is a fundamental challenge in computer vision, with applications in animation, virtual reality, and robotics. Recent advances in differentiable rendering have enabled gradient-based optimization of 3D shapes using only image supervision. In this work, we propose a novel adversarial framework that enhances 3D mesh deformation by integrating a differentiable renderer into a Generative Adversarial Network (GAN). The generator deforms an initial mesh and optimizes textures to match 2D supervision from target images, while the discriminator, featuring dense connections and self-attention, learns to distinguish between real and synthesized renderings. Our method improves upon baseline differentiable renderers both quantitatively and qualitatively, achieving lower Chamfer distance and higher Intersection over Union (IoU) across a variety of object categories. The results demonstrate that adversarial training effectively guides mesh deformation, producing reconstructions that are more accurate and visually consistent with target images.</p>
      </abstract>
      <kwd-group>
        <kwd>differentiable rendering</kwd>
        <kwd>shape deformation</kwd>
        <kwd>2D guidance</kwd>
        <kwd>adversarial training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Reconstructing 3D geometry from 2D images is a long-standing goal in computer vision, with applications in virtual and augmented reality, medical imaging, robotics, and digital content creation. Accurate 3D models enable realistic simulations, enhanced diagnostics, and immersive experiences. However, inferring 3D structure from limited 2D information remains a fundamentally ill-posed problem, particularly when dealing with complex shapes, partial occlusions, or diverse object categories.</p>
      <p>Traditional 3D reconstruction methods include point cloud processing, voxel grids, and mesh-based optimization. While effective in constrained settings, these approaches often struggle with generalization and scalability. Voxel-based methods are limited by memory and resolution constraints, point cloud methods require dense input data, and mesh optimization techniques often need handcrafted objectives and careful initialization. These limitations are further exacerbated in dynamic or unconstrained environments.</p>
      <p>Recent advances in deep learning have enabled significant progress, particularly with the advent of differentiable rendering. By making the rendering process differentiable, neural networks can be trained end-to-end to optimize 3D shape and appearance directly from 2D images. Frameworks such as Soft Rasterizer [<xref ref-type="bibr" rid="ref1">1</xref>] and PyTorch3D [<xref ref-type="bibr" rid="ref2">2</xref>] allow backpropagation of image-space losses to 3D geometry, opening the door to more flexible and generalizable reconstruction pipelines.</p>
      <p>Despite these advances, current differentiable methods often produce over-smoothed or inaccurate shapes, particularly when only limited views are available [<xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>]. To address this, we propose augmenting differentiable rendering with adversarial supervision. Our method integrates a Generative Adversarial Network (GAN) [<xref ref-type="bibr" rid="ref6">6</xref>] into the mesh deformation pipeline, where the generator deforms an initial template mesh to match reference images, and the discriminator learns to distinguish real from generated renderings. The discriminator architecture includes dense connections and self-attention to effectively capture fine-grained spatial features [<xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>].</p>
      <p>Our contributions are as follows:
• We introduce an adversarially-guided 3D shape deformation method that leverages differentiable rendering and 2D supervision.
• We design a discriminator with dense blocks and self-attention to improve shape fidelity and detail preservation.
• We integrate texture optimization and silhouette supervision to refine appearance and geometry simultaneously.
• We demonstrate quantitatively and qualitatively that our approach outperforms baseline differentiable rendering methods across diverse object categories.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The problem of reconstructing 3D geometry from 2D observations has been studied extensively, with a range of techniques proposed over the years. These methods can be broadly categorized into classical reconstruction pipelines, deep learning-based approaches, and differentiable rendering frameworks. Recent efforts have also explored the integration of adversarial training to improve reconstruction fidelity.</p>
      <sec id="sec-2-1">
        <title>2.1. Classical 3D Reconstruction</title>
        <p>Traditional methods rely on structured representations such as point clouds, voxel grids, or explicit meshes. Point cloud-based methods require dense and accurate data, which is often impractical to obtain without expensive scanning equipment. Voxel-based approaches discretize space into uniform grids [? ], but suffer from memory and resolution limitations. Mesh-based methods, while efficient in representing surfaces, often require manual initialization and lack robustness in unconstrained scenarios.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Learning-Based Mesh Reconstruction</title>
        <p>Deep learning has significantly advanced mesh-based reconstruction. Pixel2Mesh [<xref ref-type="bibr" rid="ref11">11</xref>] introduced a framework that deforms an initial ellipsoid mesh using graph convolutional networks guided by 2D image features. It demonstrated the effectiveness of learning-based deformation but struggled with fine-grained topology and texture detail. The 3Deformer model [<xref ref-type="bibr" rid="ref12">12</xref>] further improved mesh deformation by incorporating image features into a neural mesh deformation pipeline, achieving high fidelity in geometry and structure. However, these methods are still sensitive to initialization and object complexity.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Differentiable Rendering</title>
        <p>Differentiable renderers provide a powerful tool for optimizing 3D representations directly from image-space losses. Soft Rasterizer (SoftRas) [<xref ref-type="bibr" rid="ref1">1</xref>] introduced a probabilistic rendering function that enables gradient-based optimization through occlusions and visibility. PyTorch3D [<xref ref-type="bibr" rid="ref2">2</xref>] extended this idea into a flexible rendering framework for 3D deep learning. Wen et al. [13] built upon differentiable rendering to jointly reconstruct shape and appearance from single-view images using an encoder-decoder architecture, demonstrating improved color and surface detail. However, these methods often suffer from oversmoothing and limited detail recovery, especially under sparse supervision.</p>
        <p>Nicolet et al. [14] proposed improving the stability of gradient-based optimization in differentiable rendering using sparse Cholesky factorization. While effective, such techniques are computationally intensive and remain sensitive to initialization and viewpoint ambiguity.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Adversarial Training for 3D Shape Generation</title>
        <p>Adversarial learning has recently been applied to 3D tasks to improve realism and detail preservation. For instance, GANs have been employed to refine volumetric reconstructions or to hallucinate missing geometry. However, applying adversarial training in the context of differentiable mesh deformation remains underexplored.</p>
        <p>Our work addresses this gap by introducing a discriminator tailored to rendered images, combining dense blocks and self-attention mechanisms to provide fine-grained feedback to the generator during training.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Summary</title>
        <p>In summary, while differentiable rendering has significantly improved 3D reconstruction from 2D supervision, challenges remain in achieving high-quality, generalizable mesh deformations. Adversarial training offers a promising solution by encouraging visual realism and better structural consistency. Our work builds upon this direction by integrating adversarial loss into a differentiable mesh optimization pipeline, enabling the reconstruction of more detailed and accurate 3D shapes.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We propose an adversarial training pipeline for 3D mesh deformation, guided by differentiable rendering and 2D supervision. The system consists of a generator that deforms a base mesh to match a target image, and a discriminator that evaluates the visual realism of rendered outputs. The generator is optimized using a combination of reconstruction and regularization losses, while the discriminator provides adversarial feedback based on rendered RGB images and silhouettes.</p>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>Given a target image of a 3D object, our goal is to deform a source mesh (initialized as a sphere) and optimize its texture to match the target. The mesh is rendered from multiple viewpoints using a differentiable renderer, and the rendered images are compared to the ground truth using a composite loss function. The system alternates between optimizing the generator (mesh and texture) and training the discriminator to distinguish between real and synthetic renderings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Preparation and Normalization</title>
        <p>To ensure generalization across diverse shapes, we construct a dataset using freely available 3D models in the OBJ format [15]. Each model includes geometry (.obj), materials (.mtl), and texture maps.</p>
        <p>Meshes are normalized by translating them to the origin and scaling them to fit inside a unit sphere. This ensures consistent scale and positioning, which simplifies optimization and stabilizes training. Meshes are loaded using PyTorch3D's load_objs_as_meshes function, which constructs batched Meshes objects for downstream processing.</p>
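        <p>The normalization step above can be sketched as follows (a minimal NumPy stand-in for the batched PyTorch3D mesh tensors; the function name normalize_mesh is ours, for illustration only):</p>
        <preformat>
```python
import numpy as np

def normalize_mesh(verts):
    """Center a vertex array at the origin and scale it into the unit sphere.

    Illustrative sketch of the normalization in Section 3.2; `verts` is an
    (N, 3) float array. The actual pipeline operates on PyTorch3D Meshes.
    """
    verts = np.asarray(verts, dtype=np.float64)
    center = verts.mean(axis=0)        # translate the centroid to the origin
    verts = verts - center
    radius = np.linalg.norm(verts, axis=1).max()
    return verts / radius              # all vertices now lie inside the unit sphere
```
        </preformat>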
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Differentiable Rendering Pipeline</title>
        <p>We render each mesh from multiple viewpoints using PyTorch3D [<xref ref-type="bibr" rid="ref2">2</xref>]. The rendering setup includes the following components.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Camera Configuration</title>
          <p>We use multiple perspective cameras placed at uniformly sampled viewpoints around the object. Camera transformations are computed using look_at_view_transform, and projection is performed using FoVPerspectiveCameras.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Lighting and Shading</title>
          <p>Lighting is modeled with a single PointLights source positioned above and to the side of the object. For RGB rendering, we employ a SoftPhongShader, which models ambient, diffuse, and specular reflections. For silhouette rendering, we use a SoftSilhouetteShader with a thresholded alpha channel to extract binary object contours.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Rasterization Settings</title>
          <p>Rasterization is configured with a fixed image resolution and blur radius. We adjust the number of faces per pixel to trade off quality and rendering speed.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Adversarial Discriminator</title>
        <p>The discriminator is a CNN designed to assess the realism of rendered RGB images. It integrates dense connections and self-attention to better capture spatial patterns and long-range dependencies.</p>
        <sec id="sec-3-4-1">
          <title>3.5.1. Architecture</title>
          <p>The network consists of an initial convolutional block followed by two dense blocks with growth rate 32 and intermediate channels of 64 and 192, respectively. Each dense block is followed by a self-attention layer with scaled dot-product attention. A final convolutional layer reduces the feature map to a scalar output passed through a sigmoid activation function. Spectral normalization is applied to all convolutional layers to stabilize adversarial training.</p>
        </sec>
      </sec>
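      <p>The scaled dot-product self-attention used in the discriminator can be sketched as follows (a minimal NumPy version over flattened spatial positions; the projection-matrix names are illustrative, and the actual layers additionally use spectral normalization):</p>
      <preformat>
```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, wq, wk, wv):
    """Scaled dot-product self-attention over flattened spatial positions.

    `feats` is (n, d): n positions with d channels; wq/wk/wv are (d, d)
    projections. Hypothetical minimal sketch of the attention layer in
    Section 3.5.1, not the paper's exact implementation.
    """
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n, n) similarities, scaled by sqrt(d)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # attention-weighted mixture of values
```
      </preformat>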
      <sec id="sec-3-5">
        <title>3.4. Loss Functions</title>
        <p>The generator is trained to minimize a composite loss comprising multiple terms:</p>
        <p>• RGB Loss ℒRGB: L2 loss between rendered and target RGB images.
• Silhouette Loss ℒsil: L2 loss between rendered and target silhouettes.
• Edge Loss ℒedge: encourages preservation of mesh edge lengths to prevent distortion.
• Normal Consistency Loss ℒnorm: promotes smooth surfaces by enforcing normal alignment between adjacent faces.
• Laplacian Smoothing Loss ℒlap: penalizes large deviations from the mean vertex position.
• Adversarial Loss ℒadv: binary cross-entropy loss from the discriminator, encouraging realism in rendered outputs.</p>
        <p>The total generator loss is defined as ℒ = λRGBℒRGB + λsilℒsil + λedgeℒedge + λnormℒnorm + λlapℒlap + λadvℒadv, where the weights λ are set empirically (see Section 4).</p>
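        <p>The composite objective can be sketched as a weighted sum of per-term scalars (a NumPy sketch; the helper and its argument names are ours, and the regularizer values stand in for the edge, normal, Laplacian, and adversarial terms computed elsewhere in the pipeline):</p>
        <preformat>
```python
import numpy as np

def composite_loss(rendered, target, rendered_sil, target_sil, regularizers, weights):
    """Weighted sum of the generator's loss terms (cf. Section 3.4).

    `rendered`/`target` are RGB arrays, `*_sil` are silhouette arrays, and
    `regularizers` maps term names ('edge', 'norm', 'lap', 'adv') to
    precomputed scalars. Illustrative only; weights are set empirically.
    """
    losses = {
        'rgb': float(np.mean((rendered - target) ** 2)),          # L2 on RGB images
        'sil': float(np.mean((rendered_sil - target_sil) ** 2)),  # L2 on silhouettes
    }
    losses.update(regularizers)
    return sum(weights[name] * value for name, value in losses.items())
```
        </preformat>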
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Training Procedure</title>
        <p>Training proceeds in alternating steps:
1. Generator step: A batch of viewpoints is
sampled. The generator deforms the mesh and
optimizes texture to minimize the total loss ℒ.
2. Discriminator step: The discriminator receives
real target images and generated renderings. It is
trained using binary cross-entropy to maximize
classification accuracy.</p>
        <p>The generator is optimized using stochastic gradient descent with momentum, while the discriminator uses the Adam optimizer. Training is performed for a fixed number of iterations, with periodic visualizations to track progress.</p>
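        <p>The alternating procedure can be sketched as follows (illustrative Python; bce and the two step callables are stand-ins for the actual SGD-with-momentum and Adam updates):</p>
        <preformat>
```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, used to train the discriminator and as the
    generator's adversarial term (sketch of the objective in Section 3.6)."""
    p = np.clip(np.asarray(pred, dtype=np.float64), eps, 1 - eps)
    t = np.asarray(target, dtype=np.float64)
    return float(-np.mean(t * np.log(p) + (1 - t) * np.log(1 - p)))

def train(num_iters, generator_step, discriminator_step):
    """Alternate one generator and one discriminator update per iteration.
    The callables are placeholders for the mesh/texture update and the
    real-vs-rendered classification update described in Section 3.6."""
    history = []
    for _ in range(num_iters):
        history.append((generator_step(), discriminator_step()))
    return history
```
        </preformat>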
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Implementation Details</title>
        <p>Our implementation uses PyTorch and PyTorch3D, with training conducted on Google Colab using an NVIDIA GPU runtime. All meshes are batched for efficient parallel processing. Code modules are structured for data loading, rendering, loss computation, and model optimization.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We evaluate our method on a diverse set of 3D objects and compare it against a baseline differentiable rendering pipeline using PyTorch3D. Both quantitative metrics and qualitative visualizations are used to assess reconstruction accuracy, mesh quality, and generalization capability.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metrics</title>
        <sec id="sec-4-2-1">
          <p>We use the following metrics to assess performance:</p>
          <p>• Reconstruction Loss: the total loss defined in Section 3, combining RGB, silhouette, and regularization terms.
• Chamfer Distance: measures point-wise similarity between predicted and target meshes.
• Intersection over Union (IoU): measures volumetric overlap between the generated and ground truth meshes.
• Visual Quality: qualitative comparisons of mesh renderings across viewpoints.</p>
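          <p>The two geometric metrics can be sketched directly in NumPy (brute-force versions for illustration; chamfer_distance and voxel_iou are our names, and the actual evaluation may use accelerated implementations):</p>
          <preformat>
```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3):
    the sum of the two directed mean nearest-neighbor distances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def voxel_iou(occ_a, occ_b):
    """Volumetric IoU between two boolean occupancy grids."""
    occ_a, occ_b = np.asarray(occ_a, bool), np.asarray(occ_b, bool)
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / union
```
          </preformat>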
        </sec>
        <sec id="sec-4-2-2">
          <p>Unless otherwise noted, all reported values correspond to 2000 training iterations. Extended results for 10000 iterations are provided in the appendix.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Quantitative Results</title>
        <p>In Figure 2, we observe that PyTorch3D fails to accurately reconstruct the finger geometry of the hand object, while our model preserves the detailed articulation more effectively.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Extended Training Analysis</title>
        <p>Training the models for 10000 iterations improves both reconstruction loss and geometric fidelity. Full tables and visualizations are provided in Appendix A. Notably, our method consistently outperforms the baseline in Chamfer distance and IoU with longer training, especially for high-frequency shapes such as hand and sword.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Extended Results</title>
      <sec id="sec-5-1">
        <p>This appendix presents additional experimental results obtained by training the models for 10,000 iterations, along with loss progression plots and extended visual comparisons for both training durations.</p>
        <sec id="sec-5-1-1">
          <title>5.1. Training Loss Evolution</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.2. Quantitative Results at 10,000 Iterations</title>
        </sec>
        <sec id="sec-5-1-4">
          <title>5.3. Qualitative Comparisons</title>
          <p>Our model produces more accurate deformations, especially in object regions with complex structure or fine detail (e.g., hand, sword). These improvements validate the benefit of adversarial supervision for guiding mesh optimization under weak 2D supervision. Compared to the baseline, which often fails to preserve sharp boundaries or introduces artifacts in regions with occlusion or high curvature, our approach maintains geometric consistency and enhances fidelity to the silhouette and inner contours observed in the target images. Notably, at 10,000 iterations, the refinement introduced by our method leads to significant alignment not only in the external silhouette but also in internal features such as joint articulation and surface topology, confirming the progressive advantage of adversarial cues over traditional loss-only optimization strategies.</p>
          <p>Furthermore, the visual quality improvements observed in later iterations indicate that the adversarial discriminator plays a crucial role in discouraging unrealistic deformations and encouraging plausible mesh structures even when direct pixel supervision is limited. This qualitative evidence complements the quantitative results reported in Section 4, and supports the hypothesis that leveraging learned priors from adversarial training leads to more robust and semantically coherent reconstructions, particularly when only sparse or partial supervision is available.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We presented an adversarial framework for 3D shape deformation guided by differentiable rendering and 2D image supervision. By integrating a mesh generator with a self-attention-based discriminator, our method improves the visual quality and geometric accuracy of reconstructed 3D meshes from sparse image inputs.</p>
      <p>Our results demonstrate that adversarial training can enhance mesh fidelity over standard differentiable rendering pipelines. Quantitatively, our method achieves lower Chamfer distances and higher Intersection over Union scores across multiple object categories. Qualitatively, it produces more realistic deformations, especially in regions with fine-grained geometry such as limbs or object extremities.</p>
      <p>This approach contributes to the broader goal of building generalizable, high-fidelity 3D reconstruction systems that operate under weak supervision. Our design remains simple and modular, leveraging widely available toolkits such as PyTorch3D and standard GAN components.</p>
      <p>Future work will explore extending this framework to dynamic or articulated objects, learning category-specific priors, and incorporating temporal consistency for video-based shape reconstruction. Additionally, improving texture fidelity and integrating semantic segmentation into the adversarial loss are promising directions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and spelling checks and for paraphrasing and rewording. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Soft rasterizer: A differentiable renderer for image-based 3d reasoning</article-title>
          , 2019. URL: https://arxiv.org/abs/1904.01786. arXiv:1904.01786.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reizenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Novotny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , W.-Y. Lo, J. Johnson, G. Gkioxari,
          <article-title>Accelerating 3d deep learning with pytorch3d</article-title>
          , arXiv:2007.08501 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] N. Brandizzi, A. Fanti, R. Gallotta, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Unsupervised pose estimation by means of an innovative vision transformer, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 13589 LNAI, 2023, p. 3-20. doi:10.1007/978-3-031-23480-4_1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] N. Boutarfaia, S. Russo, A. Tibermacine, I. E. Tibermacine, Deep learning for eeg-based motor imagery classification: Towards enhanced human-machine interaction and assistive robotics, in: CEUR Workshop Proceedings, volume 3695, 2023, p. 68-74.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 13196 LNAI, 2022, p. 310-325. doi:10.1007/978-3-031-08421-8_21.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Zhu, Y. Zhang, Q. Feng, Colorful 3d reconstruction from a single image based on deep learning, in: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, ACAI '20, Association for Computing Machinery, New York, NY, USA, 2021. URL: https://doi.org/10.1145/3446132.3446157. doi:10.1145/3446132.3446157.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] B. Nicolet, A. Jacobson, W. Jakob, Large steps in inverse rendering of geometry, ACM Trans. Graph. 40 (2021). URL: https://doi.org/10.1145/3478513.3480501. doi:10.1145/3478513.3480501.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Free3D.com, 3d models for free, 2024. URL: https://free3d.com/3d-models/obj.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Enhancing eeg signal reconstruction in cross-domain adaptation using cyclegan</article-title>
          ,
          <source>in: Proceedings - 2024 International Conference on Telecommunications and Intelligent Systems, ICTIS 2024</source>
          , 2024. doi:10.1109/ICTIS62692.2024.10894543.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , E. Tramontana,
          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <article-title>Is the colony of ants able to recognize graphic objects?</article-title>
          ,
          <source>Communications in Computer and Information Science</source>
          <volume>538</volume>
          (
          <year>2015</year>
          )
          <fpage>376</fpage>
          -
          <lpage>387</lpage>
          . doi:10.1007/978-3-319-24770-0_33.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabryel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <article-title>Can we process 2D images using artificial bee colony?</article-title>
          ,
          <source>in: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)</source>
          , volume
          <volume>9119</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>660</fpage>
          -
          <lpage>671</lpage>
          . doi:10.1007/978-3-319-19324-3_59.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Marszalek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wozniak</surname>
          </string-name>
          ,
          <article-title>Simplified firefly algorithm for 2d image key-points search</article-title>
          ,
          <source>in: IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIHLI 2014: 2014 IEEE Symposium on Computational Intelligence for Human-Like Intelligence, Proceedings</source>
          ,
          <year>2014</year>
          . doi:10.1109/CIHLI.2014.7013395.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shikler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Organic solar cells defects classification by using a new feature extraction algorithm and an EBNN with an innovative pruning algorithm</article-title>
          ,
          <source>International Journal of Intelligent Systems</source>
          <volume>36</volume>
          (
          <year>2021</year>
          )
          <fpage>2443</fpage>
          -
          <lpage>2464</lpage>
          . doi:10.1002/int.22386.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Pixel2Mesh: 3D mesh model generation via image guided deformation</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>43</volume>
          (
          <year>2021</year>
          )
          <fpage>3600</fpage>
          -
          <lpage>3613</lpage>
          . doi:10.1109/TPAMI.2020.2984232.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          , 3deformer: A
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>