Affostruction: 3D Affordance Grounding with Generative Reconstruction

¹POSTECH, ²RLWRLD
CVPR 2026
[Teaser figure]

Given single or multi-view RGBD images, Affostruction performs generative reconstruction to complete occluded regions and grounds affordances on the full shape, enabling progressive improvement through affordance-driven active view selection.

Abstract

This paper addresses affordance grounding from RGBD images of an object: localizing the surface regions that correspond to a text query describing an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape, including unobserved regions. We make three core contributions: (1) generative multi-view reconstruction via sparse voxel fusion, which extrapolates unseen geometry while keeping token complexity constant in the number of views; (2) flow-based affordance grounding, which captures the inherent ambiguity of affordance distributions; and (3) affordance-driven active view selection, which uses predicted affordances to sample informative viewpoints. Affostruction achieves 19.1 aIoU on affordance grounding (a 40.4% relative improvement) and 32.67 IoU on 3D reconstruction (a 67.7% relative improvement), enabling accurate affordance prediction on complete shapes.
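For intuition on the flow-based grounding contribution, the sketch below shows a generic Euler sampler for a flow-matching model. `velocity_fn` stands in for a trained network such as the paper's Sparse Flow Transformer; the function names, step count, and conditioning interface are illustrative assumptions, not the authors' implementation. Because sampling starts from fresh Gaussian noise each run, repeated runs yield different plausible heatmaps, which is how a flow model can represent ambiguous affordance distributions.

```python
import numpy as np

def sample_heatmap(velocity_fn, text_feat, num_voxels, steps=20, seed=0):
    """Euler integration of a flow-matching model from noise (t=0) to
    affordance logits (t=1). `velocity_fn(x, t, cond)` stands in for the
    trained network; it predicts the instantaneous velocity dx/dt.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_voxels)            # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, text_feat)  # Euler step along the flow
    return x                                       # per-voxel affordance logits
```

Resampling with a different seed produces a different draw from the learned distribution; averaging many draws would give a point estimate of the heatmap.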

Affostruction Framework Overview

Our approach consists of three stages. (1) Generative multi-view reconstruction: DINOv2 features from multiple RGBD views are fused into sparse voxels using depth and camera parameters. A Flow Transformer, conditioned on these multi-view features and trained with stochastic multi-view training, extrapolates the complete 3D structure from partial observations, which is decoded by a frozen sparse structure decoder (left). (2) Flow-based affordance grounding: a Sparse Flow Transformer conditioned on a CLIP-encoded text query generates affordance heatmap logits over the reconstructed geometry (center). (3) Affordance-driven active view selection: the next-best viewpoint is selected to maximize visibility of high-affordance regions, with a frozen mesh decoder used for surface extraction (right). Together, these stages enable affordance prediction on complete geometry from partial observations, with predicted affordances guiding view selection toward functional regions.
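The fusion step of stage (1) can be sketched as follows: per-pixel features are back-projected into a shared world-space voxel grid using depth and camera parameters, and features landing in the same voxel are averaged, so the token count depends on occupied voxels rather than the number of views. All names, the pinhole model, and the averaging scheme here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def backproject_features(feat, depth, K, cam2world, voxel_size=0.05):
    """Back-project per-pixel features into world-space voxel indices.

    feat:      (H, W, C) per-pixel features (e.g. upsampled DINOv2 features)
    depth:     (H, W) metric depth map
    K:         (3, 3) camera intrinsics
    cam2world: (4, 4) camera-to-world extrinsics
    Returns (N, 3) integer voxel indices and (N, C) features for valid pixels.
    """
    H, W, _ = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    # Pixel -> camera coordinates via the pinhole model.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (N, 4)
    pts_world = (cam2world @ pts_cam.T).T[:, :3]              # (N, 3)
    return np.floor(pts_world / voxel_size).astype(np.int64), feat[valid]

def fuse_views(views, voxel_size=0.05):
    """Average the features of all points that fall into the same voxel,
    so the fused token count is independent of the number of views."""
    sums, counts = {}, {}
    for feat, depth, K, cam2world in views:
        idx, f = backproject_features(feat, depth, K, cam2world, voxel_size)
        for key, fi in zip(map(tuple, idx), f):
            sums[key] = sums.get(key, 0.0) + fi
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

In the actual pipeline the fused voxel features would condition the Flow Transformer; here the dictionary of voxel features stands in for that sparse token set.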

3D Reconstruction Results on Toys4k

| Method | IoU↑ | CD↓ | F-score↑ | PSNR-N↑ | LPIPS-N↓ | PSNR↑ | LPIPS↓ |
|---|---|---|---|---|---|---|---|
| Shap-E [1] | 6.39 | 0.6724 | 0.0096 | 16.16 | 0.3014 | 13.07 | 0.4358 |
| InstantMesh [2] | 13.68 | 0.4063 | 0.0391 | 20.46 | 0.2463 | 16.59 | 0.2989 |
| LGM [3] | 9.39 | 0.5660 | 0.0267 | 17.03 | 0.3974 | 13.59 | 0.4126 |
| TRELLIS [4] | 19.49 | 0.3694 | 0.0496 | 20.96 | 0.2089 | 17.61 | 0.2435 |
| MCC [5] | 21.11 | 0.3299 | 0.0648 | N/A | N/A | N/A | N/A |
| Affostruction (ours) | 32.67 | 0.2427 | 0.0997 | 22.64 | 0.1421 | 18.84 | 0.1922 |

Complete 3D Affordance Grounding Results on Affogato-150K

| Method | aIoU↑ | AUC↑ | SIM↑ | MAE↓ |
|---|---|---|---|---|
| OpenAD [6] | 3.1 | 64.8 | 0.329 | 0.150 |
| PointRefer [7] | 10.5 | 76.1 | 0.405 | 0.120 |
| Espresso-3D [8] | 13.6 | 79.0 | 0.429 | 0.111 |
| Affostruction (ours) | 19.1 | 72.0 | 0.426 | 0.217 |

Partial 3D Affordance Grounding Results on Affogato-150K

| Method (recon. + grounding) | aIoU↑ | aCD↓ |
|---|---|---|
| OpenAD [6] | 0.38 | 0.4165 |
| PointRefer [7] | 0.53 | 0.3072 |
| Espresso-3D [8] | 0.60 | 0.2885 |
| TRELLIS [4] + OpenAD [6] | 1.49 | 0.1671 |
| TRELLIS [4] + PointRefer [7] | 2.05 | 0.1576 |
| TRELLIS [4] + Espresso-3D [8] | 2.23 | 0.1568 |
| MCC [5] + OpenAD [6] | 3.34 | 0.1503 |
| MCC [5] + PointRefer [7] | 4.19 | 0.1397 |
| MCC [5] + Espresso-3D [8] | 4.74 | 0.1354 |
| Affostruction (ours) | 9.26 | 0.1044 |

Qualitative Results on Affogato-150K

Progressive Improvement via Active View Selection

Starting from challenging viewpoints where the target regions are barely visible, Affostruction progressively refines both geometry and localization through an iterative cycle: (1) generative reconstruction extrapolates the complete structure from partial observations, (2) affordances are predicted on the reconstructed geometry, and (3) active view selection targets informative viewpoints. Each iteration improves both reconstruction quality and prediction accuracy, revealing the synergy between the two tasks. Only the newly selected view is shown for clarity; each subsequent iteration fuses all accumulated observations via multi-view fusion.
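Step (3) of the cycle above can be sketched as a simple scoring loop: each candidate camera pose is scored by how much high-affordance surface it would see, and the highest-scoring pose is chosen. The threshold, the front-facing visibility test, and all names are illustrative assumptions; a full implementation would also ray-cast against the extracted mesh to handle occlusion.

```python
import numpy as np

def select_next_view(points, normals, affordance, candidates, tau=0.5):
    """Pick the candidate camera position that sees the most high-affordance
    surface. Visibility is approximated by a front-facing normal test.

    points:     (N, 3) surface points of the reconstructed shape
    normals:    (N, 3) unit outward normals
    affordance: (N,) predicted affordance scores in [0, 1]
    candidates: (M, 3) candidate camera positions looking at the object
    """
    hot = affordance > tau                            # high-affordance region
    best_score, best_idx = -np.inf, -1
    for i, cam in enumerate(candidates):
        view_dir = cam[None, :] - points              # point -> camera rays
        view_dir /= np.linalg.norm(view_dir, axis=1, keepdims=True)
        facing = (normals * view_dir).sum(axis=1) > 0  # front-facing test
        score = affordance[hot & facing].sum()        # visible affordance mass
        if score > best_score:
            best_score, best_idx = score, i
    return best_idx
```

The chosen view is then captured, fused with the earlier observations, and the reconstruct-predict-select cycle repeats.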

References

[1] H. Jun and A. Nichol, "Shap-E: Generating Conditional 3D Implicit Functions," arXiv preprint arXiv:2305.02463, 2023.

[2] J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, and Y. Shan, "InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-View Large Reconstruction Models," arXiv preprint arXiv:2404.07191, 2024.

[3] J. Tang, Z. Ren, H. Zhou, Z. Liu, and G. Zeng, "LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation," in ECCV, 2024.

[4] J. Xiang, X. Zeng, Z. Wu, Y. Lu, Y. Li, M.-H. Chen, and S.-H. Zhang, "TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation," in CVPR, 2025.

[5] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer, and G. Gkioxari, "Multiview Compressive Coding for 3D Reconstruction," in CVPR, 2023.

[6] T. Nguyen, M. N. Vu, A. Vuong, D. Nguyen, T. Vo, N. Le, and A. Nguyen, "Open-Vocabulary Affordance Detection in 3D Point Clouds," in IROS, 2023.

[7] T. Nguyen, M. N. Vu, B. Huang, T. V. Vo, V. Truong, N. Le, T. Vo, B. Le, and A. Nguyen, "Language-Conditioned Affordance-Pose Detection in 3D Point Clouds," in ICRA, 2024.

[8] J. Lee, E. Park, C. Park, D. Kang, and M. Cho, "Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale," arXiv preprint arXiv:2506.12009, 2025.

BibTeX

@inproceedings{park2026affostruction,
  title={Affostruction: 3D Affordance Grounding with Generative Reconstruction},
  author={Park, Chunghyun and Lee, Seunghyeon and Cho, Minsu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}