This paper addresses affordance grounding from RGBD images of an object: localizing the surface regions that correspond to a text query describing an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete object geometry from partial RGBD observations and grounds affordances on the full shape, including unobserved regions. Our approach introduces sparse voxel fusion of multi-view features for constant-complexity generative reconstruction, a flow-based formulation that captures the inherent ambiguity of affordance distributions, and an active view-selection strategy guided by predicted affordances. Affostruction outperforms existing methods by large margins on challenging benchmarks, achieving 19.1 aIoU for affordance grounding and 32.67 IoU for 3D reconstruction.
- Sparse voxel fusion of multi-view RGBD features enables constant-complexity generative reconstruction, extrapolating complete geometry from partial observations.
- A flow-based model predicts affordance heatmaps on the reconstructed geometry, conditioned on text queries, capturing the multi-modality of functional interactions.
- Next-best views are selected based on predicted affordances to improve coverage of functional regions, enabling efficient reconstruction and grounding under limited view budgets.
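The sparse-fusion idea behind the first contribution can be sketched as follows: each view's depth map is back-projected through its camera, and per-pixel features are averaged into whichever sparse voxel the 3D point lands in. All names and signatures below (`fuse_rgbd_features`, the argument layout, the voxel size) are illustrative, not from the paper's implementation; the sketch only shows why the fused representation scales with occupied surface voxels rather than with the number of views.

```python
import numpy as np

def fuse_rgbd_features(depths, intrinsics, extrinsics, feats, voxel_size=0.02):
    """Average per-pixel features from multiple RGBD views into sparse voxels.

    depths:     list of (H, W) depth maps (0 = invalid)
    intrinsics: list of (3, 3) pinhole camera matrices
    extrinsics: list of (4, 4) camera-to-world transforms
    feats:      list of (H, W, C) per-pixel feature maps
    Returns {voxel index (3-tuple): mean feature (C,)}.
    """
    sums, counts = {}, {}
    for depth, K, T, fmap in zip(depths, intrinsics, extrinsics, feats):
        v, u = np.nonzero(depth > 0)                # valid pixels (row, col)
        z = depth[v, u]
        x = (u - K[0, 2]) * z / K[0, 0]             # back-project via intrinsics
        y = (v - K[1, 2]) * z / K[1, 1]
        pts = np.stack([x, y, z, np.ones_like(z)], axis=1)
        world = (T @ pts.T).T[:, :3]                # lift to world frame
        keys = np.floor(world / voxel_size).astype(int)
        for key, f in zip(map(tuple, keys), fmap[v, u]):
            sums[key] = sums.get(key, 0.0) + f
            counts[key] = counts.get(key, 0) + 1
    # Occupied voxels are bounded by the object's surface area, so the fused
    # representation does not grow with the number of views fused.
    return {k: sums[k] / counts[k] for k in sums}
```

Feeding the same view twice yields the same set of occupied voxels, which is the sense in which the representation has constant complexity in the view count.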
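The flow-based predictor of the second contribution can be illustrated with a generic flow-matching sampler: starting from Gaussian noise over per-point logits, a text-conditioned velocity field is Euler-integrated from t = 0 to t = 1, and different noise draws yield different plausible heatmaps. This is a hedged sketch, not the paper's model; `velocity_fn` stands in for a learned network, and the demo substitutes a closed-form velocity so the sampler's behavior is checkable.

```python
import numpy as np

def sample_affordance_heatmap(velocity_fn, text_embed, n_points, steps=20, seed=0):
    """Draw one affordance heatmap sample by Euler-integrating a flow.

    velocity_fn(h, t, text_embed) -> dh/dt is a stand-in for a learned,
    text-conditioned velocity network (hypothetical name and signature).
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n_points)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        h = h + dt * velocity_fn(h, t, text_embed)  # Euler step along the flow
    return 1.0 / (1.0 + np.exp(-h))                 # squash logits to [0, 1]

# Toy demo: with the closed-form velocity of a linear flow toward fixed
# target logits, the sampler recovers sigmoid(target) from any noise draw.
target = np.array([3.0, -3.0, 0.0])
heatmap = sample_affordance_heatmap(
    lambda h, t, _: (target - h) / (1.0 - t), text_embed=None, n_points=3)
```

With a trained velocity field instead of the toy one, re-running the sampler with different seeds produces distinct heatmaps, which is how a flow captures multi-modal affordance distributions.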
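One simple instantiation of affordance-guided view selection, for the third contribution, is a greedy criterion: score each candidate camera by the predicted affordance mass of still-unobserved points it would see, here with a normal-facing visibility test and no occlusion handling. This is an illustrative sketch under those assumptions, not the paper's actual selection strategy.

```python
import numpy as np

def next_best_view(affordance, normals, observed, cam_dirs):
    """Pick the candidate viewing direction whose visible, not-yet-observed
    points carry the most predicted affordance.

    affordance: (N,) predicted heatmap on reconstructed points
    normals:    (N, 3) unit surface normals
    observed:   (N,) bool mask of points covered by previous views
    cam_dirs:   (V, 3) unit viewing directions of candidate cameras
    Visibility is approximated by a normal-facing test (no occlusion check).
    """
    best_idx, best_score = 0, -np.inf
    for i, d in enumerate(cam_dirs):
        facing = normals @ (-d) > 0.0               # normal points toward camera
        score = affordance[facing & ~observed].sum()
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

After each selected view, the newly covered points would be added to `observed` and the criterion re-evaluated, so the remaining view budget keeps being spent on uncovered functional regions.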
3D reconstruction comparison. Geometry metrics: IoU↑, CD↓, F-score↑; appearance metrics: PSNR↑, LPIPS↓ and their -N variants. "Depth" marks methods that take depth input.

| Method | Depth | IoU↑ | CD↓ | F-score↑ | PSNR-N↑ | LPIPS-N↓ | PSNR↑ | LPIPS↓ |
|---|---|---|---|---|---|---|---|---|
| Shap-E [1] | | 6.39 | 0.6724 | 0.0096 | 16.16 | 0.3014 | 13.07 | 0.4358 |
| InstantMesh [2] | | 13.68 | 0.4063 | 0.0391 | 20.46 | 0.2463 | 16.59 | 0.2989 |
| LGM [3] | | 9.39 | 0.5660 | 0.0267 | 17.03 | 0.3974 | 13.59 | 0.4126 |
| TRELLIS [4] | | 19.49 | 0.3694 | 0.0496 | 20.96 | 0.2089 | 17.61 | 0.2435 |
| MCC [5] | ✔ | 21.11 | 0.3299 | 0.0648 | N/A | N/A | N/A | N/A |
| Affostruction (ours) | ✔ | 32.67 | 0.2427 | 0.0997 | 22.64 | 0.1421 | 18.84 | 0.1922 |
| Method | aIoU↑ | AUC↑ | SIM↑ | MAE↓ |
|---|---|---|---|---|
| OpenAD [6] | 3.1 | 64.8 | 0.329 | 0.150 |
| PointRefer [7] | 10.5 | 76.1 | 0.405 | 0.120 |
| Espresso-3D [8] | 13.6 | 79.0 | 0.429 | 0.111 |
| Affostruction (ours) | 19.1 | 72.0 | 0.426 | 0.217 |
| Method | Recon. | aIoU↑ | aCD↓ |
|---|---|---|---|
| OpenAD [6] | | 0.38 | 0.4165 |
| PointRefer [7] | | 0.53 | 0.3072 |
| Espresso-3D [8] | | 0.60 | 0.2885 |
| TRELLIS [4] + OpenAD [6] | ✔ | 1.49 | 0.1671 |
| TRELLIS [4] + PointRefer [7] | ✔ | 2.05 | 0.1576 |
| TRELLIS [4] + Espresso-3D [8] | ✔ | 2.23 | 0.1568 |
| MCC [5] + OpenAD [6] | ✔ | 3.34 | 0.1503 |
| MCC [5] + PointRefer [7] | ✔ | 4.19 | 0.1397 |
| MCC [5] + Espresso-3D [8] | ✔ | 4.74 | 0.1354 |
| Affostruction (ours) | ✔ | 9.26 | 0.1044 |
[1] H. Jun and A. Nichol, "Shap-E: Generating Conditional 3D Implicit Functions," arXiv preprint arXiv:2305.02463, 2023.
[2] J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, and Y. Shan, "InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-View Large Reconstruction Models," arXiv preprint arXiv:2404.07191, 2024.
[3] J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu, "LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation," in ECCV, 2024.
[4] J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang, "TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation," in CVPR, 2025.
[5] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer, and G. Gkioxari, "Multiview Compressive Coding for 3D Reconstruction," in CVPR, 2023.
[6] T. Nguyen, M. N. Vu, A. Vuong, D. Nguyen, T. Vo, N. Le, and A. Nguyen, "Open-Vocabulary Affordance Detection in 3D Point Clouds," in IROS, 2023.
[7] Y. Li, N. Zhao, J. Xiao, C. Feng, X. Wang, and T.-S. Chua, "LASO: Language-guided Affordance Segmentation on 3D Object," in CVPR, 2024.
[8] J. Lee, E. Park, C. Park, D. Kang, and M. Cho, "Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale," arXiv preprint arXiv:2506.12009, 2025.
@inproceedings{park2026affostruction,
title={Affostruction: 3D Affordance Grounding with Generative Reconstruction},
author={Park, Chunghyun and Lee, Seunghyeon and Cho, Minsu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026},
}