Affostruction: 3D Affordance Grounding with Generative Reconstruction

¹POSTECH, ²RLWRLD
CVPR 2026
[Teaser figure]

Given single or multi-view RGBD images, Affostruction performs generative reconstruction to complete occluded regions and grounds affordances on the full shape, enabling progressive improvement through affordance-driven active view selection.

Abstract

This paper addresses affordance grounding from RGBD images of an object: localizing the surface regions that correspond to a text query describing an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape, including unobserved regions. We make three core contributions: (1) generative multi-view reconstruction via sparse voxel fusion, which extrapolates unseen geometry while keeping token complexity constant in the number of views; (2) flow-based affordance grounding, which captures the inherent ambiguity of affordance distributions; and (3) affordance-driven active view selection, which uses predicted affordances to sample informative viewpoints. Affostruction achieves 19.1 aIoU on affordance grounding (a 40.4% relative improvement) and 32.67 IoU on 3D reconstruction (a 67.7% relative improvement), enabling accurate affordance prediction on complete shapes.
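For intuition on the flow-based grounding contribution, the sketch below shows a generic Euler sampler for a flow-matching model. `velocity_fn` stands in for a trained network such as the paper's Sparse Flow Transformer; the function names, step count, and conditioning interface are illustrative assumptions, not the authors' implementation. Because sampling starts from fresh Gaussian noise each run, repeated runs yield different plausible heatmaps, which is how a flow model can represent ambiguous affordance distributions.

```python
import numpy as np

def sample_heatmap(velocity_fn, text_feat, num_voxels, steps=20, seed=0):
    """Euler integration of a flow-matching model from noise (t=0) to
    affordance logits (t=1). `velocity_fn(x, t, cond)` stands in for the
    trained network; it predicts the instantaneous velocity dx/dt.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_voxels)            # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, text_feat)  # Euler step along the flow
    return x                                       # per-voxel affordance logits
```

Resampling with a different seed produces a different draw from the learned distribution; averaging many draws would give a point estimate of the heatmap.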

Affostruction Framework Overview

Our approach consists of three stages. (1) Generative multi-view reconstruction: DINOv2 features from multiple RGBD views are fused into sparse voxels using depth and camera parameters. A Flow Transformer, conditioned on these multi-view features and trained with stochastic multi-view training, extrapolates the complete 3D structure from partial observations, which is decoded by a frozen sparse structure decoder (left). (2) Flow-based affordance grounding: a Sparse Flow Transformer conditioned on a CLIP-encoded text query generates affordance heatmap logits over the reconstructed geometry (center). (3) Affordance-driven active view selection: the next-best viewpoint is selected to maximize visibility of high-affordance regions, with a frozen mesh decoder used for surface extraction (right). Together, these stages enable affordance prediction on complete geometry from partial observations, with predicted affordances guiding view selection toward functional regions.
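The fusion step of stage (1) can be sketched as follows: per-pixel features are back-projected into a shared world-space voxel grid using depth and camera parameters, and features landing in the same voxel are averaged, so the token count depends on occupied voxels rather than the number of views. All names, the pinhole model, and the averaging scheme here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def backproject_features(feat, depth, K, cam2world, voxel_size=0.05):
    """Back-project per-pixel features into world-space voxel indices.

    feat:      (H, W, C) per-pixel features (e.g. upsampled DINOv2 features)
    depth:     (H, W) metric depth map
    K:         (3, 3) camera intrinsics
    cam2world: (4, 4) camera-to-world extrinsics
    Returns (N, 3) integer voxel indices and (N, C) features for valid pixels.
    """
    H, W, _ = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    # Pixel -> camera coordinates via the pinhole model.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (N, 4)
    pts_world = (cam2world @ pts_cam.T).T[:, :3]              # (N, 3)
    return np.floor(pts_world / voxel_size).astype(np.int64), feat[valid]

def fuse_views(views, voxel_size=0.05):
    """Average the features of all points that fall into the same voxel,
    so the fused token count is independent of the number of views."""
    sums, counts = {}, {}
    for feat, depth, K, cam2world in views:
        idx, f = backproject_features(feat, depth, K, cam2world, voxel_size)
        for key, fi in zip(map(tuple, idx), f):
            sums[key] = sums.get(key, 0.0) + fi
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

In the actual pipeline the fused voxel features would condition the Flow Transformer; here the dictionary of voxel features stands in for that sparse token set.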

3D Reconstruction Results on Toys4k

| Method | IoU↑ | CD↓ | F-score↑ | PSNR-N↑ | LPIPS-N↓ | PSNR↑ | LPIPS↓ |
|---|---|---|---|---|---|---|---|
| Shap-E [1] | 6.39 | 0.6724 | 0.0096 | 16.16 | 0.3014 | 13.07 | 0.4358 |
| InstantMesh [2] | 13.68 | 0.4063 | 0.0391 | 20.46 | 0.2463 | 16.59 | 0.2989 |
| LGM [3] | 9.39 | 0.5660 | 0.0267 | 17.03 | 0.3974 | 13.59 | 0.4126 |
| TRELLIS [4] | 19.49 | 0.3694 | 0.0496 | 20.96 | 0.2089 | 17.61 | 0.2435 |
| MCC [5] | 21.11 | 0.3299 | 0.0648 | N/A | N/A | N/A | N/A |
| Affostruction (ours) | 32.67 | 0.2427 | 0.0997 | 22.64 | 0.1421 | 18.84 | 0.1922 |

Complete 3D Affordance Grounding Results on Affogato-150K

| Method | aIoU↑ | AUC↑ | SIM↑ | MAE↓ |
|---|---|---|---|---|
| OpenAD [6] | 3.1 | 64.8 | 0.329 | 0.150 |
| PointRefer [7] | 10.5 | 76.1 | 0.405 | 0.120 |
| Espresso-3D [8] | 13.6 | 79.0 | 0.429 | 0.111 |
| Affostruction (ours) | 19.1 | 72.0 | 0.426 | 0.217 |

Partial 3D Affordance Grounding Results on Affogato-150K

| Method (recon. + grounding) | aIoU↑ | aCD↓ |
|---|---|---|
| OpenAD [6] | 0.38 | 0.4165 |
| PointRefer [7] | 0.53 | 0.3072 |
| Espresso-3D [8] | 0.60 | 0.2885 |
| TRELLIS [4] + OpenAD [6] | 1.49 | 0.1671 |
| TRELLIS [4] + PointRefer [7] | 2.05 | 0.1576 |
| TRELLIS [4] + Espresso-3D [8] | 2.23 | 0.1568 |
| MCC [5] + OpenAD [6] | 3.34 | 0.1503 |
| MCC [5] + PointRefer [7] | 4.19 | 0.1397 |
| MCC [5] + Espresso-3D [8] | 4.74 | 0.1354 |
| Affostruction (ours) | 9.26 | 0.1044 |

Qualitative Results on Affogato-150K

Progressive Improvement via Active View Selection

Starting from challenging viewpoints where the target regions are barely visible, Affostruction progressively refines both geometry and localization through an iterative cycle: (1) generative reconstruction extrapolates the complete structure from partial observations, (2) affordances are predicted on the reconstructed geometry, and (3) active view selection targets informative viewpoints. Each iteration improves both reconstruction quality and prediction accuracy, revealing the synergy between the two tasks. Only the newly selected view is shown for clarity; each subsequent iteration fuses all accumulated observations via multi-view fusion.
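Step (3) of the cycle above can be sketched as a simple scoring loop: each candidate camera pose is scored by how much high-affordance surface it would see, and the highest-scoring pose is chosen. The threshold, the front-facing visibility test, and all names are illustrative assumptions; a full implementation would also ray-cast against the extracted mesh to handle occlusion.

```python
import numpy as np

def select_next_view(points, normals, affordance, candidates, tau=0.5):
    """Pick the candidate camera position that sees the most high-affordance
    surface. Visibility is approximated by a front-facing normal test.

    points:     (N, 3) surface points of the reconstructed shape
    normals:    (N, 3) unit outward normals
    affordance: (N,) predicted affordance scores in [0, 1]
    candidates: (M, 3) candidate camera positions looking at the object
    """
    hot = affordance > tau                            # high-affordance region
    best_score, best_idx = -np.inf, -1
    for i, cam in enumerate(candidates):
        view_dir = cam[None, :] - points              # point -> camera rays
        view_dir /= np.linalg.norm(view_dir, axis=1, keepdims=True)
        facing = (normals * view_dir).sum(axis=1) > 0  # front-facing test
        score = affordance[hot & facing].sum()        # visible affordance mass
        if score > best_score:
            best_score, best_idx = score, i
    return best_idx
```

The chosen view is then captured, fused with the earlier observations, and the reconstruct-predict-select cycle repeats.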

References

[1] H. Jun and A. Nichol, "Shap-E: Generating Conditional 3D Implicit Functions," arXiv preprint arXiv:2305.02463, 2023.

[2] J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, and Y. Shan, "InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-View Large Reconstruction Models," arXiv preprint arXiv:2404.07191, 2024.

[3] J. Tang, Z. Ren, H. Zhou, Z. Liu, and G. Zeng, "LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation," in ECCV, 2024.

[4] J. Xiang, X. Zeng, Z. Wu, Y. Lu, Y. Li, M.-H. Chen, and S.-H. Zhang, "TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation," in CVPR, 2025.

[5] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer, and G. Gkioxari, "Multiview Compressive Coding for 3D Reconstruction," in CVPR, 2023.

[6] T. Nguyen, M. N. Vu, A. Vuong, D. Nguyen, T. Vo, N. Le, and A. Nguyen, "Open-Vocabulary Affordance Detection in 3D Point Clouds," in IROS, 2023.

[7] T. Nguyen, M. N. Vu, B. Huang, T. V. Vo, V. Truong, N. Le, T. Vo, B. Le, and A. Nguyen, "Language-Conditioned Affordance-Pose Detection in 3D Point Clouds," in ICRA, 2024.

[8] J. Lee, E. Park, C. Park, D. Kang, and M. Cho, "Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale," arXiv preprint arXiv:2506.12009, 2025.

BibTeX

@inproceedings{park2026affostruction,
  title={Affostruction: 3D Affordance Grounding with Generative Reconstruction},
  author={Park, Chunghyun and Lee, Seunghyeon and Cho, Minsu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}