Affostruction: 3D Affordance Grounding with Generative Reconstruction

1POSTECH, 2Ewha Womans University, 3RLWRLD
CVPR 2026
Teaser Image

Given single- or multi-view RGBD images, Affostruction performs generative reconstruction to complete occluded regions and grounds affordances on the full shape, enabling progressive improvement through affordance-driven active view selection.

Abstract

This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape including unobserved regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures inherent ambiguity in affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (40.4% improvement) and 32.67 IoU for 3D reconstruction (67.7% improvement), enabling accurate affordance prediction on complete shapes.
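The flow-based grounding mentioned above can be illustrated with a minimal conditional flow-matching training step. This is a sketch under assumed conventions (rectified-flow linear path, a hypothetical velocity network `model`), not the paper's implementation; because sampling starts from random noise, different draws yield different plausible heatmaps, which is how such models capture ambiguity in affordance distributions.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """One conditional flow-matching training step (rectified-flow form).

    x1: target affordance logits, shape (N, D); cond: conditioning features.
    The model(x_t, t, cond) is trained to predict the velocity x1 - x0 along
    the straight path from a noise sample x0 to the data sample x1.
    """
    x0 = rng.standard_normal(x1.shape)         # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))     # random time in [0, 1)
    xt = (1 - t) * x0 + t * x1                 # point on the linear path
    v_target = x1 - x0                         # constant velocity of that path
    v_pred = model(xt, t, cond)
    return np.mean((v_pred - v_target) ** 2)   # regression loss on velocity
```

At inference, integrating the learned velocity field from noise at t=0 to t=1 produces one heatmap sample per noise draw.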

Affostruction Framework Overview

Our approach consists of three stages. (1) Generative multi-view reconstruction: DINOv2 features from multiple RGBD views are fused into sparse voxels using depth and camera parameters. A Flow Transformer, conditioned on these multi-view features and trained with stochastic multi-view training, extrapolates complete 3D structure from partial observations, which is decoded via a frozen sparse structure decoder (left). (2) Flow-based affordance grounding: A Sparse Flow Transformer conditioned on a CLIP-encoded text query generates affordance heatmap logits over the reconstructed geometry (center). (3) Affordance-driven active view selection: We select next-best viewpoints by maximizing the visibility of high-affordance regions, using a frozen mesh decoder for surface extraction (right). Together, these stages enable affordance prediction on complete geometry from partial observations, with predicted affordances guiding view selection toward functional regions.
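The sparse voxel fusion of stage (1) can be sketched as below. This is an illustrative NumPy version under assumed conventions (pinhole intrinsics `K`, `cam2world` poses, per-pixel features standing in for DINOv2 outputs), not the paper's implementation. Note that the number of tokens equals the number of occupied voxels, so it stays constant regardless of how many views are fused.

```python
import numpy as np

def unproject(depth, K, cam2world):
    """Back-project a depth map to world-space points, shape (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                  # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)            # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam2world.T)[:, :3]              # transform to world space

def fuse_views(views, voxel_size=0.05):
    """Fuse per-pixel features from multiple RGBD views into sparse voxels.

    views: list of (feat (H,W,C), depth (H,W), K (3,3), cam2world (4,4)).
    Returns a dict mapping voxel index (i,j,k) -> mean feature vector.
    """
    sums, counts = {}, {}
    for feat, depth, K, cam2world in views:
        pts = unproject(depth, K, cam2world)
        idx = np.floor(pts / voxel_size).astype(int)  # quantize to voxel grid
        f = feat.reshape(-1, feat.shape[-1])
        for key, fv in zip(map(tuple, idx), f):
            sums[key] = sums.get(key, 0) + fv
            counts[key] = counts.get(key, 0) + 1
    # token count = number of occupied voxels, independent of view count
    return {k: sums[k] / counts[k] for k in sums}
```

Features from different views that land in the same voxel are averaged, so adding views refines occupied voxels rather than growing the token set.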

3D Reconstruction Results on Toys4k

3D Reconstruction Results

Complete 3D Affordance Grounding Results on Affogato-150K

Complete 3D Affordance Grounding Results

Partial 3D Affordance Grounding Results on Affogato-150K

Partial 3D Affordance Grounding Results

Qualitative Results on Affogato-150K

Progressive Improvement via Active View Selection

Starting from challenging viewpoints where the target areas are barely visible, Affostruction progressively refines geometry and localization through an iterative cycle: (1) generative reconstruction extrapolates complete structure from partial observations, (2) affordance prediction localizes functional regions on the reconstructed geometry, and (3) active view selection targets the most informative viewpoints. Each iteration improves both reconstruction quality and prediction accuracy, revealing the synergy between the two tasks. While only the selected view is shown for clarity, subsequent iterations leverage all accumulated observations through multi-view fusion.
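The view-selection step of this cycle can be sketched as scoring candidate camera positions by how much predicted affordance mass they see. This is a simplified illustration, not the paper's criterion: it uses a front-facing normal test as a crude visibility proxy, whereas a full implementation would handle occlusion on the extracted mesh.

```python
import numpy as np

def select_next_view(points, normals, affordance, candidates):
    """Pick the candidate camera position that sees the most affordance mass.

    points, normals: (N, 3) surface samples with unit normals;
    affordance: (N,) predicted heatmap values;
    candidates: (M, 3) candidate camera positions.
    A point counts as visible when its normal faces the camera.
    """
    best_score, best_idx = -np.inf, -1
    for i, cam in enumerate(candidates):
        to_cam = cam - points
        to_cam /= np.linalg.norm(to_cam, axis=1, keepdims=True)
        visible = (normals * to_cam).sum(axis=1) > 0.0  # front-facing test
        score = affordance[visible].sum()               # affordance-weighted visibility
        if score > best_score:
            best_score, best_idx = score, i
    return best_idx, best_score
```

Because the score is weighted by predicted affordance rather than raw surface coverage, the selected view prioritizes functional regions over merely unseen ones.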

BibTeX


        @inproceedings{park2026affostruction,
          title={Affostruction: 3D Affordance Grounding with Generative Reconstruction},
          author={Park, Chunghyun and Lee, Seunghyeon and Cho, Minsu},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
          year={2026}
        }