Abstract
Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to amortize experience over recurring objects. We observe that many failures stem not from the lack of stronger models but from the lack of a system-level ability to actively acquire and validate evidence under bounded inference cost, where "verification" must rely on relative signals rather than ground-truth labels at test time. To this end, we propose Affordance Agent Harness, a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control, retrieves episodic memories to provide priors for recurring categories, and employs a Router to adaptively select and parameterize skills. Crucially, an affordance-specific Verifier gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, triggering targeted retries when needed before a final judge fuses accumulated evidence and trajectories into the prediction. Experiments on multiple affordance benchmarks and difficulty-controlled subsets demonstrate a superior accuracy–cost Pareto frontier over fixed-pipeline baselines, improving grounding quality while reducing average skill calls and latency.
Method Overview
We propose Affordance Agent Harness (A-Harness), a closed-loop execution framework that unifies heterogeneous skills under a budgeted, evidence-seeking decision process. A-Harness targets three limitations of existing open-world affordance grounding systems: fixed skill pipelines, the lack of closed-loop correction, and reasoning that treats each instance in isolation instead of reusing experience from recurring objects. It introduces adaptive routing, verification-driven retries, and episodic memory, and it improves grounding quality over fixed-pipeline baselines on multiple benchmarks while reducing average skill calls and latency.
Comparison between a prior affordance agent with a fixed reasoning graph and our A-Harness–enabled agent. While prior systems execute skills along a predefined script with late fusion and no commitment gating, A-Harness introduces a context-aware, budgeted closed-loop runtime with adaptive routing, verification-driven retries, and persistent memory for reusable experience.
Method Details
A-Harness consists of four key components:
1. Evidence Store with Provenance
Accumulates heterogeneous skill outputs (boxes, masks, text) tagged with their source and zoom level to enable cross-skill agreement checks.
2. Two-Tier Memory
A Common-sense Bank for stable priors of frequent objects and a Test-time Episodic Bank that accumulates verifier-accepted successful trajectories for online adaptation.
3. Budget-Aware Router
Selects the next skill and its parameters by choosing the action most likely to resolve current uncertainty per unit cost (benefit-cost ratio).
4. Verifier
Sidesteps the absence of ground truth by using relative diagnostics (consistency, stability, sufficiency) to gate commitments and trigger targeted retries.
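To make the Evidence Store and Verifier descriptions above more concrete, here is a minimal Python sketch of provenance-tagged evidence and a gate built from the three relative diagnostics. It is an illustration only, not the released implementation: the class and function names, the box-only evidence, the use of mean pairwise IoU as the agreement signal, and the thresholds `min_items` and `tau` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


@dataclass(frozen=True)
class Evidence:
    skill: str   # provenance: which skill produced this item
    zoom: float  # provenance: zoom level the skill ran at
    box: Box     # region proposed by the skill


@dataclass
class EvidenceStore:
    items: List[Evidence] = field(default_factory=list)

    def add(self, item: Evidence) -> None:
        self.items.append(item)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_agreement(pairs: List[Tuple[Evidence, Evidence]]) -> float:
    """Mean IoU over evidence pairs; 0.0 when there is nothing to compare."""
    if not pairs:
        return 0.0
    return sum(box_iou(a.box, b.box) for a, b in pairs) / len(pairs)


def verify(store: EvidenceStore, min_items: int = 3,
           tau: float = 0.5) -> Tuple[bool, Dict[str, float]]:
    """Gate a commitment using only relative signals (no ground truth).

    min_items and tau are hypothetical thresholds chosen for illustration.
    """
    items = store.items
    pairs = [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]
    # Self-consistency: do different skills agree on the region?
    consistency = mean_agreement([(a, b) for a, b in pairs if a.skill != b.skill])
    # Cross-scale stability: does the region persist across zoom levels?
    stability = mean_agreement([(a, b) for a, b in pairs if a.zoom != b.zoom])
    # Evidence sufficiency: has enough evidence been gathered at all?
    sufficient = len(items) >= min_items
    accept = sufficient and consistency >= tau and stability >= tau
    return accept, {"consistency": consistency, "stability": stability}
```

In this sketch a rejected gate would be the trigger for a targeted retry, for example re-running a skill at another zoom level when `stability` is the failing diagnostic.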
Overview of the A-Harness framework, illustrating iterative decision-making. The Verifier dynamically assesses evidence, guiding the Router to either re-plan or output results, while storing the trajectory in memory. The skill outcome $o_t$ is stored in the evidence store and combined with existing evidence to support the next step.
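The iterative loop in this overview can be sketched as a toy Python program: the Router picks the affordable action with the best benefit-cost ratio, the outcome $o_t$ is appended to the evidence, and the Verifier either accepts or lets the loop continue until the budget runs out. Everything named here (`Action`, `route`, `verifier_accepts`, `run_episode`, the scalar confidence used as $o_t$, and the thresholds) is hypothetical and chosen for illustration. The sketch treats a retry as a different action rather than a re-run of the same one, and it omits the final judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Action:
    name: str                 # a skill plus its parameters, e.g. "segment@zoom2"
    cost: float               # hypothetical positive cost (e.g. seconds or calls)
    expected_gain: float      # hypothetical estimate of uncertainty resolved
    run: Callable[[], float]  # toy outcome o_t: a confidence score in [0, 1]


def route(actions: List[Action], remaining: float, tried: Set[str]) -> Optional[Action]:
    """Pick the untried, affordable action with the best benefit-cost ratio."""
    affordable = [a for a in actions if a.cost <= remaining and a.name not in tried]
    if not affordable:
        return None
    return max(affordable, key=lambda a: a.expected_gain / a.cost)


def verifier_accepts(evidence: List[float], min_items: int = 2,
                     threshold: float = 0.7) -> bool:
    """Toy stand-in for the Verifier: enough evidence with a high mean score."""
    return len(evidence) >= min_items and sum(evidence) / len(evidence) >= threshold


def run_episode(actions: List[Action], budget: float,
                episodic_memory: List[List[str]]) -> Dict[str, object]:
    evidence: List[float] = []   # accumulated skill outcomes
    trajectory: List[str] = []   # names of executed actions, in order
    spent = 0.0
    while True:
        action = route(actions, budget - spent, set(trajectory))
        if action is None:       # budget exhausted or nothing left to try
            break
        o_t = action.run()       # execute the skill and observe its outcome
        spent += action.cost
        evidence.append(o_t)     # store o_t alongside the existing evidence
        trajectory.append(action.name)
        if verifier_accepts(evidence):
            # Only verifier-accepted trajectories are written to memory.
            episodic_memory.append(list(trajectory))
            return {"accepted": True, "trajectory": trajectory, "spent": spent}
    return {"accepted": False, "trajectory": trajectory, "spent": spent}
```

For example, with a budget of 4 and three actions costing 1, 2, and 5 units, the action costing 5 is never affordable, and the loop stops as soon as the toy verifier accepts, so the full budget need not be spent.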
Illustration of heterogeneous skills that generate complementary visual and semantic evidence. Web search can retrieve both textual guidance and paired images when available (i.e., case (2)), enriching the visual context for affordance reasoning.
Quantitative Results
Quantitative results on ReasonAff and UMD datasets
| Model | ReasonAff gIoU | ReasonAff cIoU | ReasonAff $P_{50}$ | ReasonAff $P_{50-95}$ | UMD gIoU | UMD cIoU | UMD $P_{50}$ | UMD $P_{50-95}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLPart | 4.21 | 3.88 | 1.31 | 0.85 | -- | -- | -- | -- |
| OVSeg | 16.52 | 10.59 | 9.89 | 4.12 | -- | -- | -- | -- |
| SAN | 10.21 | 13.45 | 7.18 | 3.17 | -- | -- | -- | -- |
| LISA-7B | 38.17 | 40.58 | 33.62 | 19.69 | 41.90 | 41.23 | 39.65 | 19.33 |
| SAM4MLLM | 45.51 | 33.64 | 43.48 | 22.79 | 12.40 | 8.41 | 4.12 | 0.05 |
| AffordanceLLM | 48.49 | 38.61 | 42.11 | 20.19 | 43.11 | 38.97 | 41.56 | 22.36 |
| InternVL3-8B | 31.79 | 24.68 | 35.41 | 21.93 | -- | -- | -- | -- |
| InternVL3-7B | -- | -- | -- | -- | 30.46 | 28.73 | 18.67 | 9.94 |
| Qwen2.5VL-7B | 25.18 | 20.54 | 26.00 | 15.82 | 33.21 | 29.83 | 25.17 | 10.45 |
| AffordanceVLM | 30.50 | 25.54 | 30.29 | 18.31 | 25.41 | 17.96 | 9.37 | 25.10 |
| Seg-Zero | 59.26 | 48.03 | 61.33 | 45.87 | 44.26 | 39.30 | 39.93 | 16.53 |
| Vision Reasoner | 63.04 | 52.70 | 67.33 | 47.23 | 44.00 | 39.71 | 39.04 | 16.10 |
| Affordance-R1 | 67.41 | 62.72 | 74.50 | 55.22 | 49.85 | 42.24 | 53.35 | 34.08 |
| ConverSeg | 30.11 | 25.08 | 30.50 | 17.02 | 33.27 | 10.37 | 32.63 | 13.59 |
| w/o A-Harness | | | | | | | | |
| Only w/ Det. & Seg. skills | 51.86 | 43.73 | 57.00 | 38.30 | 46.53 | 37.77 | 53.24 | 30.56 |
| Full Fixed Skill Chain | 55.05 | 49.57 | 58.07 | 37.48 | 50.19 | 49.24 | 55.88 | 29.75 |
| w/ A-Harness (Ours) | | | | | | | | |
| w/ Qwen-3.5-397B-A17B | 58.51 | 49.47 | 64.83 | 44.73 | 57.61 | 53.39 | 67.71 | 37.44 |
| w/ Gemini-3-flash | 58.27 | 47.25 | 63.68 | 45.91 | 51.33 | 46.49 | 53.60 | 28.17 |
| w/ GPT-4o | 60.53 | 54.91 | 66.73 | 45.53 | 52.74 | 50.04 | 57.62 | 29.85 |
| w/ GLM-5 | 60.72 | 55.02 | 66.78 | 45.44 | 54.28 | 54.06 | 61.76 | 33.94 |
| w/ Claude-Sonnet-4.6 | 66.48 | 62.82 | 73.19 | 53.38 | 53.72 | 51.15 | 62.50 | 33.31 |
| w/ Claude-Opus-4.6 | 69.68 | 70.88 | 77.50 | 56.35 | 54.94 | 55.04 | 64.67 | 36.80 |
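The page does not define the reported metrics. The Python sketch below assumes the definitions commonly used in reasoning-segmentation work: gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union, $P_{50}$ as the fraction of images with IoU of at least 0.5, and $P_{50-95}$ as that fraction averaged over thresholds 0.50 to 0.95 in steps of 0.05. The function names, the flattened-list mask format, and the treatment of empty masks are illustrative assumptions and may differ from the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = affordance pixel


def inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def image_iou(pred: Mask, gt: Mask) -> float:
    inter, union = inter_union(pred, gt)
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def giou(preds: List[Mask], gts: List[Mask]) -> float:
    """Mean of per-image IoUs."""
    return sum(image_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative intersection over cumulative union across all images."""
    terms = [inter_union(p, g) for p, g in zip(preds, gts)]
    total_inter = sum(i for i, _ in terms)
    total_union = sum(u for _, u in terms)
    return total_inter / total_union if total_union else 1.0


def precision_at(preds: List[Mask], gts: List[Mask], thr: float) -> float:
    """Fraction of images whose IoU reaches the threshold."""
    return sum(image_iou(p, g) >= thr for p, g in zip(preds, gts)) / len(preds)


def precision_50_95(preds: List[Mask], gts: List[Mask]) -> float:
    """Precision averaged over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [k / 100 for k in range(50, 100, 5)]
    return sum(precision_at(preds, gts, t) for t in thresholds) / len(thresholds)
```

These functions return fractions in [0, 1]; the tables report values on a 0 to 100 scale.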
Zero-shot comparison with baselines and ablation on $\mathcal{M}^\text{CS}$
| Method / Setting | 3DOI gIoU | 3DOI cIoU | HANDAL-easy gIoU | HANDAL-easy cIoU | HANDAL-hard gIoU | HANDAL-hard cIoU |
| --- | --- | --- | --- | --- | --- | --- |
| G-DINO | 4.1 | 3.9 | 3.6 | 3.0 | 3.4 | 3.1 |
| LISA | 12.3 | 8.1 | 15.5 | 11.9 | 12.3 | 8.1 |
| GLaMM | 4.4 | 2.9 | 4.7 | 3.5 | 5.0 | 3.5 |
| Vision-Reasoner | 39.6 | 30.3 | 29.6 | 19.8 | 27.7 | 16.7 |
| Affordance-R1 | 39.0 | 33.4 | 43.1 | 38.7 | 40.7 | 37.9 |
| AffordanceVLM | 38.1 | 39.4 | 58.3 | 58.1 | 58.2 | 57.8 |
| A-Harness (w/o $\mathcal{M}^{\text{CS}}$) | 56.5 | 47.2 | 58.4 | 57.6 | 55.3 | 55.1 |
| w/ $\mathcal{M}^\text{CS}$-a | 57.9 | 50.9 | 62.4 | 62.1 | 49.8 | 47.4 |
| w/ $\mathcal{M}^\text{CS}$-b | 63.8 | 52.0 | 61.4 | 63.3 | 60.6 | 61.0 |
| w/ $\mathcal{M}^\text{CS}$-c | 59.3 | 45.4 | 62.7 | 60.4 | 62.8 | 61.7 |
| w/ $\mathcal{M}^\text{CS}$-d | 64.1 | 52.8 | 61.8 | 61.7 | 48.0 | 42.5 |
| w/ $\mathcal{M}^\text{CS}$-e | 61.2 | 51.3 | 59.1 | 61.9 | 57.9 | 56.5 |
| w/ $\mathcal{M}^\text{CS}$-f | 65.6 | 53.7 | 63.5 | 61.8 | 55.2 | 49.7 |
| w/ $\mathcal{M}^\text{CS}$-g | 62.2 | 52.1 | 60.2 | 58.9 | 59.4 | 59.1 |
BibTeX
@misc{huang2026affordanceagentharnessverificationgated,
title={Affordance Agent Harness: Verification-Gated Skill Orchestration},
author={Haojian Huang and Jiahao Shi and Yinchuan Li and Yingcong Chen},
year={2026},
eprint={2605.00663},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.00663},
}