IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks

Card 01 Research Institutions

Aalto University - Intelligent Robotics Group, Department of Electrical Engineering and Automation, Espoo, Finland
University of Oulu - Biomimetics and Intelligent Systems Group, Faculty of Information Technology and Electrical Engineering, Oulu, Finland
Technical University of Denmark - Section of Mechanical Technology, Department of Engineering Technology and Didactics, Denmark

Card 02 Paper Overview

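Proposes the IA-VLA (Input Augmentation for Vision-Language-Action) framework, which uses a large vision-language model (VLM) as a preprocessing stage that generates improved context to augment the VLA's input.
Studies the problem of visual duplicates: objects of the same category that are visually indistinguishable, so the target object has to be specified through spatial relations.
Evaluates on three semantically complex task scenarios (lifting blocks, filling pots, and opening drawers), with 1290 evaluation runs in total.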
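
The preprocessing idea above can be pictured as a wrapper around the policy. The following is a minimal sketch, not the paper's implementation: the `VLAInput` type, the `Augmenter` and `Policy` signatures, the example instruction, and the toy stand-ins are all assumptions made for illustration, and no real VLM or VLA is called.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VLAInput:
    """What the policy receives at each step (hypothetical container)."""
    image: object       # camera frame; type left open in this sketch
    instruction: str    # language instruction


# Hypothetical interfaces: neither signature comes from the paper.
Augmenter = Callable[[VLAInput], VLAInput]       # VLM-based preprocessing stage
Policy = Callable[[VLAInput], Sequence[float]]   # VLA mapping input -> action


def with_input_augmentation(augment: Augmenter, policy: Policy) -> Policy:
    """Wrap a policy so every input passes through the augmenter first."""
    def augmented_policy(raw: VLAInput) -> Sequence[float]:
        return policy(augment(raw))
    return augmented_policy


# Stand-ins so the sketch runs without any model.
def toy_augment(raw: VLAInput) -> VLAInput:
    # Placeholder: a real stage would query a VLM for extra context.
    return VLAInput(raw.image, raw.instruction + " [context: target resolved]")


def toy_policy(inp: VLAInput) -> Sequence[float]:
    # Placeholder action: encodes only whether context was attached.
    return [1.0] if "[context:" in inp.instruction else [0.0]


policy = with_input_augmentation(toy_augment, toy_policy)
action = policy(VLAInput(image=None, instruction="lift the leftmost red block"))
```

The wrapper leaves the policy itself untouched; only its input changes, which is the property the framework's name points to.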

Card 03 Core Contributions

Card 04 Method Description

Card 05 Datasets and Resources

- Lifting blocks: 120 demonstrations, 12 language instructions, up to 6 blocks, 3 colors

- Filling pots: 120 demonstrations, 2-4 visually indistinguishable pots

- Opening drawers: 600 demonstrations, 12 language instructions, 3 rows of drawers
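
For a quick sense of scale, the demonstration counts listed above can be tallied as below. This is only a bookkeeping sketch; the `TaskData` record and its field names are made up for illustration, and the instruction count for the pot task is left as `None` because the summary does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskData:
    name: str
    demonstrations: int
    instructions: Optional[int]  # None where the summary gives no count


TASKS = [
    TaskData("Lifting blocks", 120, 12),
    TaskData("Filling pots", 120, None),
    TaskData("Opening drawers", 600, 12),
]

total_demonstrations = sum(t.demonstrations for t in TASKS)
print(total_demonstrations)  # 840
```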

Card 06 Evaluation and Results

- Category 1: OpenVLA 51% → IA-VLA 73% → IA-VLA-relabeled 76%

- Category 3: OpenVLA 19% → IA-VLA 76% → IA-VLA-relabeled 70%

- Category 3: OpenVLA 20% → IA-VLA 53% → IA-VLA-relabeled 56%

- Category 3: OpenVLA 0% → IA-VLA 3% → IA-VLA-relabeled 68%
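
To read the rows above as gains over the OpenVLA baseline, the differences can be computed in percentage points. The sketch below copies the four rows exactly as listed and keeps them in order, since three of them share the label "Category 3" and this summary does not say which task each belongs to; the `gains` helper is a hypothetical name.

```python
# (label as listed above, OpenVLA %, IA-VLA %, IA-VLA-relabeled %)
ROWS = [
    ("Category 1", 51, 73, 76),
    ("Category 3", 19, 76, 70),
    ("Category 3", 20, 53, 56),
    ("Category 3", 0, 3, 68),
]


def gains(rows):
    """Return (label, IA-VLA gain, IA-VLA-relabeled gain) in percentage points."""
    return [(label, ia - base, rel - base) for label, base, ia, rel in rows]


for label, d_ia, d_rel in gains(ROWS):
    print(f"{label}: IA-VLA {d_ia:+d} pp, IA-VLA-relabeled {d_rel:+d} pp")
```

In the last row, IA-VLA gains only 3 percentage points while the relabeled variant gains 68, the largest gap between the two variants among the listed rows.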