RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception

Iowa State University, New York University
16th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS)

Abstract

Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural-language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve strong performance across downstream tasks, and VLMs often suffer from insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework that improves VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method uses the RL agent to manipulate objects within an indoor setting, creating synthetic fine-tuning data that targets specific vulnerabilities of the VLM. Specifically, the performance of the VLM provides feedback to the RL agent, guiding it to generate informative data that efficiently fine-tunes the VLM on the targeted task (e.g., spatial reasoning). The key contribution of this work is a framework in which the RL agent serves as an informative data-sampling tool that assists the VLM in enhancing performance and addressing task-specific vulnerabilities. By targeting the data-sampling process at the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data gives us precise control over each scene and lets us produce granular ground-truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, demonstrating the benefits of RL-guided data generation in vision-language tasks.

Method

The scheduler acts as a synchronizer between the processes. The loop begins with data generation, where the RL agent takes T timesteps during which T0 ≤ T image-metadata samples are generated from the Unity environment. At each step, the agent receives an intrinsic reward based on the feasibility of the generated sample. The episode is paused on the last step to allow the subsequent processes to complete. Next, the metadata is used in the prompt generation process to create captions describing a spatial relation. The T0 image-text pairs are then passed to the VLM for inference. The performance of the VLM serves as an extrinsic reward signal J2 for the RL agent at the end of the episode. Steps 1, 2, and 3 repeat for E RL episodes to generate a diverse batch of data for fine-tuning. Diversity is further increased by sampling the generated data at rate η for each episode, resulting in a fine-tuning batch of size EηT0. After the VLM fine-tuning finishes, the batch is cleared and the process repeats for the next iteration.
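The loop above can be sketched in plain Python. The callable names (`agent_act`, `env_step`, `make_caption`, `vlm_score`) are illustrative stand-ins for the framework's components, not the authors' implementation, and for simplicity every step is assumed to yield a feasible sample (T0 = T).

```python
import random

def rls3_iteration(agent_act, env_step, make_caption, vlm_score,
                   num_episodes, steps_per_episode, eta, seed=0):
    """One RLS3 fine-tuning iteration, as a plain-Python sketch.

    agent_act()               -> action from the RL policy (SAC in the paper)
    env_step(action)          -> (image, metadata, feasible) from the simulator
    make_caption(metadata)    -> angle-based spatial-relation caption
    vlm_score(image, caption) -> VLM performance on one image-text pair
    """
    rng = random.Random(seed)
    batch, episode_rewards = [], []
    for _ in range(num_episodes):
        episode, intrinsic = [], 0.0
        for _ in range(steps_per_episode):
            image, metadata, feasible = env_step(agent_act())
            intrinsic += float(feasible)  # per-step feasibility reward
            episode.append((image, make_caption(metadata)))
        # End-of-episode extrinsic signal J2: poor VLM performance means the
        # episode produced informative (hard) samples, so the score is negated.
        j2 = -sum(vlm_score(img, cap) for img, cap in episode) / len(episode)
        episode_rewards.append(intrinsic + j2)
        # Subsample each episode at rate eta; final batch size is E * eta * T0.
        batch.extend(rng.sample(episode, int(eta * len(episode))))
    return batch, episode_rewards
```

After each call, the returned batch would be handed to the VLM fine-tuning step and then cleared before the next iteration.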


A Detailed Overview of the RLS3 Framework

Unity Environment

There is one active scene at a time, which is cycled over the course of the episode. At each step, one of the three active objects is selected and swapped with another object from the object container. Only a handful of the available objects are shown here.
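The swap action can be illustrated with a minimal sketch; the function name and the list-based scene state are hypothetical simplifications of the Unity environment, not the authors' code.

```python
def swap_object(active, container, slot_idx, container_idx):
    """Swap the active object at `slot_idx` with `container[container_idx]`,
    returning the updated scene state (exactly three objects stay active)."""
    active, container = list(active), list(container)
    active[slot_idx], container[container_idx] = (
        container[container_idx], active[slot_idx])
    return active, container
```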


Angle-based Prompt Generation

The center of 'Object A' is located at the origin of eight regions in the horizontal direction and three regions in the vertical direction, from which spatial terms are selected. The horizontal regions are aligned so that 'behind' faces toward the camera. In this figure, the cameras are aligned with the horizontal and vertical axes to show the regions more clearly. The generated caption for this scenario is: "The small pot is above, behind, and to the left of the yellow bowl."
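One way to realize this binning is to map the displacement from Object A to Object B into one of eight 45-degree horizontal sectors and three vertical regions. The sector boundaries, term phrasings, axis convention (+z toward the camera for 'behind'), and the `vertical_margin` threshold below are assumptions for illustration, not the authors' exact implementation.

```python
import math

# Eight horizontal spatial terms, one per 45-degree sector around Object A.
HORIZONTAL_TERMS = [
    "behind", "behind and to the right of", "to the right of",
    "in front of and to the right of", "in front of",
    "in front of and to the left of", "to the left of",
    "behind and to the left of",
]

def spatial_terms(dx, dy, dz, vertical_margin=0.1):
    """Return (horizontal term, vertical term) for Object B relative to A.

    dx, dz: horizontal displacement (assumed +z points toward the camera,
    so +z maps to 'behind'); dy: vertical displacement.
    """
    # Horizontal: bin the angle into eight sectors centered on the terms.
    angle = math.degrees(math.atan2(dx, dz)) % 360.0
    sector = int(((angle + 22.5) % 360.0) // 45.0)
    horizontal = HORIZONTAL_TERMS[sector]
    # Vertical: three regions (above / level with / below) with a margin.
    if dy > vertical_margin:
        vertical = "above"
    elif dy < -vertical_margin:
        vertical = "below"
    else:
        vertical = "level with"
    return horizontal, vertical
```

For the pictured scenario, a displacement that is up, back, and to the left would yield ("behind and to the left of", "above"), matching the example caption.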


Results

Efficient data generation

PaliGemma Model

PaliGemma score (avg±std) on testing data vs cumulative generated data across 5 runs for both SAC and random agents.


CLIP accuracy (avg±std) on testing data vs cumulative generated data across 3 runs for both SAC and random agents.

Spatial Reasoning Performance


PaliGemma score (avg±std) separated by spatial term for RLS3 with an SAC and a random agent. Cumulative term counts of data generated for fine-tuning are given in () for the SAC agent and [] for the random agent.


Average PaliGemma score by prompt complexity vs iteration for the SAC agent.

Dynamics of VLM Fine-Tuning


Concatenated PaliGemma loss plots for iterative fine-tuning.


PaliGemma score (avg±std) on validation data vs fine-tuning iteration across 5 runs for the SAC agent, with the early stopping point indicated.

Acknowledgements

This work was partly supported by the National Science Foundation, USA, under grants CAREER CNS-1845969 and CPS Frontier CNS-1954556.

BibTeX


       @inproceedings{10.1145/3716550.3722033,
        author = {Waite, Joshua R. and Hasan, Md Zahid and Liu, Qisai and Jiang, Zhanhong and Hegde, Chinmay and Sarkar, Soumik},
        title = {RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception},
        year = {2025},
        isbn = {9798400714986},
        publisher = {Association for Computing Machinery},
        address = {New York, NY, USA},
        url = {https://doi.org/10.1145/3716550.3722033},
        doi = {10.1145/3716550.3722033},
        booktitle = {Proceedings of the ACM/IEEE 16th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2025)},
        articleno = {28},
        numpages = {10},
        keywords = {self-improving sampling, spatial reasoning, synthetic data generation, vision-language models},
        location = {Irvine, CA, USA},
        series = {ICCPS '25}
        }