RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception

Iowa State University, New York University
16th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS)

Abstract

Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural-language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve strong performance across downstream tasks, and VLMs often suffer from insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework that improves VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method uses the RL agent to manipulate objects within an indoor setting, creating synthetic fine-tuning data that targets specific vulnerabilities of the VLM. Specifically, the performance of the VLM provides feedback to the RL agent, guiding it to generate informative data that efficiently fine-tunes the VLM on the targeted task (e.g., spatial reasoning). The key contribution of this work is a framework in which the RL agent serves as an informative data-sampling tool that assists the VLM in enhancing performance and addressing task-specific vulnerabilities. By targeting the data-sampling process at the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data gives us precise control over each scene and lets us produce granular ground-truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, demonstrating the benefits of RL-guided data generation in vision-language tasks.

Method

The scheduler acts as a synchronizer between the processes. The loop begins with data generation, where the RL agent takes T timesteps during which T0 ≤ T image-metadata samples are generated from the Unity environment. At each step, the agent receives an intrinsic reward based on the feasibility of the generated sample. The episode is paused on the last step to allow the subsequent processes to complete. Next, the metadata is used in the prompt generation process to create captions describing a spatial relation. The T0 image-text pairs are then passed to the VLM for inference. The performance of the VLM serves as an extrinsic reward signal J2 for the RL agent at the end of the episode. Steps 1, 2, and 3 repeat for E RL episodes to generate a diverse batch of data for fine-tuning. Diversity is further increased by sampling the generated data at rate η for each episode, resulting in a fine-tuning batch of size EηT0. After the VLM fine-tuning finishes, the batch is cleared and the process repeats for the next iteration.
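The loop above can be sketched in plain Python. The callable names (`agent_act`, `env_step`, `make_caption`, `vlm_score`) are illustrative stand-ins for the framework's components, not the authors' implementation, and for simplicity every step is assumed to yield a feasible sample (T0 = T).

```python
import random

def rls3_iteration(agent_act, env_step, make_caption, vlm_score,
                   num_episodes, steps_per_episode, eta, seed=0):
    """One RLS3 fine-tuning iteration, as a plain-Python sketch.

    agent_act()               -> action from the RL policy (SAC in the paper)
    env_step(action)          -> (image, metadata, feasible) from the simulator
    make_caption(metadata)    -> angle-based spatial-relation caption
    vlm_score(image, caption) -> VLM performance on one image-text pair
    """
    rng = random.Random(seed)
    batch, episode_rewards = [], []
    for _ in range(num_episodes):
        episode, intrinsic = [], 0.0
        for _ in range(steps_per_episode):
            image, metadata, feasible = env_step(agent_act())
            intrinsic += float(feasible)  # per-step feasibility reward
            episode.append((image, make_caption(metadata)))
        # End-of-episode extrinsic signal J2: poor VLM performance means the
        # episode produced informative (hard) samples, so the score is negated.
        j2 = -sum(vlm_score(img, cap) for img, cap in episode) / len(episode)
        episode_rewards.append(intrinsic + j2)
        # Subsample each episode at rate eta; final batch size is E * eta * T0.
        batch.extend(rng.sample(episode, int(eta * len(episode))))
    return batch, episode_rewards
```

After each call, the returned batch would be handed to the VLM fine-tuning step and then cleared before the next iteration.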


A Detailed Overview of the RLS3 Framework

Unity Environment

There is one active scene at a time, which is cycled over the course of the episode. At each step, one of the three active objects is selected and swapped with another object from the object container. Only a handful of the available objects are shown here.
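The swap action can be illustrated with a minimal sketch; the function name and the list-based scene state are hypothetical simplifications of the Unity environment, not the authors' code.

```python
def swap_object(active, container, slot_idx, container_idx):
    """Swap the active object at `slot_idx` with `container[container_idx]`,
    returning the updated scene state (exactly three objects stay active)."""
    active, container = list(active), list(container)
    active[slot_idx], container[container_idx] = (
        container[container_idx], active[slot_idx])
    return active, container
```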


Angle-based Prompt Generation

The center of 'Object A' is located at the origin of eight regions in the horizontal direction and three regions in the vertical direction, from which spatial terms are selected. The horizontal regions are aligned so that 'behind' faces toward the camera. In this figure, the cameras are aligned with the horizontal and vertical axes to show the regions more clearly. The generated caption for this scenario is: "The small pot is above, behind, and to the left of the yellow bowl."
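One way to realize this binning is to map the displacement from Object A to Object B into one of eight 45-degree horizontal sectors and three vertical regions. The sector boundaries, term phrasings, axis convention (+z toward the camera for 'behind'), and the `vertical_margin` threshold below are assumptions for illustration, not the authors' exact implementation.

```python
import math

# Eight horizontal spatial terms, one per 45-degree sector around Object A.
HORIZONTAL_TERMS = [
    "behind", "behind and to the right of", "to the right of",
    "in front of and to the right of", "in front of",
    "in front of and to the left of", "to the left of",
    "behind and to the left of",
]

def spatial_terms(dx, dy, dz, vertical_margin=0.1):
    """Return (horizontal term, vertical term) for Object B relative to A.

    dx, dz: horizontal displacement (assumed +z points toward the camera,
    so +z maps to 'behind'); dy: vertical displacement.
    """
    # Horizontal: bin the angle into eight sectors centered on the terms.
    angle = math.degrees(math.atan2(dx, dz)) % 360.0
    sector = int(((angle + 22.5) % 360.0) // 45.0)
    horizontal = HORIZONTAL_TERMS[sector]
    # Vertical: three regions (above / level with / below) with a margin.
    if dy > vertical_margin:
        vertical = "above"
    elif dy < -vertical_margin:
        vertical = "below"
    else:
        vertical = "level with"
    return horizontal, vertical
```

For the pictured scenario, a displacement that is up, back, and to the left would yield ("behind and to the left of", "above"), matching the example caption.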


Results

Efficient data generation

PaliGemma Model

PaliGemma score (avg±std) on testing data vs cumulative generated data across 5 runs for both SAC and random agents.


CLIP accuracy (avg±std) on testing data vs cumulative generated data across 3 runs for both SAC and random agents.

Spatial Reasoning Performance


PaliGemma score (avg±std) separated by spatial term for RLS3 with an SAC and a random agent. Cumulative term counts of data generated for fine-tuning are given in () for the SAC agent and [] for the random agent.


Average PaliGemma score by prompt complexity vs iteration for the SAC agent.

Dynamics of VLM Fine-Tuning


Concatenated PaliGemma loss plots for iterative fine-tuning.


PaliGemma score (avg±std) on validation data vs fine-tuning iteration across 5 runs for the SAC agent, with the early stopping point indicated.

Acknowledgements

This work was partly supported by the National Science Foundation, USA, under grants CAREER CNS-1845969 and CPS Frontier CNS-1954556.

BibTeX


       @inproceedings{10.1145/3716550.3722033,
        author = {Waite, Joshua R. and Hasan, Md Zahid and Liu, Qisai and Jiang, Zhanhong and Hegde, Chinmay and Sarkar, Soumik},
        title = {RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception},
        year = {2025},
        isbn = {9798400714986},
        publisher = {Association for Computing Machinery},
        address = {New York, NY, USA},
        url = {https://doi.org/10.1145/3716550.3722033},
        doi = {10.1145/3716550.3722033},
        booktitle = {Proceedings of the ACM/IEEE 16th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2025)},
        articleno = {28},
        numpages = {10},
        keywords = {self-improving sampling, spatial reasoning, synthetic data generation, vision-language models},
        location = {Irvine, CA, USA},
        series = {ICCPS '25}
        }