The scheduler acts as a synchronizer between the processes. The loop begins with data generation: the RL agent takes T timesteps, during which T0 ≤ T image-metadata samples are generated from the Unity environment. At each step, the agent receives an intrinsic reward based on the feasibility of the generated sample. The episode is paused on the last step so that the subsequent processes can complete. Next, the metadata is used in the prompt generation process to create captions describing a spatial relation. The T0 image-text pairs are then fed to the VLM for inference, and the VLM's performance serves as an extrinsic reward signal J2 for the RL agent at the end of the episode. Steps 1, 2, and 3 repeat for E RL episodes to generate a diverse batch of data for fine-tuning. Diversity is further increased by sampling the generated data of each episode with sampling rate η, resulting in a fine-tuning batch of size EηT0. Once fine-tuning of the VLM finishes, the batch is cleared and the process repeats for the next iteration.
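The scheduling loop described above can be sketched as follows. This is a minimal illustrative sketch, not the actual implementation: the function names (`run_episode`, `evaluate_vlm`, `fine_tune_vlm`), the feasibility simulation, and the placeholder captions are all assumptions standing in for the Unity environment, the prompt generator, and the VLM.

```python
import random

def run_episode(T, keep_prob=0.8):
    """Hypothetical data-generation episode: the agent takes T timesteps;
    only feasible steps (simulated here with keep_prob) yield samples,
    so T0 <= T image-metadata pairs are produced."""
    samples = []
    for t in range(T):
        feasible = random.random() < keep_prob
        intrinsic_reward = 1.0 if feasible else 0.0  # per-step intrinsic reward
        if feasible:
            samples.append({"image": f"img_{t}", "metadata": {"relation": "left of"}})
    return samples

def evaluate_vlm(pairs):
    """Stand-in for VLM inference; returns a scalar score used as the
    extrinsic reward J2 at the end of the episode."""
    return 0.5  # placeholder performance score

def fine_tune_vlm(batch):
    """Stand-in for the fine-tuning process."""
    pass

def scheduler_iteration(E, T, eta):
    """One outer iteration: E episodes of generation -> captioning ->
    VLM inference, sub-sampled at rate eta into a batch of ~E*eta*T0 pairs."""
    batch = []
    for _ in range(E):
        samples = run_episode(T)                      # step 1: data generation
        pairs = [(s["image"], f"caption: {s['metadata']['relation']}")
                 for s in samples]                    # step 2: prompt generation
        extrinsic_reward = evaluate_vlm(pairs)        # step 3: inference -> J2
        if pairs:
            k = max(1, int(eta * len(pairs)))         # sample eta * T0 pairs
            batch.extend(random.sample(pairs, k))
    fine_tune_vlm(batch)                              # fine-tune on ~E*eta*T0 pairs
    return batch                                      # cleared before the next iteration
```

The sub-sampling step (`random.sample` at rate η) is what keeps the fine-tuning batch diverse across episodes rather than dominated by any single episode's samples.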