To advance autonomous dexterous manipulation, we propose a hybrid control method that combines the complementary advantages of a fine-tuned Vision-Language-Action (VLA) model and a diffusion model.
The VLA model provides language-commanded high-level planning, which is highly generalizable, while the diffusion model handles low-level interactions, offering the precision and robustness required for specific objects and environments. By incorporating a switching signal into the training data, we enable event-based transitions between the two models for a pick-and-place task in which the target object and placement location are commanded through language. The approach is deployed on our anthropomorphic ADAPT Hand 2, a 13-DoF robotic hand that incorporates compliance through series elastic actuation, making its interactions resilient; this demonstrates the first use of a multi-fingered hand controlled with a VLA model.
We demonstrate that this model-switching approach achieves a success rate of over 80%, compared to under 40% when using the VLA model alone, enabled by accurate near-object arm motion from the VLA model and multi-modal grasping motions with error-recovery abilities from the diffusion model.
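To make the event-based hand-over between the two models concrete, a minimal control-loop sketch follows. The function and variable names are illustrative, and the assumption that each policy predicts the switching signal σ alongside its action (so that the signal learned from the annotated training data triggers the transition) is ours, not a description of the released implementation.

```python
def hybrid_control_step(vla_policy, diffusion_policy, obs, instruction, sigma):
    """Run one control step, routing the observation to the active model.

    Assumption: both policies return (action, sigma_pred), where sigma_pred is
    the event signal learned from the training data. sigma toggles when the
    active model signals a key moment (e.g. near-object reach or grasp
    completion), handing control between the two models.
    """
    if sigma:
        # Grasping phase: the diffusion policy provides precise, multi-modal
        # local interaction with error recovery.
        action, sigma_pred = diffusion_policy(obs)
    else:
        # Approach / transport phase: the VLA model provides
        # language-commanded high-level arm motion.
        action, sigma_pred = vla_policy(obs, instruction)

    # Toggle the phase when the active model emits the event signal.
    if sigma_pred > 0.5:
        sigma = not sigma
    return action, sigma
```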
(a) Combined VLA and diffusion policy approach for dexterous manipulation, which uses an event signal σ to transition between the two models, enabling text input to be translated into hand and wrist commands for an anthropomorphic manipulator. (b) Depiction of the concept of switching between the VLA and diffusion models using a common event signal σ that tracks key moments in the pick-and-place task.
(a) Left) Robot setup for gathering training data through teleoperation, showing the use of the Vision Pro and the locations of the two cameras used for capturing training data. Right) The test objects and environment used for data capture and testing. (b) ADAPT Hand 2, highlighting the soft continuous skin, compliant series elastic finger joints, and the anatomically driven design.
The VLA model is fine-tuned from the pre-trained OpenVLA model on the dataset excluding the grasping periods. Instead of using images from a single camera, as in the original OpenVLA setup, we combine the two images from cam1 and cam2. The two images are resized to 224x144 and 224x80, respectively, and then vertically concatenated into a single image. Fine-tuning continues until the training action accuracy exceeds 95% and converges, and is run on a cluster virtual machine with a single A100-80GB GPU. The diffusion policy model is trained on the collected grasping demonstrations for 1500 epochs on a local GPU.
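A minimal sketch of the two-camera preprocessing described above is given below. The function name, interpolation mode, and camera ordering are assumptions; the resize targets come from the text, and the resulting 224x224 composite matches the input resolution expected by the OpenVLA backbone.

```python
import numpy as np
from PIL import Image

def build_composite_observation(cam1: np.ndarray, cam2: np.ndarray) -> np.ndarray:
    """Resize the two camera views and stack them vertically.

    cam1 is resized to 224x144 and cam2 to 224x80, so the vertically
    concatenated image is 224x224, a single frame fed to the VLA model.
    """
    top = Image.fromarray(cam1).resize((224, 144))      # PIL resize takes (width, height)
    bottom = Image.fromarray(cam2).resize((224, 80))
    composite = np.vstack([np.asarray(top), np.asarray(bottom)])  # shape (224, 224, 3)
    return composite
```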
Example language commands: "Put Red Pepper on Yellow Plate"; "Put Tape on Purple Plate".