Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

Accepted to the AAAI Conference on Artificial Intelligence (AAAI 2026)
Xiuxiu Qi1,2, Yu Yang3, Jiannong Cao2, Luyao Bai2, Chongshan Fan1, Chengtai Cao4, Hongpeng Wang1
1Nankai University, 2The Hong Kong Polytechnic University,
3The Education University of Hong Kong, 4City University of Hong Kong

Abstract

Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representations, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (i.e., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations via a bidirectional cross-attention mechanism that learns contextual information for action generation, overcoming semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to a 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL’s generalization under unseen and noisy object states.

Methodology

Motivation

Conventional Behavioral Cloning (BC) approaches suffer from two key multimodal grounding challenges that lead to inaccurate action cloning and incoherent execution.

  1. Physical Discontinuities

    Decoupled, per-step predictions fail to account for underlying motion dynamics. This leads to jerky trajectories (e.g., high-jerk robotic arm movements) and kinematically invalid transitions, ultimately causing failures in long-horizon tasks.

  2. Semantic-Physical Misalignment

    Static fusion methods fail to dynamically align language instructions with changing visuomotor states. For example, when executing "place the cup on the shelf," the robot must shift its attention from the "cup" (during grasping) to the "shelf" (during placement).

Overall Architecture

To address these challenges, we present Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment (CCoL). Our framework introduces two novel components: Multimodal Continuous Co-Learning (MCC) and Cross-modal Semantic-Physical Alignment (CSA). MCC leverages dynamic proprioceptive modeling to capture temporal evolution and maps multimodal representations into a shared latent space. CSA synchronizes semantic information across modalities at each step. Both components are built upon context-aware representation learning, which encodes the fundamental multimodal inputs, and the fused, enriched representations are then used to generate contextually relevant and physically feasible action sequences.

Fig. 1: Overview of the CCoL framework. MCC leverages dynamic proprioceptive modeling to capture temporal evolution and maps multimodal inputs into a shared latent space (purple frame). CSA synchronizes semantic information across modalities at each step (red frame), enabling the generation of contextually relevant and physically feasible action sequences.

  1. Multimodal Continuous Co-Learning (MCC)

    To address physical discontinuities, MCC employs Neural Ordinary Differential Equations (Neural ODEs). Instead of modeling discrete, fragmented states, MCC captures the continuous evolution of proprioceptive embeddings by solving an initial value problem defined by a learned differential equation.

    This yields temporally consistent representations that mitigate the fragmentation and discontinuities of conventional encoders, producing the robust, smooth action trajectories seen in our experiments (a minimal sketch of this continuous integration appears after this list).

  2. Cross-modal Semantic-Physical Alignment (CSA)

    To solve semantic-physical misalignment, CSA introduces a bidirectional cross-attention mechanism. This mechanism dynamically anchors high-level linguistic concepts (e.g., "cube", "socket") to the robot's low-level visuomotor representations at each specific timestep.

    Fig. 2: Illustration of the attentive attribute map. Observations, task goals (text instructions), and robot proprioceptive data are tokenized into multimodal embeddings.

    Attention scores are computed separately for noun words (object-focused), verb phrases (action-focused), and proprioceptive states, producing heatmaps that highlight relevant visual regions and trajectory features. Summing these heatmaps across layers yields attentive attribution maps that ground semantics in visuomotor control. The fused features are computed by attending from language to the visuomotor context and vice versa, ensuring a precise semantic-to-physical correspondence at each step of the task (see the cross-attention sketch below).
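
To make the MCC idea concrete, here is a minimal, self-contained sketch of continuous proprioceptive modeling with a Neural ODE in PyTorch. It is illustrative rather than the paper's implementation: the embedding size, the MLP dynamics function, and the fixed-step RK4 integrator are assumptions chosen for brevity, and CCoL's actual architecture, solver, and training objective may differ.

# Illustrative sketch only (not the official CCoL code). Assumes PyTorch;
# the 128-dim embedding, MLP dynamics, and fixed-step RK4 are arbitrary choices.
import torch
import torch.nn as nn


class ProprioODEFunc(nn.Module):
    """Parameterizes the dynamics dz/dt = f_theta(z, t) of a proprioceptive embedding z."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.Tanh(), nn.Linear(256, dim)
        )

    def forward(self, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Concatenate the scalar time to each state so the dynamics are time-aware.
        t_feat = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_feat], dim=-1))


def rk4_step(f, t, z, dt):
    """One classical Runge-Kutta (RK4) step for dz/dt = f(t, z)."""
    k1 = f(t, z)
    k2 = f(t + dt / 2, z + dt / 2 * k1)
    k3 = f(t + dt / 2, z + dt / 2 * k2)
    k4 = f(t + dt, z + dt * k3)
    return z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate_proprio(z0, func, steps=10):
    """Solve the initial value problem z(0) = z0 over the unit time interval.

    Returns the embedding at every intermediate time, i.e. a temporally
    consistent trajectory instead of fragmented per-step codes.
    """
    dt = 1.0 / steps
    traj, z = [z0], z0
    for i in range(steps):
        z = rk4_step(func, torch.tensor(i * dt), z, dt)
        traj.append(z)
    return torch.stack(traj, dim=1)  # (batch, steps + 1, dim)


# Usage: encode the current proprioceptive state, then roll it forward in time.
z0 = torch.randn(4, 128)                          # batch of initial embeddings
trajectory = integrate_proprio(z0, ProprioODEFunc(128))
print(trajectory.shape)                           # torch.Size([4, 11, 128])

In this sketch, the resulting trajectory of embeddings would be consumed downstream alongside the vision and language features, mirroring how MCC feeds temporally consistent representations into action generation.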

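Similarly, the following sketch illustrates a bidirectional cross-attention of the kind CSA describes, where language tokens attend to visuomotor tokens and vice versa. It is a simplified stand-in, not the paper's implementation: the token dimensions, the use of nn.MultiheadAttention, and the mean-pooled fusion are assumptions made to keep the example short.

# Illustrative sketch only (not the official CCoL code). Assumes PyTorch;
# dimensions, pooling, and the fusion layer are arbitrary simplifications.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Language attends to visuomotor tokens and visuomotor tokens attend back,
    producing fused features aligned at the current timestep."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lang: torch.Tensor, vis: torch.Tensor):
        # lang: (batch, n_text_tokens, dim), e.g. "place the cup on the shelf"
        # vis:  (batch, n_visuomotor_tokens, dim), image patches + proprio tokens
        lang_ctx, l2v_weights = self.lang_to_vis(query=lang, key=vis, value=vis)
        vis_ctx, v2l_weights = self.vis_to_lang(query=vis, key=lang, value=lang)

        # Pool each direction and concatenate for a downstream action head.
        fused = self.fuse(torch.cat([lang_ctx.mean(dim=1),
                                     vis_ctx.mean(dim=1)], dim=-1))
        # l2v_weights (batch, n_text_tokens, n_visuomotor_tokens) can be reshaped
        # over image patches to visualize which regions a noun such as "cup"
        # attends to at the current timestep, as in the attribute maps above.
        return fused, l2v_weights, v2l_weights


# Usage with dummy tokens standing in for the encoded instruction and observation.
csa = BidirectionalCrossAttention(dim=256)
lang_tokens = torch.randn(2, 6, 256)   # tokenized instruction embeddings
vis_tokens = torch.randn(2, 50, 256)   # visual patches + proprioceptive tokens
fused, l2v, v2l = csa(lang_tokens, vis_tokens)
print(fused.shape, l2v.shape)          # torch.Size([2, 256]) torch.Size([2, 6, 50])
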
Videos

Watch CCoL in action. We show results in simulation and on a real 7-DoF robot.

CCoL vs Baselines in Simulation

BibTeX

@inproceedings{qi2026ccol,
  title={Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning},
  author={Qi, Xiuxiu and Yang, Yu and Cao, Jiannong and Bai, Luyao and Fan, Chongshan and Cao, Chengtai and Wang, Hongpeng},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}