Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

Accepted to the AAAI Conference on Artificial Intelligence (AAAI 2026)
Xiuxiu Qi1,2, Yu Yang3, Jiannong Cao2, Luyao Bai2, Chongshan Fan1, Chengtai Cao4, Hongpeng Wang1
1Nankai University, 2The Hong Kong Polytechnic University,
3The Education University of Hong Kong, 4City University of Hong Kong

Abstract

Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representations, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (i.e., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations via a bidirectional cross-attention mechanism that learns contextual information for action generation, overcoming semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to a 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL’s generalization under unseen and noisy object states.

Methodology

Motivation

Conventional Behavioral Cloning (BC) approaches suffer from two key multimodal grounding challenges that lead to inaccurate action cloning and incoherent execution.

  1. Physical Discontinuities

    Decoupled, per-step predictions fail to account for underlying motion dynamics. This leads to jerky trajectories (e.g., high-jerk robotic arm movements) and kinematically invalid transitions, ultimately causing failures in long-horizon tasks.

  2. Semantic-Physical Misalignment

    Static fusion methods fail to dynamically align language instructions with changing visuomotor states. For example, when executing "place the cup on the shelf," the robot must shift its attention from the "cup" (during grasping) to the "shelf" (during placement).

Overall Architecture

To address these challenges, we present Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment (CCoL). Our framework introduces two novel components: Multimodal Continuous Co-Learning (MCC) and Cross-modal Semantic-Physical Alignment (CSA). MCC leverages dynamic proprioceptive modeling to capture temporal evolution and maps multimodal representations into a shared latent space. CSA synchronizes semantic information across modalities at each step. Both components are built upon context-aware representation learning, which encodes the fundamental multimodal inputs, and the fused, enriched representations are then used to generate contextually relevant and physically feasible action sequences.

Fig. 1: Overview of the CCoL framework. MCC leverages dynamic proprioceptive modeling to capture temporal evolution and maps multimodal inputs into a shared latent space (purple frame). CSA synchronizes semantic information across modalities at each step (red frame), enabling the generation of contextually relevant and physically feasible action sequences.

  1. Multimodal Continuous Co-Learning (MCC)

    To address physical discontinuities, MCC employs Neural Ordinary Differential Equations (Neural ODEs). Instead of modeling discrete, fragmented states, MCC captures the continuous evolution of proprioceptive embeddings by solving an initial value problem defined by a learned differential equation.

    This yields temporally consistent representations that mitigate the fragmentation and discontinuities of conventional encoders, producing the robust, smooth action trajectories seen in our experiments (a minimal sketch of this continuous integration appears after this list).

  2. Cross-modal Semantic-Physical Alignment (CSA)

    To solve semantic-physical misalignment, CSA introduces a bidirectional cross-attention mechanism. This mechanism dynamically anchors high-level linguistic concepts (e.g., "cube", "socket") to the robot's low-level visuomotor representations at each specific timestep.

    Fig. 2: Illustration of the attentive attribute map. Observations, task goals (text instructions), and robot proprioceptive data are tokenized into multimodal embeddings.

    Attention scores are computed separately for noun words (object-focused), verb phrases (action-focused), and proprioceptive states, producing heatmaps that highlight relevant visual regions and trajectory features. Summing these heatmaps across layers yields attentive attribution maps that ground semantics in visuomotor control. The fused features are computed by attending from language to the visuomotor context and vice versa, ensuring a precise semantic-to-physical correspondence at each step of the task (see the cross-attention sketch below).
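
To make the MCC idea concrete, here is a minimal, self-contained sketch of continuous proprioceptive modeling with a Neural ODE in PyTorch. It is illustrative rather than the paper's implementation: the embedding size, the MLP dynamics function, and the fixed-step RK4 integrator are assumptions chosen for brevity, and CCoL's actual architecture, solver, and training objective may differ.

# Illustrative sketch only (not the official CCoL code). Assumes PyTorch;
# the 128-dim embedding, MLP dynamics, and fixed-step RK4 are arbitrary choices.
import torch
import torch.nn as nn


class ProprioODEFunc(nn.Module):
    """Parameterizes the dynamics dz/dt = f_theta(z, t) of a proprioceptive embedding z."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.Tanh(), nn.Linear(256, dim)
        )

    def forward(self, t: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Concatenate the scalar time to each state so the dynamics are time-aware.
        t_feat = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_feat], dim=-1))


def rk4_step(f, t, z, dt):
    """One classical Runge-Kutta (RK4) step for dz/dt = f(t, z)."""
    k1 = f(t, z)
    k2 = f(t + dt / 2, z + dt / 2 * k1)
    k3 = f(t + dt / 2, z + dt / 2 * k2)
    k4 = f(t + dt, z + dt * k3)
    return z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate_proprio(z0, func, steps=10):
    """Solve the initial value problem z(0) = z0 over the unit time interval.

    Returns the embedding at every intermediate time, i.e. a temporally
    consistent trajectory instead of fragmented per-step codes.
    """
    dt = 1.0 / steps
    traj, z = [z0], z0
    for i in range(steps):
        z = rk4_step(func, torch.tensor(i * dt), z, dt)
        traj.append(z)
    return torch.stack(traj, dim=1)  # (batch, steps + 1, dim)


# Usage: encode the current proprioceptive state, then roll it forward in time.
z0 = torch.randn(4, 128)                          # batch of initial embeddings
trajectory = integrate_proprio(z0, ProprioODEFunc(128))
print(trajectory.shape)                           # torch.Size([4, 11, 128])

In this sketch, the resulting trajectory of embeddings would be consumed downstream alongside the vision and language features, mirroring how MCC feeds temporally consistent representations into action generation.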

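Similarly, the following sketch illustrates a bidirectional cross-attention of the kind CSA describes, where language tokens attend to visuomotor tokens and vice versa. It is a simplified stand-in, not the paper's implementation: the token dimensions, the use of nn.MultiheadAttention, and the mean-pooled fusion are assumptions made to keep the example short.

# Illustrative sketch only (not the official CCoL code). Assumes PyTorch;
# dimensions, pooling, and the fusion layer are arbitrary simplifications.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Language attends to visuomotor tokens and visuomotor tokens attend back,
    producing fused features aligned at the current timestep."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lang: torch.Tensor, vis: torch.Tensor):
        # lang: (batch, n_text_tokens, dim), e.g. "place the cup on the shelf"
        # vis:  (batch, n_visuomotor_tokens, dim), image patches + proprio tokens
        lang_ctx, l2v_weights = self.lang_to_vis(query=lang, key=vis, value=vis)
        vis_ctx, v2l_weights = self.vis_to_lang(query=vis, key=lang, value=lang)

        # Pool each direction and concatenate for a downstream action head.
        fused = self.fuse(torch.cat([lang_ctx.mean(dim=1),
                                     vis_ctx.mean(dim=1)], dim=-1))
        # l2v_weights (batch, n_text_tokens, n_visuomotor_tokens) can be reshaped
        # over image patches to visualize which regions a noun such as "cup"
        # attends to at the current timestep, as in the attribute maps above.
        return fused, l2v_weights, v2l_weights


# Usage with dummy tokens standing in for the encoded instruction and observation.
csa = BidirectionalCrossAttention(dim=256)
lang_tokens = torch.randn(2, 6, 256)   # tokenized instruction embeddings
vis_tokens = torch.randn(2, 50, 256)   # visual patches + proprioceptive tokens
fused, l2v, v2l = csa(lang_tokens, vis_tokens)
print(fused.shape, l2v.shape)          # torch.Size([2, 256]) torch.Size([2, 6, 50])
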
Videos

Watch CCoL in action. We show results in simulation and on a real 7-DoF robot.

CCoL vs Baselines in Simulation

BibTeX

@inproceedings{qi2026ccol,
  title={Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning},
  author={Qi, Xiuxiu and Yang, Yu and Cao, Jiannong and Bai, Luyao and Fan, Chongshan and Cao, Chengtai and Wang, Hongpeng},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}