TBD-VLA: Temporal Block Diffusion Vision Language Action Model

University of Virginia

Abstract

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models.

Overview

Overview of the Temporal Block Diffusion Vision Language Action (TBD-VLA) model. TBD-VLA formulates action sequence generation as block discrete diffusion, which incorporates autoregression and discrete diffusion into a single framework. At inference time, action tokens are decoded in parallel within blocks and autoregressively between blocks. KV caching for the prefix further accelerates inference. TBD-VLA achieves SOTA results on multiple benchmarks in simulation and in the real world while retaining a competitive inference speed.

Method

TBD-VLA represents actions as discrete tokens and factorizes the action-sequence likelihood over temporal action blocks. Within a block, tokens are decoded in parallel via masked discrete diffusion; across blocks, generation is autoregressive. This combines the efficiency of parallel decoding with explicit temporal-level autoregression.
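One way to write the block-wise factorization described above is sketched below. The notation is introduced here for illustration and is not taken from the paper: c denotes the prefix (vision, language, and state tokens), H the number of timesteps in the action sequence, m the temporal block size, and A_b the tokens of the b-th block, assuming H is divisible by m.

```latex
p_\theta(a_{1:H} \mid c) \;=\; \prod_{b=1}^{B} p_\theta\!\left(A_b \mid A_{<b},\, c\right),
\qquad A_b = a_{(b-1)m+1 \,:\, bm}, \qquad B = H / m
```

Each factor is the term that masked discrete diffusion models in parallel within the block, while the product over b is the autoregressive part.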

Training for TBD-VLA. (A) Temporal-level token shift. To match the VLM backbone's autoregressive next-token-prediction objective, we shift the prediction target at the temporal level: the logits for the current action block are generated from the prior block. This bridges the gap between the self-reconstructive formulation of discrete diffusion and next-token prediction. (B) Block-level attention masking. A doubled-layout trick processes the clean action sequence and the partially masked (corrupted) action blocks in parallel under a custom attention mask, with both copies sharing RoPE positions. This parallelizes learning across blocks in a single pass and substantially accelerates training.
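The doubled layout in (B) can be illustrated with a small mask builder. This is a minimal sketch of one plausible mask, following the general block-diffusion recipe (clean copy is block-causal, each masked block sees itself and the clean blocks before it); the exact mask used by TBD-VLA may differ, the prefix tokens are omitted, and the temporal token shift from (A) is not modeled. The function names are hypothetical.

```python
from typing import List


def block_attention_mask(num_blocks: int, block_size: int) -> List[List[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Layout (an assumption for this sketch): positions [0, n) hold the clean
    sequence and positions [n, 2n) hold the partially masked copy.
    """
    n = num_blocks * block_size

    def block_of(pos: int) -> int:
        return (pos % n) // block_size

    def is_masked_copy(pos: int) -> bool:
        return pos >= n

    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for q in range(2 * n):
        for k in range(2 * n):
            bq, bk = block_of(q), block_of(k)
            if not is_masked_copy(q) and not is_masked_copy(k):
                # Clean -> clean: block-causal (own block and earlier blocks).
                mask[q][k] = bk <= bq
            elif is_masked_copy(q) and is_masked_copy(k):
                # Masked -> masked: only within the same block.
                mask[q][k] = bk == bq
            elif is_masked_copy(q) and not is_masked_copy(k):
                # Masked -> clean: strictly earlier blocks only.
                mask[q][k] = bk < bq
            # Clean -> masked stays False: clean tokens never see corrupted ones.
    return mask


def shared_position_ids(num_blocks: int, block_size: int) -> List[int]:
    """Both copies reuse the same positions, so they share RoPE phases."""
    n = num_blocks * block_size
    return list(range(n)) + list(range(n))
```

With this layout, every masked block gets its own training signal in one forward pass, because its clean context is already present in the first half of the sequence.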

Inference. TBD-VLA combines several design choices for fast, temporally coherent decoding:

  • Decoding as needed. Only the blocks required for execution are generated, reducing the number of denoising steps.
  • Prefix KV cache. Visual/prompt tokens and previously generated blocks are cached, avoiding redundant computation across denoising steps.
  • Expectation sampling. Each scalar action component is decoded from the full predicted token distribution rather than the single most-likely token, providing a finer-grained decoding signal.
  • Real-Time Chunking (RTC). Future actions are generated asynchronously while current actions execute, using a hard in-painting strategy that aligns with TBD-VLA's masked block-diffusion training objective.
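The expectation-sampling item above can be contrasted with argmax decoding in a few lines. This is a minimal sketch under stated assumptions: action values are assumed to be discretized into uniform bins over [-1, 1], and the helper names and bin count are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(num_bins: int, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Centers of uniform bins over [low, high] (assumed discretization)."""
    width = (high - low) / num_bins
    return [low + (i + 0.5) * width for i in range(num_bins)]


def decode_argmax(logits: Sequence[float], centers: Sequence[float]) -> float:
    """Pick the center of the single most likely bin."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return centers[best]


def decode_expectation(logits: Sequence[float], centers: Sequence[float]) -> float:
    """Probability-weighted mean of the bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))
```

When two neighboring bins are almost equally likely, argmax snaps to one of them, while the expectation lands between the two centers, which is the finer-grained signal the item refers to.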

Generation Process

TBD-VLA generates the action sequence autoregressively between temporal blocks, while the tokens are unmasked in parallel within each block over a few discrete-diffusion steps. The most confident tokens are unmasked first. Each block conditions on the clean tokens of all previously generated blocks, giving the model explicit temporal structure while keeping decoding fast.

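[Figure: generation grid. The prefix holds the vision, language, and state tokens; rows are the action dimension and columns are time, with generation autoregressive across blocks.]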

The grid shows action tokens laid out as action dimension (rows) × time (columns), partitioned into temporal blocks of size m. Within a block, masked tokens are revealed in parallel over nd diffusion steps (high-confidence tokens first); once a block is complete it is frozen and the next block is decoded conditioned on it.
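The decoding loop described above can be sketched with a stand-in predictor. This is a minimal sketch, not the authors' implementation: `toy_predict` is a hypothetical placeholder for the VLM forward pass, `block_len` counts tokens per block (block size m times the number of action dimensions), and the schedule that unmasks an equal share of the remaining tokens at each step is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = -1  # sentinel for a token that has not been revealed yet

# A predictor maps (finished tokens, current block) to one
# (token, confidence) pair per position in the block.
Predictor = Callable[[Sequence[int], Sequence[int]], List[Tuple[int, float]]]


def generate_blocks(predict: Predictor, num_blocks: int, block_len: int,
                    num_steps: int) -> Tuple[List[int], int]:
    """Decode blocks left to right; inside a block, unmask in parallel."""
    done: List[int] = []   # clean tokens of all finished (frozen) blocks
    calls = 0              # number of predictor (forward) calls
    for _ in range(num_blocks):
        block = [MASK] * block_len
        for step in range(num_steps):
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            if not masked:
                break
            preds = predict(done, block)
            calls += 1
            # Reveal an equal share of what is left; the last step reveals all.
            k = math.ceil(len(masked) / (num_steps - step))
            ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            for i in ranked[:k]:
                block[i] = preds[i][0]
        done.extend(block)  # freeze the block; later blocks condition on it
    return done, calls


def toy_predict(context: Sequence[int], block: Sequence[int]) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for the model: deterministic tokens and confidences."""
    offset = len(context)
    return [((offset + i) % 256, 1.0 / (1 + i)) for i in range(len(block))]
```

For example, `generate_blocks(toy_predict, num_blocks=2, block_len=4, num_steps=2)` fills both blocks with two predictor calls each.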

Real-Time Chunking

To compensate for the inference latency in real-time control, TBD-VLA generates the next action chunk asynchronously using Real-Time Chunking (RTC). The tail of the previously generated chunk — the actions covering the latency window — is frozen and reused as an in-painting prefix for the next chunk. Each cycle the model denoises a prediction length of Ha + d timesteps: the first d are the frozen in-painting prefix, and the remaining Ha (the rollout horizon) are newly generated and then executed. This aligns well with TBD-VLA's masked block-diffusion objective, which is trained to complete action blocks from partial context, yielding temporally coherent actions despite latency.
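The chunk layout described above can be sketched as plain list manipulation. This is a minimal sketch under stated assumptions: actions are represented by arbitrary Python values, `MASK` marks timesteps that still have to be generated, and both helper names are hypothetical.

```python
from typing import Any, List, Sequence, Tuple

MASK = None  # placeholder for a timestep that still has to be generated


def build_inpainting_input(prev_actions: Sequence[Any], horizon: int,
                           delay: int) -> List[Any]:
    """Return the Ha + d slots for the next cycle.

    The first `delay` slots are the frozen tail of the previous chunk (the
    actions that execute while inference runs); the remaining `horizon`
    slots are masked and must be generated.
    """
    if delay < 0 or delay > len(prev_actions):
        raise ValueError("delay must be between 0 and len(prev_actions)")
    prefix = list(prev_actions[len(prev_actions) - delay:])
    return prefix + [MASK] * horizon


def split_chunk(chunk: Sequence[Any], delay: int) -> Tuple[List[Any], List[Any]]:
    """Split a denoised chunk into (frozen prefix, newly generated actions)."""
    return list(chunk[:delay]), list(chunk[delay:])
```

Because the prefix is copied verbatim rather than softly guided, this corresponds to the hard in-painting strategy mentioned in the inference list.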

[Figure: Real-Time Chunking timeline. The horizontal axis is the timestep, with the current time marked "now".]

Benchmarks & Tasks

Benchmarks and tasks. In simulation, TBD-VLA is evaluated across multiple robots: LIBERO and LIBERO-Plus using a Franka Panda arm, and SimplerEnv using the Google Robot and WidowX arm. In the real world, it is evaluated on three tabletop tasks with a Franka Research 3 (FR3) arm.

How TBD-VLA Compares

| Model | Size | Temporal AR | Action Decoder | Latency (s) ↓ |
|---|---|---|---|---|
| SmolVLA | 0.5B | | Flow Matching | 0.297 |
| GR00T-N1 | 2.2B | | Flow Matching | 0.131 |
| π0.5 | 3B | | Flow Matching | 0.208 |
| OpenVLA | 7B | | Autoregressive | 0.344 |
| OpenVLA-OFT | 7B | | Parallel | 0.031 |
| MolmoAct | 7B | | Autoregressive | 5.633 |
| π0-FAST | 3B | | Autoregressive | 0.767 |
| Discrete Diffusion VLA | 7B | | Discrete Diffusion | 0.069 |
| VLA-0 | 3B | | Autoregressive | 1.980 |
| TBD-VLA (Ha=12) | 2B | | Block Discrete Diffusion | 0.117 |
| TBD-VLA (Ha=8) | 2B | | Block Discrete Diffusion | 0.087 |

Comparison of VLA models by size, temporal autoregression (AR), decoding strategy, and action generation latency in the LIBERO environment. VLA-0 is autoregressive in text strings. The latency of TBD-VLA scales with the rollout horizon Ha. TBD-VLA lags behind only two methods: OpenVLA-OFT and Discrete Diffusion VLA, both of which use purely parallel decoding without autoregression.

Results

LIBERO

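| Model | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| OpenVLA-OFT | 96.2 | 98.3 | 96.2 | 90.7 | 95.4 |
| π0-FAST | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| GR00T-N1 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| MolmoAct | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| UniVLA | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| VLA-0 | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 |
| Disc Diff VLA | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 |
| UD-VLA | 94.1 | 95.7 | 91.2 | 89.6 | 92.7 |
| dVLA | 97.4 | 97.9 | 98.2 | 92.2 | 96.4 |
| TBD-VLA | 97.6 | 99.6 | 97.4 | 96.6 | 97.7 |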

Success rates (%) on the LIBERO benchmark across the four task suites. Best per column in bold; second-best underlined.

TBD-VLA achieves SOTA results on LIBERO at 97.7% average success rate.

RTC vs latency

LIBERO-Long success rate with/without RTC vs. inference latency. Stars denote zero added latency.

Real-Time Chunking under latency. Under an inference delay of 4 simulation steps, TBD-VLA with RTC retains a 93.2% success rate, 3.4 percentage points higher than π0.5 with RTC. Without RTC, performance degrades to 72.3% under the same latency, demonstrating the effectiveness of asynchronous inference enabled by TBD-VLA's temporal in-painting. See the visualization in the Real-Time Chunking section.

LIBERO-Plus

LIBERO-Plus results

Zero-shot robustness across LIBERO-Plus perturbation scenarios.

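| Model | Camera | Robot | Language | Light | Background | Noise | Layout | Avg |
|---|---|---|---|---|---|---|---|---|
| OpenVLA-OFT | 55.6 | 21.7 | 81.0 | 92.7 | 91.0 | 78.6 | 68.7 | 67.9 |
| UniVLA | 1.8 | 46.2 | 69.6 | 69.0 | 81.0 | 21.2 | 31.9 | 42.9 |
| π0-FAST | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 61.6 |
| π0 | 13.8 | 6.0 | 58.8 | 85.0 | 81.4 | 79.0 | 68.9 | 53.6 |
| RIPT-VLA | 55.2 | 31.2 | 77.6 | 88.4 | 91.6 | 73.5 | 74.2 | 68.4 |
| TBD-VLA (w/o Pre-train) | 29.4 | 62.9 | 52.1 | 89.4 | 88.8 | 61.7 | 79.0 | 66.2 |
| TBD-VLA (w/ Pre-train) | 87.8 | 60.4 | 77.4 | 95.8 | 88.8 | 89.9 | 84.4 | 83.5 |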

Zero-shot success rates (%) on LIBERO-Plus across the seven perturbation factors. Best per column in bold; second-best underlined. Baselines use the official numbers from LIBERO-Plus.

On LIBERO-Plus, which applies controlled perturbations (object layout, camera viewpoint, robot initial state, language instruction, lighting, background texture, and sensor noise), TBD-VLA reaches an 83.5% average success rate, outperforming the second-best method (RIPT-VLA, 68.4%) by 15.1 percentage points.

SimplerEnv — WidowX

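| Model | Spoon on Towel | Carrot on Plate | Stack Block | Eggplant in Basket | Avg |
|---|---|---|---|---|---|
| Octo | 12.5 | 8.3 | 0.0 | 43.1 | 16.0 |
| OpenVLA | 0.0 | 0.0 | 0.0 | 4.1 | 1.0 |
| SpatialVLA | 20.8 | 20.8 | 25.0 | 70.8 | 34.4 |
| π0 | 29.1 | 0.0 | 16.6 | 62.5 | 27.1 |
| π0-FAST | 29.1 | 21.9 | 10.8 | 66.6 | 32.1 |
| π0.5 | 44.4 | 29.2 | 18.1 | 63.9 | 38.9 |
| UniVLA | 83.3 | 66.7 | 33.3 | 95.8 | 69.8 |
| LLaDA-VLA | 56.9 | 76.3 | 30.6 | 58.3 | 55.5 |
| Disc Diff VLA | 29.2 | 29.2 | 20.8 | 70.8 | 37.5 |
| TBD-VLA | 52.0 | 86.8 | 31.2 | 97.2 | 66.8 |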

Final-success rates (%) on SimplerEnv WidowX. Best per column in bold; second-best underlined.

SimplerEnv — Google Robot

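| Model | Pick Can (VM) | Move Near (VM) | Drawer (VM) | Avg (VM) | Pick Can (VA) | Move Near (VA) | Drawer (VA) | Avg (VA) |
|---|---|---|---|---|---|---|---|---|
| Octo | 17.0 | 4.2 | 22.7 | 16.8 | 0.6 | 3.1 | 1.1 | 1.1 |
| OpenVLA | 16.3 | 46.2 | 35.6 | 27.7 | 54.5 | 47.7 | 17.7 | 39.8 |
| SpatialVLA | 86.0 | 77.9 | 57.4 | 73.8 | 88.0 | 72.7 | 41.8 | 70.7 |
| π0 | 72.7 | 65.3 | 38.3 | 58.8 | 75.2 | 63.7 | 25.6 | 54.8 |
| π0-FAST | 75.3 | 67.5 | 42.9 | 61.9 | 77.6 | 68.2 | 31.3 | 59.0 |
| InternVLA-M1 | 95.3 | 90.0 | 52.5 | 79.3 | 97.1 | 82.0 | 72.0 | 83.7 |
| TBD-VLA | 99.2 | 85.0 | 88.9 | 91.0 | 97.2 | 78.3 | 83.4 | 86.3 |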

Success rates (%) on SimplerEnv Google Robot under Visual Matching (VM) and Variant Aggregation (VA). "Drawer" averages opening and closing tasks. Best in bold; second-best underlined.

Rollouts

  • Pick Coke Can (visual matching): ✅
  • Move Near (visual matching): ✅
  • Open Drawer (visual matching): ✅
  • Close Drawer (visual matching): ✅
  • Pick Coke Can (variant aggregation): ✅
  • Move Near (variant aggregation): ✅
  • Open Drawer (variant aggregation): ✅
  • Close Drawer (variant aggregation): ✅

Real-World Experiments

We design three real-world tabletop tasks on a Franka Research 3 robot with global and in-hand RealSense D435 cameras, requiring long-horizon reasoning ("put every object on the table in the basket"), dexterity ("insert the bread into the toaster"), and reactiveness ("transfer the liquid"). Each method is evaluated under one in-distribution and three out-of-distribution settings (camera viewpoint, language, background/lighting), for 240 rollouts per method.

Real-world evaluation results

Real-world results. Across the three tasks, TBD-VLA achieves a 67.1% average success rate, outperforming π0.5 at 50.0%. RTC improves both methods; without RTC, TBD-VLA degrades to 60.0%. TBD-VLA maintains strong performance across out-of-distribution settings, demonstrating the effectiveness of temporal modeling with block diffusion.

In-distribution Rollouts

  • Everything in Bin. Instruction: "move everything to basket". ✅
  • Load Toaster. Instruction: "put the bread into the toaster". ✅
  • Transfer Liquid. Instruction: "transfer the liquid". ✅

Out-of-distribution Rollouts

  • Transfer Liquid (Background & Lighting). Instruction: "transfer the liquid." ✅
  • Transfer Liquid (Language). Instruction: "transfer the Coke." ✅
  • Transfer Liquid (Camera). Instruction: "transfer the liquid." ❌

Design Choices & Efficiency

| Configuration | Success Rate (%) ↑ | Inference Time (s) ↓ | VLM Forward Passes ↓ |
|---|---|---|---|
| m=1, nd=2, Expectation | 84.6 (−4.1) | 0.223 (+0.137) | 16 (+12) |
| m=16, nd=2, Expectation | 84.0 (−4.7) | 0.061 (−0.025) | 2 (−2) |
| m=4, nd=1, Expectation | 85.7 (−3.0) | 0.060 (−0.026) | 2 (−2) |
| m=4, nd=2, Argmax | 81.6 (−7.1) | 0.086 (0.000) | 4 (0) |
| m=4, nd=2, Expectation | 88.7 | 0.086 | 4 |

SimplerEnv Google Robot success rate, inference time, and VLM forward passes across temporal block size m, per-block diffusion steps nd, and action sampling method. Measured on a single NVIDIA RTX A40 GPU.
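The forward-pass counts in the table above are consistent with a simple rule: one VLM pass per diffusion step per decoded block. The sketch below encodes that rule. It is an inference from the reported numbers, not a formula stated in the source, and it assumes a rollout horizon of Ha=8 timesteps for this ablation; the function name is hypothetical.

```python
import math


def vlm_forward_passes(rollout_horizon: int, block_size: int,
                       steps_per_block: int) -> int:
    """Blocks needed to cover the horizon, times diffusion steps per block."""
    num_blocks = math.ceil(rollout_horizon / block_size)
    return num_blocks * steps_per_block


# Assumed horizon (Ha=8); m and nd are taken from the table rows.
for m, nd in [(1, 2), (16, 2), (4, 1), (4, 2)]:
    print(f"m={m}, nd={nd}: {vlm_forward_passes(8, m, nd)} passes")
```

Under this reading, m=1 degenerates to per-timestep autoregression (many passes), while m=16 decodes everything in one block and gives up temporal autoregression within the horizon.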

| Components | Baseline | + Decode as Needed | + KV Cache | + VLM Compile |
|---|---|---|---|---|
| Inference Speed (s) | 0.185 | 0.125 ↓ (−0.060) | 0.113 ↓ (−0.012) | 0.086 ↓ (−0.027) |

Inference speed breakdown. Decode-as-needed and KV caching are TBD-VLA optimizations; VLM compile applies PyTorch compilation to the VLM forward pass.
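As a quick check on the breakdown, the per-stage savings and the overall speedup can be recomputed from the reported latencies. The numbers are copied from the table above; the helper names are hypothetical.

```python
from typing import List, Tuple

# Latencies in seconds, copied from the breakdown table.
STAGES: List[Tuple[str, float]] = [
    ("Baseline", 0.185),
    ("+ Decode as Needed", 0.125),
    ("+ KV Cache", 0.113),
    ("+ VLM Compile", 0.086),
]


def incremental_savings(stages: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Seconds saved by each stage relative to the stage before it."""
    return [
        (name, round(prev - cur, 3))
        for (_, prev), (name, cur) in zip(stages, stages[1:])
    ]


def overall_speedup(stages: List[Tuple[str, float]]) -> float:
    """Ratio of the baseline latency to the fully optimized latency."""
    return stages[0][1] / stages[-1][1]


for name, saved in incremental_savings(STAGES):
    print(f"{name}: saves {saved:.3f} s")
print(f"overall speedup: {overall_speedup(STAGES):.2f}x")
```

Decode-as-needed accounts for the largest single saving (0.060 s), and the three optimizations together roughly halve the latency.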

BibTeX

@article{lee2026tbdvlatemporalblockdiffusion,
      title={TBD-VLA: Temporal Block Diffusion Vision Language Action Model},
      author={Lee, Sung-Wook and Kang, Xuhui and Kuo, Yen-Ling},
      journal={arXiv preprint},
      year={2026},
}