VLA Architecture

Learning Objectives:

  • Understand how VLA models unify vision, language, and robot actions
  • Trace the evolution from separate vision/language/policy models to unified VLAs
  • Understand the architecture of key VLA models (RT-2, Octo, OpenVLA)
  • Grasp the role of tokenization in mapping between modalities

Prerequisites: Module 3: AI-Robot Brain, basic understanding of transformers

Estimated Reading Time: 50 minutes


The Convergence

Traditional robot AI uses separate models for each capability:

Camera Image → [Vision Model] → Object Detections
Text Command → [Language Model] → Parsed Intent
(Detections + Intent) → [Policy Model] → Robot Actions

Vision-Language-Action (VLA) models collapse this pipeline into a single model:

Camera Image + Text Command → [VLA Model] → Robot Actions

This is significant because:

  1. End-to-end learning: no hand-designed interfaces between modules
  2. Language grounding: the model understands what "pick up the red cup" means visually
  3. Generalization: one model handles many tasks via language instructions

Key VLA Models

RT-2 (Robotics Transformer 2)

Google's RT-2 fine-tunes a vision-language model (PaLM-E or PaLI-X) to output robot actions:

| Component     | Details                                        |
| ------------- | ---------------------------------------------- |
| Base model    | PaLM-E (562B) or PaLI-X (55B)                  |
| Input         | Camera image + text instruction                |
| Output        | Robot actions as text tokens                   |
| Training data | 130K robot demonstrations + web-scale VL data  |

The key insight: actions are tokenized as text. A robot action like "move arm to (0.3, 0.5, 0.2)" is discretized into 256 bins per dimension and becomes a short integer string such as "166 191 153", which the model generates exactly like ordinary text tokens.

# Conceptual RT-2 inference
image = robot.get_camera_image()
instruction = "pick up the green block"

# The VLA model outputs action tokens as text
action_tokens = rt2_model.generate(
    image=image,
    text=instruction,
    max_tokens=7,  # 7-DoF action
)

# Decode tokens back to continuous actions
action = detokenize_action(action_tokens)
robot.execute(action)

Octo

A generalist robot policy from UC Berkeley:

  • Trained on 800K robot trajectories from the Open X-Embodiment dataset
  • Supports multiple robot embodiments (arms, mobile robots)
  • Uses a transformer architecture with action chunking

import jax
from octo.model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base")

# Predict actions for a new task
task = model.create_tasks(texts=["pick up the cup"])
actions = model.sample_actions(
    observations={"image_primary": camera_image},
    tasks=task,
    rng=jax.random.PRNGKey(0),
)

OpenVLA

An open-source VLA built on Llama 2:

  • 7B parameters
  • Fine-tuned on 970K robot episodes
  • Fully open-source (weights, code, data)

How VLA Models Work

1. Visual Tokenization

Images are converted to tokens using a vision encoder (ViT):

Image (224×224×3) → ViT → [v1, v2, ..., v256] (256 visual tokens)
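The patch-to-token mapping above can be sketched in NumPy. Here `patchify` is an illustrative helper, not part of any particular library: a ViT splits the image into non-overlapping patches (14×14 for a 224×224 input gives 16×16 = 256 patches), and each flattened patch is then linearly projected into a token embedding.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 14) -> np.ndarray:
    """Split an HxWxC image into flattened non-overlapping patches."""
    h, w, c = image.shape
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)           # group the two patch-grid axes together
        .reshape(-1, patch_size * patch_size * c)
    )
    return patches

image = np.zeros((224, 224, 3))
patches = patchify(image)
print(patches.shape)  # (256, 588): 256 patch tokens, each 14*14*3 = 588 values
```

In the real model, each 588-value patch is multiplied by a learned projection matrix to produce the visual tokens v1...v256.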

2. Language Tokenization

Text instructions are tokenized using the LLM's tokenizer:

"pick up the red cup" → [t1, t2, t3, t4, t5] (5 text tokens)

3. Action Tokenization

Robot actions are discretized into bins:

# Discretize a continuous action into 256 bins
def tokenize_action(action, num_bins=256):
    """Convert a continuous action in [-1, 1] to a token index."""
    normalized = (action + 1) / 2              # map [-1, 1] to [0, 1]
    bin_index = round(normalized * (num_bins - 1))
    return bin_index

# A 7-DoF action becomes 7 tokens
action = [0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 0.8]
tokens = [tokenize_action(a) for a in action]
# → [140, 89, 191, 128, 153, 115, 230]
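At inference time the model's predicted tokens must be mapped back to continuous commands. A minimal sketch of the inverse (`detokenize_action` here simply undoes the normalization above):

```python
def detokenize_action(token: int, num_bins: int = 256) -> float:
    """Convert a token index back to a continuous action in [-1, 1]."""
    normalized = token / (num_bins - 1)  # map bin index to [0, 1]
    return normalized * 2 - 1            # map [0, 1] back to [-1, 1]

# Token 191 decodes to a value close to the original 0.5 (within one bin width)
print(round(detokenize_action(191), 3))  # 0.498
```

Discretization is lossy: with 256 bins the round-trip error is bounded by about 1/128 of the action range, which is fine-grained enough for most manipulation tasks.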

4. Unified Transformer

All tokens are fed into a single transformer:

Input:  [v1...v256, t1...t5, a1...a7_prev]
Output: [a1...a7_next] (next action prediction)
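Assembling that input is plain concatenation: once every modality is reduced to integer token IDs, the transformer sees one flat sequence. A sketch with illustrative token values:

```python
# Illustrative token values; real models draw these from the ViT, the LLM
# tokenizer, and the action discretizer respectively.
visual_tokens = list(range(256))                          # v1...v256
text_tokens = [1, 2, 3, 4, 5]                             # t1...t5
prev_action_tokens = [140, 89, 191, 128, 153, 115, 230]   # a1...a7 from last step

sequence = visual_tokens + text_tokens + prev_action_tokens
print(len(sequence))  # 268 tokens fed to the transformer
```

Note how the visual tokens dominate the sequence length; this is why higher image resolutions quickly inflate inference cost.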

The Training Pipeline

1. Collect demonstrations: human teleoperation → (image, instruction, action) triples
2. Pre-train: large-scale vision-language data (internet images + text)
3. Fine-tune: robot demonstration data
4. Deploy: real-time inference on robot
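Step 1 above produces the supervised training data. A sketch of how one teleoperated trajectory expands into (image, instruction, action) triples (field names are illustrative, not a specific dataset format):

```python
# One recorded teleoperation episode (illustrative structure)
trajectory = {
    "instruction": "pick up the red cup",
    "frames": ["frame_0.png", "frame_1.png", "frame_2.png"],
    "actions": [[0.1] * 7, [0.0] * 7, [-0.1] * 7],  # one 7-DoF action per frame
}

# Each timestep becomes a supervised training example
triples = [
    (frame, trajectory["instruction"], action)
    for frame, action in zip(trajectory["frames"], trajectory["actions"])
]
print(len(triples))  # 3 training triples from a 3-frame episode
```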

Exercise: Explore VLA Tokenization

  1. Load the OpenVLA tokenizer
  2. Tokenize a sample image, instruction, and action sequence
  3. Count the total tokens and compute the sequence length
  4. Discuss: what happens as image resolution increases?

Summary

  • VLA models unify vision, language, and action into a single transformer
  • Actions are tokenized as discrete bins, enabling text-like generation
  • Key models: RT-2 (Google), Octo (Berkeley), OpenVLA (open-source)
  • VLAs enable language-conditioned robot control: "pick up the red cup"

Next: Chapter 2: VLA Training & Deployment — fine-tune and deploy VLAs.