L3 · PAI-110

Imitation and Sim-to-Real

Clone a demonstrated trajectory with behavioral cloning to RMSE <= 0.12 m and correctly explain why a sim-perfect policy can fail on real hardware.

Challenge

Try this first — before any explanation.

Same rover, course C3 (an S-curve). Teleop the rover through the S-curve once; the Bench records (state, action) pairs. Train a behavioral-cloning policy to imitate your drive and run it autonomously. The trap: cloning fits your demo beautifully yet the cloned path slowly drifts off your line and into the wall on the second curve. The moment the policy makes a tiny error, the rover is in a state you never demonstrated — so it can't recover, and the error compounds. That drift is distribution shift, the same mechanism that breaks sim-trained policies on real hardware.

The Bench

Clone the demo, watch it drift, apply a remedy, and explain the sim-to-real gap.

1. Record a demo & clone it (the clone drifts off the line)

```python
# [...] action. A single demo is a NARROW tube;
# the clone drifts the moment it steps outside it (distribution shift). seed=550, C3.
import numpy as np

def expert_demo(noise=0.0, seed=550):
    # an S-curve reference path; "expert" follows it; record (state, action).
    r = np.random.default_rng(seed)
    states, actions, path = [], [], []
    x, y, th = 0.0, 0.0, 0.0
    for t in range(210):
        target_th = 0.6 * np.sin(x * 1.1)  # the S-curve heading law
        err = target_th - th
        a = 0 if abs(err) < 0.05 else (1 if err > 0 else 2)  # FWD/LEFT/RIGHT
        th += (0.08 if a == 1 else (-0.08 if a == 2 else 0.0)) + r.normal(0, noise)
        x += 0.05*np.cos(th); y += 0.05*np.sin(th)
        states.append([x, target_th, err]); actions.append(a); path.append((x, y))
    return np.array(states), np.array(actions), np.array(path)

S, A, demo_path = expert_demo()
print("demo states:", S.shape, " (a narrow tube around the S-curve)")
```

2. Clone, run, measure the compounding drift

```python
import numpy as np
RECOVERY = False   # <-- set True to add DAgger-lite recovery data (the remedy)
RANDOMIZE = False  # <-- alternative remedy: domain randomization

class Clone:
    # nearest-neighbor cloned policy over demo states (deterministic).
    def fit(self, S, A): self.S, self.A = S, A; return self
    def act(self, s):
        i = int(np.argmin(np.sum((self.S - s)**2, axis=1))); return int(self.A[i])
    def train_score(self):
        return float(np.mean([self.act(s) == a for s, a in zip(self.S, self.A)]))

S_train, A_train = S.copy(), A.copy()
if RANDOMIZE:
    for sd in (551, 552):
        s2, a2, _ = expert_demo(noise=0.02, seed=sd)
        S_train = np.vstack([S_train, s2]); A_train = np.append(A_train, a2)

policy = Clone().fit(S_train, A_train)

def run(policy, add_recovery=False):
    x, y, th = 0.0, 0.0, 0.0; drift = []
    if add_recovery:  # DAgger-lite where it drifts
        rec_s, rec_a = [], []
        xx, yy, tt = 0.0, 0.0, 0.15  # start off-line (curve 2 region)
        for t in range(210):
            target_th = 0.6*np.sin(xx*1.1); err = target_th - tt
            rec_a.append(0 if abs(err) < 0.05 else (1 if err > 0 else 2))
            rec_s.append([xx, target_th, err])
            tt += (0.08 if rec_a[-1]==1 else (-0.08 if rec_a[-1]==2 else 0))
            xx += 0.05*np.cos(tt); yy += 0.05*np.sin(tt)
        policy.fit(np.vstack([policy.S, rec_s]), np.append(policy.A, rec_a))
    for t in range(210):
        target_th = 0.6*np.sin(x*1.1); err = target_th - th
        a = policy.act([x, target_th, err])
        th += (0.08 if a==1 else (-0.08 if a==2 else 0.0))
        x += 0.05*np.cos(th); y += 0.05*np.sin(th)
        if t < len(demo_path):
            drift.append(abs(y - demo_path[t][1]))
    drift = np.array(drift)
    return {"match_rmse": float(np.sqrt(np.mean(drift**2))),
            "max_drift": float(drift.max()), "collided": bool(drift.max() > 0.35),
            "reached": bool(x > 9.0)}

result = run(policy, add_recovery=RECOVERY)
print("train fit", round(policy.train_score(), 2), " match_rmse",
      round(result["match_rmse"], 3), " max_drift", round(result["max_drift"], 3),
      " collided", result["collided"])
```

3. Sim-to-real concept check (name mechanism + cause + fix)

```python
# Answer the three-concept check (autograded on coverage, not opinion).
# Fill each string; the grader checks for the required idea.
MECHANISM = "distribution shift"  # the named mechanism
SIM_TO_REAL_CAUSE = "real-world friction, sensor noise and wheel slip produce states outside the sim training distribution"
MITIGATION = "domain randomization widens the training distribution so real states are no longer out-of-distribution"
print("concept answers recorded.")
```

4. Autograder (PASS = RMSE<=0.12, drift<=0.20, concepts present)

```python
import numpy as np
rmse = result["match_rmse"]; mx = result["max_drift"]
def has(s, kws): s = s.lower(); return any(k in s for k in kws)
mech_ok = has(MECHANISM, ["distribution shift", "covariate shift", "unseen state"])
cause_ok = has(SIM_TO_REAL_CAUSE, ["friction", "slip", "noise", "latency", "out"]) \
    and has(SIM_TO_REAL_CAUSE, ["sim", "training", "distribution", "out"])
mit_ok = has(MITIGATION, ["domain random", "dagger", "recovery", "fine-tun"]) \
    and has(MITIGATION, ["widen", "distribution", "coverage", "out-of-distribution"])
if rmse > 0.12 or result["collided"]:
    print(f"FAIL: match_rmse {rmse:.3f} m (need <=0.12), collided {result['collided']}: "
          f"classic compounding drift; add recovery demos AT the states where it drifts "
          f"(set RECOVERY=True), not where it already matches.")
elif mx > 0.20:
    print(f"FAIL: drift bound {mx:.3f} m (need <=0.20): single demo too narrow; collect "
          f"recovery data or train with RANDOMIZE=True.")
elif not mech_ok:
    print("FAIL concept: name WHY a small error grows: the policy enters states absent "
          "from its training data and can't recover (distribution shift).")
elif not (cause_ok and mit_ok):
    print("FAIL concept: state the sim->real cause (real states outside the training "
          "distribution) AND a mitigation's EFFECT (it widens that distribution).")
else:
    print(f"PASS: clone matches to {rmse:.3f} m, drift bounded {mx:.3f} m, and you named "
          f"distribution shift, its sim->real cause, and a mitigation that widens coverage.")
```

Model

The idea, built visually.

You drove the perfect line, the policy copied you — and on paper it's flawless. So why, when it drives itself, does it wander off? You only ever demonstrated the good line — a narrow tube. The first time the clone slips even slightly outside it, it's in a state you never showed it; it has no idea how to get back, so the error snowballs. That's distribution shift.
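One rough way to make "a state you never showed it" concrete is to measure how far a query state sits from the nearest demonstrated state. The sketch below is illustrative only: the helper `nearest_demo_distance` and the two-column state layout (position along the course, lateral offset) are assumptions and not part of the Bench.

```python
import numpy as np

def nearest_demo_distance(demo_states, query):
    """Euclidean distance from `query` to the closest demonstrated state."""
    demo_states = np.asarray(demo_states, dtype=float)
    query = np.asarray(query, dtype=float)
    return float(np.min(np.linalg.norm(demo_states - query, axis=1)))

# Hypothetical demo: the driver stays within about 0.02 m of the line.
xs = np.linspace(0.0, 5.0, 101)
demo = np.column_stack([xs, 0.02 * np.sin(xs)])  # columns: (x, lateral offset)

on_line = nearest_demo_distance(demo, [2.5, 0.01])   # inside the tube
off_line = nearest_demo_distance(demo, [2.5, 0.30])  # well outside it

print(f"inside the tube : {on_line:.3f} from the nearest demo state")
print(f"outside the tube: {off_line:.3f} from the nearest demo state")
```

A nearest-neighbor clone still returns an action for the off-line query, but that action was recorded for a state roughly 0.3 m away, so nothing in the data tells it how to steer back.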

Now scale up: a policy trained in a clean sim has only ever seen sim states — perfect friction, no slip. The real world hands it states the sim never produced. Same mechanism: the sim is the demonstration, reality is off the line. The cures all do one thing — widen the tube: show recovery from mistakes (DAgger), or randomize the sim so 'normal' already includes reality's mess.
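As a minimal sketch of what "randomize the sim" could look like, the snippet below resamples the dynamics at the start of every episode. The parameter names (`turn_gain`, `slip_std`), their ranges, and the -1/0/+1 command encoding are assumptions chosen for illustration; only the 0.08 rad nominal turn step comes from the Bench code.

```python
import numpy as np

def sample_sim_params(rng):
    """Draw one episode's dynamics from hypothetical ranges."""
    return {
        "turn_gain": rng.uniform(0.8, 1.2),  # assumed friction-like scaling of each turn
        "slip_std": rng.uniform(0.0, 0.03),  # assumed random heading slip per step (rad)
    }

def randomized_step(th, command, params, rng):
    """Apply a turn command (-1, 0, +1) to heading `th` under the sampled dynamics."""
    nominal = 0.08 * command  # nominal turn step in rad
    return th + params["turn_gain"] * nominal + rng.normal(0.0, params["slip_std"])

rng = np.random.default_rng(0)
headings = np.array([
    randomized_step(0.0, +1, sample_sim_params(rng), rng)
    for _ in range(200)
])

print(f"nominal heading after one turn: 0.080 rad")
print(f"randomized range over 200 episodes: "
      f"{headings.min():.3f} to {headings.max():.3f} rad")
```

The same command now produces a spread of outcomes, so a policy trained on these episodes sees many slightly different versions of "one left turn" instead of a single idealized one.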

▣ Stage animation: A clean blue demo path traces the S-curve; an amber clone starts on it, makes one tiny error, steps outside a shaded 'demonstrated tube', and spirals into the wall with a counter 0.02->0.09->0.31 m; the tube relabels TRAINING vs a wider lumpier RUN-TIME cloud; a cut to a real rover slipping on a bumpy floor; two fixes (DAgger, domain randomization) widen the tube.

Guided practice

Build it up, step by step.

Step 1 (worked): record, clone, and overlay demo (blue) vs clone (amber) with live match_rmse.
Step 2 (worked): plot distance-from-demo vs time and watch it compound.
Step 3 (faded): pick ONE remedy, either DAgger-lite recovery demos (RECOVERY=True) or domain randomization (RANDOMIZE=True), and re-run under tolerance.
Step 4 (independent): answer the structured sim-to-real explanation.
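The recovery-data remedy can be sketched on a one-dimensional toy problem: roll out the clone, have the expert label the states the clone actually visited, add those pairs to the dataset, and refit. Everything here (the `expert` rule, the constant sideways `bias`, the class and function names) is a simplified assumption for illustration, not the Bench's implementation.

```python
import numpy as np

def expert(e):
    """Hypothetical expert: steer back toward the line when |offset| > 0.05 m."""
    return -1 if e > 0.05 else (1 if e < -0.05 else 0)

class NearestNeighborPolicy:
    """Copies the action of the closest state in the dataset."""
    def __init__(self, states, actions):
        self.states = np.asarray(states, dtype=float)
        self.actions = np.asarray(actions, dtype=int)

    def __call__(self, e):
        return int(self.actions[np.argmin(np.abs(self.states - e))])

def rollout(policy, steps=50, bias=0.01):
    """Drive with `policy` under a constant sideways bias; return visited offsets."""
    e, visited = 0.0, []
    for _ in range(steps):
        visited.append(e)
        e += 0.05 * policy(e) + bias
    return np.array(visited)

# Initial demo: an undisturbed expert never leaves the line, so every label is 0.
states = np.zeros(50)
actions = np.zeros(50, dtype=int)
policy = NearestNeighborPolicy(states, actions)
before = rollout(policy)

# One DAgger-style round: the expert relabels the states the clone visited.
labels = np.array([expert(e) for e in before])
states = np.concatenate([states, before])
actions = np.concatenate([actions, labels])
policy = NearestNeighborPolicy(states, actions)
after = rollout(policy)

print(f"max offset before recovery data: {before.max():.2f} m")
print(f"max offset after recovery data : {after.max():.2f} m")
```

The added data matters because it covers the states where the clone goes wrong; extra copies of the on-line demo would not change its behavior off the line.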

Feedback

How the Bench grades your run.

PASS WHEN: trajectory-match RMSE <= 0.12 m and max drift <= 0.20 m with no collision, AND the concept check names the mechanism (distribution shift), its sim->real cause, and a valid mitigation with its effect.

FAIL: match_rmse above 0.12 m or a collision. This is compounding drift; add recovery demos at the states where it drifts (curve 2), not where it already matches.
FAIL: drift bound above 0.20 m — single demo too narrow; collect recovery data or train with randomization.
FAIL concept: you described the symptom but not the mechanism — name why a small error grows (the policy enters states absent from its training data and can't recover).
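The two numeric criteria can be reproduced on any pair of paths. The sketch below assumes both paths are sampled at the same time steps and compares only the lateral (y) coordinate, as the Bench's `run` function does; the helper name `match_metrics` and the sample values are made up for illustration, and the collision check is left out.

```python
import numpy as np

def match_metrics(clone_y, demo_y, rmse_tol=0.12, drift_tol=0.20):
    """RMSE and worst-case lateral drift between two equally sampled paths."""
    drift = np.abs(np.asarray(clone_y, dtype=float) - np.asarray(demo_y, dtype=float))
    rmse = float(np.sqrt(np.mean(drift ** 2)))
    max_drift = float(drift.max())
    return {
        "match_rmse": rmse,
        "max_drift": max_drift,
        "within_tolerance": bool(rmse <= rmse_tol and max_drift <= drift_tol),
    }

# Made-up example: the clone slides steadily away from a straight demo.
demo_y = np.zeros(4)
clone_y = np.array([0.00, 0.05, 0.10, 0.15])
m = match_metrics(clone_y, demo_y)
print(m)
```

RMSE averages the error over the whole run, while max drift catches a single bad excursion, which is why the gate checks both.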

Retrieve & space

Bring back what you've already mastered.

From 3.1: behavioral cloning is the 3.1 classifier with a different label source — what replaced the expert's hand-labels? (Your own teleop demonstration.)
From 3.2: BC needed no reward but only saw the demo's states; which suffers distribution shift more, BC or the RL policy, and why? (BC: it never explores off-demo.)
From 2.1: name two real-world state perturbations the sim never generated (for example, any two of wheel slip, friction drift, sensor latency).

Mastery gate

What you must demonstrate to advance.

Module 3 exit gate: BC policy meets RMSE <= 0.12 m, completes C3 with no collision, bounds drift <= 0.20 m, and passes the three-concept sim-to-real check (L3: clone to tolerance and reason about the sim-to-real gap).

Project

How this feeds your build.

Closes Module 3's capstone contribution: the classifier (mode selector), RL policy (learned navigation), and sim-to-real reasoning integrate into 5.1; learned navigation ships as doorway_policy(state) -> Command(heading=...).
