L3 · PAI-110

Reinforcement Learning

Define a reward function and train an RL policy in-sim to convergence (smoothed reward >= 6.0) with a greedy rollout that reaches the goal safely and efficiently.

01
Challenge

Try this first — before any explanation.

Same rover, new course C2 (open arena, one goal pad, one hazard). There are no labels and no expert — you're handed a training loop already wired up. It calls one function you must write: reward(obs, action, next_obs). Make the rover reliably reach the goal by writing only the reward. The trap: the obvious sparse reward (+1 at goal, 0 otherwise) leaves the curve flat near zero — the policy almost never stumbles onto the goal by chance. Reward design, not the algorithm, is the lever.

The Bench

Write the reward function (progress + safety + time) so the provided RL loop converges.

1. Tiny RL harness (provided) + your sparse reward (flat curve)

    [...] 220
            o["collided"] = o["in_hazard"]
            return o

    def wrap_(a): return (a + np.pi) % (2*np.pi) - np.pi

    def reward(obs, action, next_obs):
        if next_obs["reached"]: return 1.0
        return 0.0  # SPARSE: curve will stay flat near zero
    print("harness ready; sparse reward defined (it will fail by design).")

2. Shaped reward (progress + your safety/time terms)

    # Dense progress signal + terminal bonus; YOU add hazard + time penalties.
    HAZARD_PENALTY = 0.0  # <-- TUNE: must exceed the progress gained by clipping the corner
    TIME_PENALTY = 0.0    # <-- TUNE: small constant, e.g. 0.01, so it doesn't loiter

    def reward(obs, action, next_obs):
        progress = obs["goal_dist"] - next_obs["goal_dist"]  # >0 when getting closer
        r = 2.0 * progress
        if next_obs["reached"]:
            r += 10.0
        if next_obs["in_hazard"]:
            r -= HAZARD_PENALTY
        r -= TIME_PENALTY
        return r
    print("shaped reward set. hazard_penalty =", HAZARD_PENALTY, " time_penalty =", TIME_PENALTY)

3. Train to convergence (deterministic policy-gradient-lite)

    # A compact, seedable trainer: a softmax policy over 4 actions conditioned on a
    # coarse state bin; climbs the reward you designed. Deterministic for grading.
    import numpy as np

    def train(reward_fn, episodes=600, seed=440):
        r = np.random.default_rng(seed)
        theta = np.zeros((6, 4))  # 6 state bins x 4 actions
        def feat(o):
            b = min(5, int(o["goal_dist"]))
            return b
        curve = []
        for ep in range(episodes):
            env = Arena(seed + (ep % 3)); o = env.reset()
            total = 0.0; grads = []
            for t in range(221):
                b = feat(o); z = theta[b] - theta[b].max(); p = np.exp(z); p /= p.sum()
                a = int(r.choice(4, p=p))
                no = env.step(a); rw = reward_fn(o, a, no); total += rw
                g = -p; g[a] += 1.0; grads.append((b, g, rw))
                o = no
                if o["done"]: break
            for (b, g, rw) in grads:
                theta[b] += 0.02 * g * (total / 50.0)  # crude REINFORCE update
            curve.append(total)
        train.policy = theta
        return curve

    def smoothed(c, w=20):
        c = np.array(c); k = np.ones(w)/w
        return np.convolve(c, k, mode='valid')

    curve = train(reward, 600, 440)
    final = float(smoothed(curve)[-1])
    print("final smoothed reward:", round(final, 2))

4. Autograder (PASS = reward>=6, reach, safe, efficient, seed 441)

    import numpy as np
    def rollout(theta, seed):
        env = Arena(seed); o = env.reset()
        def feat(o): return min(5, int(o["goal_dist"]))
        steps = 0
        for t in range(221):
            a = int(np.argmax(theta[feat(o)])); o = env.step(a); steps += 1
            if o["done"]: break
        return o.get("reached", False), o.get("collided", False), steps

    sm = smoothed(curve); slope = sm[-1] - sm[-min(100, len(sm))]
    reached, collided, steps = rollout(train.policy, 440)
    curve2 = train(reward, 600, 441); final2 = float(smoothed(curve2)[-1])
    if final < 6.0 or slope < 0:
        print(f"FAIL: final smoothed reward {final:.1f} (need >=6.0, non-decreasing). Reward too "
              f"sparse -> add a dense per-step term for closing goal_dist (potential shaping).")
    elif collided:
        print("FAIL: converged but collided in hazard - HAZARD_PENALTY is smaller than the "
              "progress gained by clipping the zone; raise it above that progress.")
    elif not reached:
        print("FAIL: reward high but reached=False - shaping rewards moving, not arriving; "
              "keep the terminal bonus and confirm progress is obs goal_dist minus next_obs goal_dist.")
    elif steps > 220:
        print(f"FAIL: reached but steps {steps} (>220) - no time penalty; subtract a small "
              f"constant each step (~0.01-0.05).")
    elif final2 < 6.0:
        print(f"FAIL: seed 440 passed but seed 441 reward {final2:.1f} - magnitudes tuned to one "
              f"init; prefer potential-based shaping invariant to seed.")
    else:
        print(f"PASS: converged to {final:.1f}, reached goal safely in {steps} steps, "
              f"re-converges at seed 441 ({final2:.1f}).")
PYTHON · NUMPY · IN-BROWSER

02
Model

The idea, built visually.

Last lesson you gave the rover answers: labeled examples. But what if all you can give it is a score, where higher is better, and it has to figure out the rest? Think of the reward as a landscape and learning as climbing it. With a reward that is flat everywhere except one pinprick at the goal, the policy is blind: it wanders, never feeling which way is up.
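To see how little a sparse reward tells a wandering policy, here is a rough sketch that counts how often a purely random walker receives any nonzero reward. Everything in it is an assumption for illustration: the unbounded grid, the start and goal cells, the `signal_rates` helper, and the Manhattan-distance progress term stand in for course C2 and are not the Bench's `Arena`.

```python
import numpy as np

# Hypothetical stand-in for course C2: unbounded grid, start (0, 0), goal (8, 8).
GOAL = np.array([8, 8])
MOVES = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])  # 4 discrete actions

def signal_rates(episodes=500, horizon=40, seed=0):
    """Fraction of random-policy steps that yield a nonzero reward."""
    rng = np.random.default_rng(seed)
    sparse_hits = dense_hits = steps = 0
    for _ in range(episodes):
        pos = np.array([0, 0])
        for t in range(horizon):
            dist = np.abs(GOAL - pos).sum()          # Manhattan distance before the move
            pos = pos + MOVES[rng.integers(4)]       # uniformly random action
            new_dist = np.abs(GOAL - pos).sum()      # Manhattan distance after the move
            reached = new_dist == 0
            sparse_hits += int(reached)              # sparse: +1 at goal, 0 otherwise
            dense_hits += int(dist - new_dist != 0)  # dense: progress term is nonzero
            steps += 1
            if reached:
                break
    return sparse_hits / steps, dense_hits / steps

sparse_rate, dense_rate = signal_rates()
print(f"steps with nonzero sparse reward: {sparse_rate:.4f}")
print(f"steps with nonzero dense reward:  {dense_rate:.4f}")
```

On this toy grid every move changes the distance by exactly one cell, so the progress term speaks on every step, while the sparse reward is silent on nearly all of them.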

Shape the landscape so getting closer already pays a little, and now there's a slope to climb. Episode by episode the policy nudges toward actions that scored well; watch the smoothed reward rise and cross the line — that's convergence. We never told it the path; we shaped the incentive, and the path fell out.
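The "nudging" can be sketched for a single state bin. This is a simplified illustration, not the Bench trainer: it assumes made-up average scores per action and follows the exact expected policy gradient, which is the quantity a sampled REINFORCE update (the `onehot(a) - p` term in the trainer) estimates from rollouts.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical average scores for 4 actions in one state bin (illustrative only).
scores = np.array([0.0, 0.2, 1.0, 0.1])

theta = np.zeros(4)              # action preferences; uniform policy at the start
for step in range(500):
    p = softmax(theta)
    expected = p @ scores        # expected score under the current policy
    # Expected gradient: sum_a p[a] * scores[a] * (onehot(a) - p) = p * (scores - expected)
    grad = p * (scores - expected)
    theta += 0.5 * grad          # gradient ascent on the expected score

p = softmax(theta)
print("action probabilities:", np.round(p, 3))
print("greedy action:", int(np.argmax(theta)))
```

Actions that score above the policy's current average gain probability and the rest lose it, which is all "climbing the landscape" means here.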

▣ Stage animation: The arena lifts into a 3-D reward surface: with sparse reward a flat plain with one lonely spike, the policy ball wandering; it morphs to a gentle slope toward the goal and the ball rolls uphill; a split shows the reward curve crossing a dashed convergence threshold while the rover's C2 path straightens episode by episode.

03
Guided practice

Build it up, step by step.

  • Step 1 (worked): run the harness with the sparse reward and plot the flat curve.
  • Step 2 (worked): swap in the provided potential-based progress term, 2.0 * (obs goal_dist minus next_obs goal_dist), plus the terminal bonus.
  • Step 3 (faded): add the hazard penalty (it must exceed the progress gained by cutting the corner) and a small per-step time penalty.
  • Step 4 (independent): train to convergence and evaluate the greedy rollout.
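For Step 3, one way to reason about how large the hazard penalty needs to be is a one-step comparison at the hazard corner. The sketch below is an assumption-laden illustration: `CUT_PROGRESS` and `DETOUR_PROGRESS` are invented numbers, `step_reward` and `cutting_pays` are hypothetical helpers, and the comparison is myopic (it ignores the terminal bonus and everything after that step). The 2.0 scale mirrors the provided progress term.

```python
def step_reward(progress, in_hazard, hazard_penalty, time_penalty=0.01,
                progress_scale=2.0):
    """Per-step reward in the same form as the shaped reward (terminal bonus omitted)."""
    r = progress_scale * progress
    if in_hazard:
        r -= hazard_penalty
    return r - time_penalty

# Hypothetical geometry at the hazard corner (illustrative numbers only).
CUT_PROGRESS = 1.0     # goal_dist closed by stepping through the hazard
DETOUR_PROGRESS = 0.4  # goal_dist closed by stepping around it

def cutting_pays(hazard_penalty):
    """True if the step through the hazard scores higher than the step around it."""
    cut = step_reward(CUT_PROGRESS, True, hazard_penalty)
    detour = step_reward(DETOUR_PROGRESS, False, hazard_penalty)
    return cut > detour

# Break-even: the penalty has to exceed the scaled extra progress from cutting.
break_even = 2.0 * (CUT_PROGRESS - DETOUR_PROGRESS)
print("break-even penalty:", round(break_even, 2))
for penalty in (0.0, 1.0, 1.5, 5.0):
    print(f"HAZARD_PENALTY={penalty}: cutting the corner pays? {cutting_pays(penalty)}")
```

The time penalty cancels out of this comparison because both options cost one step; it matters for loitering, not for the corner.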

04
Feedback

How the Bench grades your run.

PASS WHEN Smoothed final reward >= 6.0 and non-decreasing over the last 100 episodes, greedy rollout reaches the goal with no hazard entry in <= 220 steps, and re-converges at seed 441.
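The first part of that check can be sketched on its own. The moving average and the slope test below follow the Bench's `smoothed` helper and autograder; the `converged` wrapper and both curves are assumptions for illustration (synthetic data, not real training output).

```python
import numpy as np

def smoothed(curve, w=20):
    """Moving average with window w (same smoothing as the Bench)."""
    return np.convolve(np.asarray(curve, dtype=float), np.ones(w) / w, mode="valid")

def converged(curve, threshold=6.0, lookback=100):
    """Final smoothed reward >= threshold and not lower than `lookback` points earlier."""
    sm = smoothed(curve)
    final = sm[-1]
    slope = sm[-1] - sm[-min(lookback, len(sm))]
    return bool(final >= threshold and slope >= 0)

episodes = np.arange(600)
rising = 8.0 * (1.0 - np.exp(-episodes / 100.0))  # synthetic curve that saturates near 8
flat = np.zeros(600)                              # the shape a sparse reward tends to give

print("rising curve converged:", converged(rising))
print("flat curve converged:  ", converged(flat))
```

Note that the flat curve passes the slope test (zero is non-decreasing) and fails only on the threshold, which is why both conditions are needed.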

  • FAIL: final reward low and curve flat — reward is sparse; add a dense per-step term for closing goal_dist (potential-based shaping).
  • FAIL: converged but collided in hazard — hazard penalty < corner-cut progress; raise the penalty above the progress gained by clipping the zone.
  • FAIL: reward high but reached=False. The shaping rewards moving, not arriving; keep the terminal bonus and confirm progress = obs goal_dist minus next_obs goal_dist (positive when closing).

05
Retrieve & space

Bring back what you've already mastered.

  • From 3.1: what does RL have that supervised classification did not, that let it learn with no labels? (A reward signal / trial-and-error.)
  • From 2.2/2.3: which is easier to guarantee never enters the hazard, the FSM rule or the learned policy, and why? (Rules give guarantees; learned policies give statistics.)
  • From 2.1: re-train with heading_err removed from the observation: what happens to convergence and why? (The policy can only learn from what's in its state.)

06
Mastery gate

What you must demonstrate to advance.

Your reward drives the RL loop to a smoothed reward >= 6.0 that is non-decreasing over the last 100 episodes; the greedy rollout reaches the goal with no hazard entry in <= 220 steps; and training re-converges at seed 441 (L3: design a reward for a safe, efficient policy and repair a degenerate curve).

07
Project

How this feeds your build.

Feeds the capstone (5.1) as the learned navigation component integrated beneath the FSM; in 5.2 its inference loop is a candidate to push to the metal if profiling flags it.