IS-431: 5.3.2 Intelligent Behaviour & Control

5.3.2 Intelligent Behaviour & Control

RI-1, RI-2, RI-4 to RI-7

Fusion ▸ Inference ▸ Sparring ▸ LLM ▸ Sessions

The CV model from 5.3.1 produces a stream of action predictions; this section covers what happens to that stream next. All intelligent behaviour runs as a network of ROS 2 Humble nodes on the Jetson that fuse the CV predictions with IMU pad impacts, feed them into an adaptive sparring engine, and hand the resulting session data to the LLM Coach and the dashboard (5.3.3).

5.3.2.1 CV + IMU Sensor Fusion

RI-4

The CV model on its own says "the user just threw a left hook". The IMU pad on its own says "something hit me hard". Neither is enough, CV can hallucinate punches into thin air, and the IMU can't tell a hook from a cross. The punch_processor node fuses the two into confirmed punch events using three layers of logic: temporal matching, biomechanical filtering, and graceful fallback when one sensor fails.

1. Temporal matching, lining up CV and IMU in time

When an IMU pad impact exceeds 5.0 m/s² (after gravity calibration), the node queries the CV prediction ring buffer within a ±500 ms window centred on the impact timestamp. Frame-count voting selects the dominant prediction type within that window, and the result is emitted as a confirmed punch carrying both the CV confidence and the IMU force.

2. Pad-constraint filtering, rejecting biomechanically impossible matches

Even with a clean temporal match, some CV predictions are physically impossible, a left hook cannot land on the right pad. Each pad therefore only accepts the punch types that could biomechanically reach it:

Centre pad: jab and cross only
Left pad: left hook and left uppercut only
Right pad: right hook and right uppercut only
Head pad: any punch type

If the primary CV prediction violates this constraint, secondary predictions are checked against the same rules with a minimum confidence threshold of 0.25, this is what catches the model's retraction-phase wobble that 5.3.4.8 Limitations refers to.

3. Fallbacks, graceful degradation when one sensor fails

Both layers above assume CV and IMU are both producing usable data. When one of them fails, the camera obscured, a pad disconnected, the node still emits a best-effort punch event so downstream nodes never starve:

CV-only mode: ≥3 consecutive frames at ≥0.6 confidence, emitted with a 0.3 penalty
IMU-only mode: default 0.3 confidence, punch type inferred from pad location

The pipeline diagram below ties everything together, CV predictions and IMU pad impacts both flow into the punch_processor's ring buffer, get matched within the ±500 ms window, pass through the pad-constraint filter, and finally emerge as a confirmed punch event published to the rest of the system.

End-to-end behaviour of this pipeline is validated in 5.3.4.5, where the three-panel clip shows raw CV predictions (bottom-left), IMU pad impacts (right, via the Teensy simulator), and the fusion-filtered punch stream (top-left) running together.

5.3.2.2 Real-Time Inference Pipeline

RI-1, RI-2, RI-5

Getting the action model from "works on a recorded clip" to "works live on the robot" required four things in sequence, each addressed in turn below:

Frame rate the model could trust.
Per-frame compute to actually hit it.
Output stability so predictions didn't flicker between classes.
Resource budget to leave room for the LLM Coach on the same GPU.

1. Frame rate, why 30fps matters

The action prediction model was trained on features extracted at a consistent 30fps. Its voxel temporal deltas use 2-frame (67ms) and 8-frame (267ms) lookbacks, if the actual frame rate drops below 30fps, these deltas span different real-world time intervals than what the model learned, causing the model to misclassify punches. In other words, the action model's accuracy is directly tied to maintaining 30fps, even a small drop degrades predictions significantly.

2. Per-frame compute, the YOLO bottleneck

Knowing 30fps was non-negotiable, the next question was whether the pipeline could actually hit it on the Jetson. The answer at first was no, and the bottleneck was a single component:

YOLO bottleneck on the Jetson

Problem: the single biggest deployment challenge was YOLO Pose. In its default PyTorch (.pt) format, YOLO alone consumed most of the 33 ms per-frame budget, dragging the whole pipeline below 30 fps. Since YOLO provides the pose features the action model depends on, this meant the action model was being fed features at inconsistent, lower frame rates, accurate offline, broken live.

Fix: convert YOLO to a TensorRT engine (.engine format, FP16 precision, half-precision 16-bit floats, which run roughly twice as fast on GPU as the default 32-bit format with negligible accuracy loss). TensorRT optimises the computation graph for the Jetson's GPU, fusing operations and eliminating overhead. This brought YOLO inference down from ~33 ms to ~16 ms, comfortably under the frame budget.

Result: the action model became accurate and reliable in live deployment immediately , confirming the accuracy issue was never the model, only the frame rate feeding it.

Why does this speedup matter?

Both models run in series every frame, YOLO extracts the pose features the action model immediately consumes, and together they have to fit inside the 33 ms frame budget to hold 30 fps. Because the action model is small (~1.75M parameters) and was already running at ~8 ms, the entire budget effectively came down to how fast YOLO could run. That makes the TensorRT speedup the deciding factor between a pipeline that holds 30 fps and one that doesn't.

The chart below shows how that single conversion turns a frame that missed the 33 ms budget into one that fits with headroom to spare:

What does TensorRT actually do?

Conversion is more than a file rename. When TensorRT compiles a model for a specific Jetson, it does four things:

Layer fusion

Adjacent operations (e.g. conv + bias + ReLU) are fused into a single GPU kernel, eliminating intermediate memory writes.

Kernel autotuning

For each fused operation, TensorRT benchmarks several CUDA implementations on the actual hardware and picks the fastest one.

FP16 quantisation

Every weight is converted from 32-bit to 16-bit floating point, halving the memory bandwidth with negligible accuracy loss.

Engine serialisation

The optimised plan is serialised to a single binary file that loads in milliseconds, no recompilation on subsequent runs.

The result is the same model mathematically running with roughly half the memory bandwidth and a fraction of the kernel launch overhead, exactly what the inference pipeline needed to sustain 30 fps.

Three more compute-side issues

The TensorRT speedup got the per-frame budget under control, but three secondary issues stood between offline accuracy and live performance:

Sequential processing

Issue: voxel extraction and YOLO ran one after the other, doubling per-frame time.

Fix: parallelised into separate worker threads, running concurrently.

Buffer discontinuities

Issue: dropped frames created gaps in the temporal window the model relied on.

Fix: rolling FIFO feature buffer with strict frame ordering.

FPS mismatch

Issue: trained at a steady 30 fps but the live pipeline ran at variable rates.

Fix: TensorRT optimisation of both models gave consistent throughput.

Final per-frame budget

With the TensorRT speedup and the three compute fixes in place, here is what a single frame actually costs end-to-end:

Component	Latency	Notes
YOLO Pose (TensorRT)	~16ms	On GPU, single frame
Depth → voxel extraction	~10ms	Background subtraction + voxelisation
YOLO + voxel in parallel	~16ms	Separate threads, limited by slower
Model forward pass	~8ms	~1.75M params, d=192
Total per frame	~24ms	42fps theoretical, 30fps practical

The 24ms-per-frame inference budget is the CV pipeline only. The full end-to-end latency from camera capture to motor command is ~120ms, the remaining ~96ms covers RealSense capture buffering, ROS 2 message passing through punch_processor, the ±500ms fusion window's earliest-match logic, and the motor command queue. This is verified against requirement RI‑2 (≤150ms) in 5.3.4.7.

3. Output stability, cleaning up noisy predictions

A stable 30fps gets the model the inputs it expects, but the outputs still need work. Even on perfectly clean frames, the model's per-frame top class can flicker between two similar punches for a few frames at a time. Publishing every raw prediction would generate spurious events, so three filtering stages stand between the model and the rest of the system:

EMA smoothing, α = 0.35: An exponential moving average over per-class confidences damps out frame-to-frame jitter so a single noisy frame can't flip the prediction.
Hysteresis gate, 12% confidence margin: The new top class must beat the current top class by 12% confidence before the prediction is allowed to switch, this prevents two close classes from oscillating.
State machine, 2 frames to enter, 3 to hold, ≥0.78 sustain (block: 4): A class must hold for at least 2 frames at ≥0.78 confidence before it is committed, and 3 more frames to stay committed. Block requires 4 consecutive frames because of its biomechanical similarity to uppercut onset.

The CV pipeline shares the Jetson's GPU with the LLM Coach; the adaptive frame rate and memory budgeting that make this coexistence possible are detailed in 5.3.2.4.

With frame rate, compute, and output stability handled, the action model now runs reliably in real time, the foundation that the rest of this section builds on top of. Licensing implications of the tech stack (including YOLO's AGPL-3.0 licence) are addressed on the landing page.

5.3.2.3 AI Sparring Engine

RI-6

The 5 sparring styles and their Markov chain combo generation are introduced in 5.1 GUI. This section covers the ROS-side execution: how sparring_engine turns those style definitions into live robot behaviour. Five jobs run in a loop every round:

Pace attacks at the right intensity (difficulty scaling)
Adapt to the user's weak areas (weakness tracking)
Stay safe when something else in the system goes wrong (engine safeguards)
Read the user's response to each strike (defence detection)
Reactive mode for unstructured practice (free training)

1. Pacing attacks, difficulty scaling

Parameter	Easy	Medium	Hard
Attack interval	2.0s	1.2s	0.7s
Counter-punch probability	30%	50%	80%
Arm speed	8 rad/s	15 rad/s	25 rad/s

2. Adapting to the user, weakness tracking

Difficulty levels alone make sparring harder; they don't make it smarter. The sparring_weakness_profile database table records, for every punch type, how often the user successfully blocked it versus how often they were hit, and these stats persist across sessions so the robot concentrates on the gaps each time the user returns rather than feeding them openings they already handle.

Example: A user blocks 90% of jabs but only 40% of hooks. The engine will throw more hooks and fewer jabs in subsequent rounds, forcing the user to practise defending the punch type they struggle with most.

3. Staying safe, engine safeguards

A relentless attack engine that doesn't know when to stop is dangerous, both to the user and to the motors. Four runtime safeguards keep the engine safe and well-behaved when something elsewhere in the system goes wrong:

Single strike at a time

A robot_busy flag blocks any new RobotCommand while the previous strike is still executing, no motor command pile-ups.

Session watchdog

Listens for the SessionState heartbeat (2 Hz) and deactivates itself if none arrives for 5 seconds, a crashed session_manager never leaves the robot free-swinging.

Block reaction

If the user successfully blocked the last strike, the engine forces a different punch type for the next attack, the AI never feeds the same opening twice in a row.

Idle surprise

If the user goes idle for more than 3 seconds, the engine fires the next attack at 60% of the normal interval, breaking up rest periods that creep into a round.

4. Reading the user's response, defence detection

The weakness profile from #2 only works if the engine actually knows whether the user defended a strike or got hit by it. Every time the robot throws a punch, a 500 ms detection window opens and the punch_processor classifies the user's response using a priority-based system, physical contact wins over visual cues, but visual cues catch the cases where the user dodged cleanly:

Priority	Response	Detection Method
1	Hit	Arm IMU detects physical contact, the robot's punch landed
2	Block	CV model detects block pose (confidence ≥ 0.3)
3	Slip	Large displacement, user's bounding box moved ≥40px laterally or ≥0.15m in depth
4	Dodge	Moderate displacement, ≥20px laterally or ≥0.08m in depth
5	Unknown	No detectable response within the 500ms window

Defence results feed into the sparring analytics (defence rate, breakdown by type) and the weakness tracking system, which uses them to target the user's weakest areas in future sessions.

5. Reactive mode, free training

Sparring mode is structured: the robot drives, the user defends. Free training flips the loop, the user drives, the robot reacts. Whenever the user hits a pad, the robot fires back a contextually appropriate counter-punch instead of following a scripted sequence. This is the simplest engine in the section because there's no Markov chain, no weakness profile, and no difficulty scaling, just a pad-to-counter mapping:

User Hits	Robot Responds With
Centre pad	Jab or Cross
Left pad	Left Hook or Left Uppercut
Right pad	Right Hook or Right Uppercut
Head pad	Jab or Cross

A 300ms cooldown prevents rapid-fire counters, and the robot returns to guard position after 5 seconds of inactivity.

5.3.2.4 LLM Coach

RI-7

The CV pipeline and sparring engine (5.3.2.1–5.3.2.3) collect rich performance data (per-punch accuracy, block rates, power trends, weakness profiles) and the dashboard (5.3.3) visualises it. But raw statistics alone do not help a user improve. An experienced boxer seeing a 40% hook block rate knows to tighten their guard on the open side; a beginner seeing the same number does not know what is going wrong or what to practise. RI-7 requires the system to bridge this gap by interpreting session data into actionable, personalised guidance grounded in boxing domain knowledge.

A local LLM Coach running entirely on the Jetson meets this requirement without cloud dependency. (The same system is surfaced to users on the phone dashboard under the label AI Coach, see 5.3.3.3; the two names refer to the same model.) The coach appears in two places in the product: real-time tips on the robot's touchscreen during active rounds, and post-session analysis plus interactive chat on the phone dashboard. This subsection covers the on-device model itself; the dashboard-side plumbing that exposes it over FastAPI is documented in 5.3.3.

Building an on-device coach that is actually useful inside a real-time boxing system means answering five questions in order: why local at all, which model fits, how to make it sound like a coach, what happens when it fails, and how to share the GPU with the CV pipeline. The subsections below tackle each in turn.

1. Why local?, the constraint that shapes everything else

Cloud-based AI coaching would add network latency and require stable internet, neither of which is guaranteed in a gym environment. Running the LLM on-device addresses three problems at once:

Latency

Tips return in ~0.5–2 s with no network round-trip.

Reliability

Coaching keeps working when the gym WiFi doesn't.

Privacy

Training data never leaves the robot.

Committing to local raises the obvious follow-up: which model is small enough to fit alongside the CV pipeline on the Jetson but capable enough to actually coach?

2. Which model fits?, selection & deployment

Model	Gemma 4 E2B (Google, April 2026), edge-optimised with Per-Layer Embeddings, 5.1B total / 2.3B active parameters, multimodal (text + image)
Quantisation	GGUF Q4_K_M (~3.1GB), fits alongside CV models in shared GPU memory
Runtime	llama-cpp-python with full GPU offload
Response time	0.5–2 seconds per response
Context window	2,048 tokens
Generation limit	128 tokens for real-time tips; 100/256/512 tokens for chat (short/normal/detailed)

When does the coach actually speak?

Picking Gemma 4 E2B answers the “what runs” question, but a model loaded in memory is just an idle model. The engine wakes the LLM in three different situations, each with its own context payload and output style:

Mode	When	What It Does
Real-time tips	Every ~18 seconds during active rounds	1–2 sentence coaching tips based on recent punches, combo progress, and session stats. Displayed on the GUI's CoachTipBar widget.
Post-session analysis	After session completes	2–3 paragraph detailed analysis referencing specific metrics (punch count, power trends, weak areas). Can suggest specific drills via structured tags.
Interactive chat	Anytime via phone dashboard	Full conversation with the AI coach. Users can ask technique questions, request training plans, or get explanations of their stats. Word-by-word streaming display.

3. Sounding like a coach, prompt, parameters, and knowledge base

A general-purpose LLM doesn't know boxing out of the box and doesn't sound like a coach unless you tell it to. Three pieces of configuration shape its voice and its domain expertise: a system prompt that defines its persona, a tuned parameter set that balances quality against the real-time loop, and a curated knowledge base that grounds its advice in real boxing technique.

System prompt

A 24-line embedded system prompt instructs the model to act as “BoxBunny AI Coach” with knowledge of all 6 punch types and their codes, 5 boxing styles (European, Russian, American, Cuban, hybrid), and AIBA methodology (AIBA, 2015). The prompt enforces plain text output (no markdown formatting) and adjusts advice depth to the user's skill level (beginner / intermediate / advanced). Different system prompt keys are used for each of the three coaching modes above.

Inference parameters

Each parameter below was tuned during development to balance response quality, speed, and the real-time constraint that the LLM cannot block the rest of the system:

Parameter	Value	Why This Value
Temperature	0.7	Balances variety with coherence, lower values produced repetitive tips, higher values caused off-topic responses
Max tokens	128	Keeps real-time tips to 1–2 sentences, prevents mid-sentence cut-offs within the generation budget
Context window	2,048	Large enough to fit system prompt + session stats + knowledge base context without exceeding memory
GPU layers	All (-1)	Full GPU offload for fastest inference, CPU fallback too slow for real-time tips
Inference timeout	20s	If inference hangs (GPU contention, model issue), the system serves a fallback tip instead of waiting indefinitely

A standalone PySide6 chat GUI (llm_chat_gui.py) was built for iterative prompt tuning. It supports streaming token display, model selection, configurable system prompts, and sentence completion detection, used extensively to refine coaching quality before deployment. The interface is shown below.

Boxing knowledge base (retrieval-augmented generation)

The system prompt and parameters tell the model how to talk; the knowledge base tells it what to talk about. This follows a retrieval-augmented generation (RAG) approach: 17 curated boxing documents are injected into the LLM's context window during coaching analysis, grounding its advice in verified boxing technique rather than the model's general-purpose training data. This is the primary mechanism preventing hallucination; without the knowledge base, the LLM could generate plausible-sounding but incorrect technique advice that a beginner would have no way to verify.

Category	Docs	Content
Techniques	5	Jab, cross, hooks, uppercuts, stance & guard
Defence	3	Blocking, slipping, footwork
Combinations	3	Beginner to advanced progressions
Training plans	2	8-week beginner & intermediate programmes
Other	4	Conditioning, common mistakes, coaching, FAQ

Current status: functional MVP

✓ Technically validated

Tip timing (~18 s cadence)
Fallback pool (65 pre-written tips)
Cold-start fix (lazy loading at startup)
GPU coexistence with CV pipeline
4-layer failure recovery

○ Not yet validated

Whether tips are genuinely useful
Whether the knowledge base covers enough scenarios
Whether advice depth matches user skill level

Next step: a structured pilot with boxers across beginner, intermediate, and advanced levels. Results would feed back into the knowledge base, system prompts, and inference parameters to move the coach from functional MVP to validated training tool.

4. What if it fails?, reliability and fallback

Everything above describes the coach when it works. The harder problem is what happens when it doesn't, the LLM hangs under GPU contention, the model file is missing, generation runs out of memory mid-response. Four layers of defence ensure the user always sees something useful instead of an error or a dead UI:

Response quality

128-token cap prevents mid-sentence cut-offs, an explicit "keep tips SHORT" instruction is in the system prompt, and markdown is stripped from outputs to remove formatting artefacts.

Timeout protection

A 20-second hard timeout runs on a background thread, if inference hangs, the coach fails over to a fallback tip instead of blocking the rest of the system.

Fallback tips

65 pre-written tips across 4 categories (technique, encouragement, correction, suggestion) are served when the LLM is unavailable, the model file is missing, or GPU memory runs out.

Auto-recovery

The system tracks consecutive failures and attempts a model reload after 3 failures in a row, transient GPU contention recovers itself without intervention.

Together the four steps above, constraint, model selection, voice, and fallback , turn a generic edge LLM into a domain-specific coach that streams useful tips to the user during real training sessions, on a single Jetson, with no internet, and degrades gracefully when anything goes wrong.

5. Sharing the GPU, coexisting with the CV pipeline

Frame rate, compute, and output stability (covered in 5.3.2.2) all assume the CV pipeline owns the GPU. In practice it shares the Jetson with the LLM coach, the touchscreen GUI, and the ROS messaging layer. Two mechanisms keep them out of each other's way:

Adaptive frame rate. The cv_node runs at the full 30 fps only during active training sessions. Outside of active rounds (idle, countdown, rest periods) it drops to ~6 fps, freeing the GPU for LLM inference exactly when the user is reading a coaching tip rather than throwing punches.

Shared-memory footprint. The Jetson Orin NX has 16 GB of unified memory shared between CPU and GPU, so every model on the Jetson is drawing from the same pool. Knowing the exact footprint of each component matters because anything that exceeds the pool starts paging to disk and the real-time loop collapses. The footprint below has been measured under live load:

Component	Memory
Action model (TensorRT)	~200MB
YOLO Pose (TensorRT)	~150MB
Gemma 4 E2B LLM	~3.1GB
Frame buffers	~100MB
Total	~3.6GB / 16GB

Why 16 GB of RAM does not mean we can run any LLM

The Jetson has 16 GB of unified memory shared between CPU and GPU, with only ~3.6 GB used. But LLM speed is limited by memory bandwidth (how fast data moves), not capacity (how much fits). Every token generated requires reading the full model weights once:

The system originally ran Qwen2.5-3B-Instruct (~2 GB, text-only). Within months, Google released Gemma 4 E2B (April 2026), an edge-optimised model with a Per-Layer Embedding architecture that packs 5.1B total parameters into just 2.3B active parameters per inference. Despite being a more capable model with multimodal support (text + image), it fits in ~3.1 GB and runs at 16–20 tok/s on the same Jetson, fast enough for real-time coaching tips. The rapid pace of edge LLM development means this constraint continues to relax with each generation of models.

5.3.2.5 Session Management

RI-5

Every other node on this page, fusion, inference, sparring, LLM, needs to know whether a training session is currently active, resting, or idle. Without a single authority for that, each node would have to guess from its own data, drift out of sync with the others, and either keep running compute it doesn't need (wasting GPU) or miss data it should be capturing. The session_manager node is the single source of truth: it owns the lifecycle, broadcasts the current state at 2 Hz as a heartbeat, and every other node treats that heartbeat as the truth.

The 5 states and what each one triggers

Every training session, regardless of mode (techniques, sparring, free, performance tests), flows through the same 5 states. Each transition isn't just a label change, it triggers coordinated side effects across the rest of the system:

1. Idle

Default state between sessions. CV runs at the idle ~6 fps (see adaptive frame rate above). imu_node switches to navigation mode (pad hits become menu navigation, not punch events). No data is recorded.

2. Countdown

3-second pre-round window. The user can adjust the robot's height via the phone or GUI before the round begins, the audio cue plays, and the sparring engine and drill manager arm themselves but don't fire yet.

3. Active

CV ramps to full 30 fps, imu_node switches to training mode, punch_processor starts emitting confirmed punches, the sparring engine starts attacking, and the LLM coach begins emitting tips every ~18 s. All session statistics accumulate.

4. Rest

Between rounds. CV drops back to ~6 fps, the sparring engine pauses, the LLM coach gets the recent round summary as context for its next batch of tips. Round counter advances and the cycle returns to active for the next round.

5. Complete

Final round finished. The full session summary is written to the per-user SQLite database (see Section 5.3.3), the LLM coach is asked for its post-session analysis, and the GUI transitions to the results page. After a brief moment the state machine returns to idle.

Configurable cycle & session summary

The default cycle runs 3 rounds of 180 s work + 60 s rest, but the round count, work duration, and rest duration are all configurable per session via the SessionConfig message. The same 5-state machine drives every training mode, so the GUI, the dashboard, and every ROS node only ever have to handle one lifecycle.

When the session reaches the complete state, the session_manager writes a comprehensive summary to the user's database that powers every analytics view on the Dashboard:

Punch metrics, total count, distribution by punch type, peak and average force, punches per minute
Defence metrics, defence rate, breakdown by defence type (block / slip / dodge / hit), reaction times
Movement metrics, max lateral and depth displacement, movement timeline, person direction changes
Coaching context, the round-by-round LLM tips emitted during the session and the post-session analysis text

This summary is what makes the per-user database useful. Without it, every analytics view, trend chart, and personal record on the dashboard would have nothing to read from.

Back: CV & Action Prediction

Next: Dashboard & Analytics