
5.3.2 Intelligent Behaviour & Control

RI-1, RI-2, RI-4 to RI-8

All intelligent behaviour runs as a network of ROS 2 Humble nodes on the Jetson, communicating via custom message types and services. Together they form the decision-making layer between the CV perception pipeline (5.3.1) and the dashboard (5.3.3).

5.3.2.1 CV + IMU Sensor Fusion

RI-4

The CV model on its own says "the user just threw a left hook". The IMU pad on its own says "something hit me hard". Neither alone is enough: CV can hallucinate punches into thin air, and the IMU can't tell a hook from a cross. The punch_processor node fuses the two into confirmed punch events using three layers of logic: temporal matching, biomechanical filtering, and graceful fallback when one sensor fails.

1. Temporal matching, lining up CV and IMU in time

When an IMU pad impact exceeds 5.0 m/s² (after gravity calibration), the node queries the CV prediction ring buffer within a ±500 ms window centred on the impact timestamp. Frame-count voting selects the dominant prediction type within that window, and the result is emitted as a confirmed punch carrying both the CV confidence and the IMU force.
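A minimal sketch of this matching step, assuming the ring buffer holds (timestamp, type, confidence) tuples oldest-first; the names and the mean-confidence choice are illustrative, not punch_processor's actual API:

```python
from collections import Counter, deque

IMPACT_THRESHOLD = 5.0   # m/s^2, after gravity calibration
WINDOW_S = 0.5           # ±500 ms matching window

def match_cv_to_impact(cv_buffer: deque, impact_t: float):
    """Frame-count voting over CV predictions within ±500 ms of the impact.

    cv_buffer holds (timestamp, punch_type, confidence) tuples.
    Returns (punch_type, mean confidence) or None if no CV frames matched.
    """
    window = [(t, p, c) for (t, p, c) in cv_buffer
              if abs(t - impact_t) <= WINDOW_S]
    if not window:
        return None
    votes = Counter(p for _, p, _ in window)          # dominant type wins
    punch_type, _ = votes.most_common(1)[0]
    confs = [c for _, p, c in window if p == punch_type]
    return punch_type, sum(confs) / len(confs)
```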

2. Pad-constraint filtering, rejecting biomechanically impossible matches

Even with a clean temporal match, some CV predictions are physically impossible: a left hook cannot land on the right pad. Each pad therefore accepts only the punch types that could biomechanically reach it.

If the primary CV prediction violates this constraint, secondary predictions are checked against the same rules with a minimum confidence threshold of 0.25; this is what catches the model's retraction-phase wobble that 5.3.4.8 Limitations refers to.
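The constraint check reduces to a lookup plus a fallback scan. The pad-to-punch mapping below is illustrative only; the real table is a tuning decision inside the node:

```python
SECONDARY_MIN_CONF = 0.25

# Illustrative constraint map: which punch types can physically reach
# each pad. Treat these entries as placeholders, not the actual config.
PAD_CONSTRAINTS = {
    "left_pad":   {"right_hook", "right_uppercut", "jab", "cross"},
    "right_pad":  {"left_hook", "left_uppercut", "jab", "cross"},
    "centre_pad": {"jab", "cross"},
}

def filter_by_pad(pad: str, predictions: list[tuple[str, float]]):
    """predictions: (punch_type, confidence) pairs, primary first.
    Returns the first biomechanically plausible prediction, or None."""
    allowed = PAD_CONSTRAINTS.get(pad, set())
    primary_type, primary_conf = predictions[0]
    if primary_type in allowed:
        return primary_type, primary_conf
    # Primary violates the constraint: scan secondary predictions, which
    # is what catches the retraction-phase wobble (see 5.3.4.8).
    for punch_type, conf in predictions[1:]:
        if punch_type in allowed and conf >= SECONDARY_MIN_CONF:
            return punch_type, conf
    return None
```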

3. Fallbacks, graceful degradation when one sensor fails

Both layers above assume CV and IMU are both producing usable data. When one of them fails (the camera obscured, a pad disconnected), the node still emits a best-effort punch event so downstream nodes never starve.
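A sketch of that degradation logic, with illustrative field names; the real node publishes a ROS message rather than a dict:

```python
def fuse(cv_match, imu_impact):
    """Graceful degradation: emit a best-effort event when one sensor fails.
    cv_match is (punch_type, confidence) or None; imu_impact is a dict
    with a "force" reading or None. Field names are illustrative."""
    if cv_match and imu_impact:
        return {"type": cv_match[0], "confidence": cv_match[1],
                "force": imu_impact["force"], "source": "fused"}
    if imu_impact:   # camera obscured: type unknown, but the hit was real
        return {"type": "unknown", "confidence": 0.0,
                "force": imu_impact["force"], "source": "imu_only"}
    if cv_match:     # pad disconnected: trust CV alone, no force reading
        return {"type": cv_match[0], "confidence": cv_match[1],
                "force": None, "source": "cv_only"}
    return None
```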

The pipeline diagram below ties everything together: CV predictions and IMU pad impacts both flow into the punch_processor's ring buffer, get matched within the ±500 ms window, pass through the pad-constraint filter, and finally emerge as a confirmed punch event published to the rest of the system.

5.3.2.2 Real-Time Inference Pipeline

RI-1, RI-2, RI-5

Getting the action model from "works on a recorded clip" to "works live on the robot" required four things in sequence: a frame rate the model could trust, enough per-frame compute to actually hit it, output that didn't flicker between classes, and a resource budget that left room for the LLM coach to run on the same GPU. Each subsection below tackles one of those four problems, in order.

1. Frame rate, why 30fps matters

The action prediction model was trained on features extracted at a consistent 30 fps. Its voxel temporal deltas use 2-frame (67 ms) and 8-frame (267 ms) lookbacks; if the actual frame rate drops below 30 fps, those deltas span longer real-world intervals than the ones the model learned (at 20 fps, for example, the 2-frame delta stretches from 67 ms to 100 ms), causing the model to misclassify punches. In other words, the action model's accuracy is directly tied to maintaining 30 fps; even a small drop degrades predictions significantly.

2. Per-frame compute, the YOLO bottleneck

Knowing 30fps was non-negotiable, the next question was whether the pipeline could actually hit it on the Jetson. The answer at first was no, and the bottleneck was a single component:

YOLO bottleneck on the Jetson

Problem: the single biggest deployment challenge was YOLO Pose. In its default PyTorch (.pt) format, YOLO alone consumed most of the 33 ms per-frame budget, dragging the whole pipeline below 30 fps. Since YOLO provides the pose features the action model depends on, the action model was being fed features at inconsistent, lower frame rates: accurate offline, broken live.

Fix: convert YOLO to a TensorRT engine (.engine format at FP16 precision: half-precision 16-bit floats run roughly twice as fast on GPU as the default 32-bit format, with negligible accuracy loss). TensorRT optimises the computation graph for the Jetson's GPU, fusing operations and eliminating overhead. This brought YOLO inference down from ~33 ms to ~16 ms, comfortably under the frame budget.

Result: the action model became accurate and reliable in live deployment immediately, confirming the accuracy issue was never the model, only the frame rate feeding it.

Why does this speedup matter?

Both models run in series every frame, YOLO extracts the pose features the action model immediately consumes, and together they have to fit inside the 33 ms frame budget to hold 30 fps. Because the action model is small (~1.75M parameters) and was already running at ~8 ms, the entire budget effectively came down to how fast YOLO could run. That makes the TensorRT speedup the deciding factor between a pipeline that holds 30 fps and one that doesn't.

[Diagram: YOLO Pose conversion. Ultralytics ships native TensorRT export, so conversion is one-shot with no intermediate file: yolo26n-pose.pt (Ultralytics weights) → model.export(format="engine", half=True) → yolo26n-pose.engine (TensorRT FP16 engine). Inference latency: ~33 ms → ~16 ms, 2× faster.]
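In code, the conversion in the diagram is two lines of the standard Ultralytics API (file names are the ones shown above; the runtime-load lines are the usual Ultralytics pattern rather than anything project-specific). Run the export on the Jetson itself, since TensorRT engines are tuned to the GPU that builds them:

```python
from ultralytics import YOLO

# One-shot conversion: export where you deploy, because the engine is
# optimised for the specific GPU it is built on.
model = YOLO("yolo26n-pose.pt")
model.export(format="engine", half=True)   # writes yolo26n-pose.engine (FP16)

# At runtime the engine loads like any other Ultralytics model:
trt_model = YOLO("yolo26n-pose.engine")
results = trt_model("frame.jpg")           # ~16 ms/frame on the Orin NX
```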

The chart below shows how that single conversion turns a frame that missed the 33 ms budget into one that fits with headroom to spare:

[Chart: per-frame inference budget against the 33 ms ceiling for 30 fps. Before (PyTorch): YOLO ~33 ms + action ~8 ms = ~41 ms, ✕ misses 30 fps. After (TensorRT FP16): YOLO ~16 ms + action ~8 ms = ~24 ms, ✓ fits 30 fps with ~9 ms headroom.] The action model is in series with YOLO (it consumes YOLO's pose output every frame), so any time YOLO spends over budget directly drags the action model below 30 fps. The TensorRT speedup is what makes the whole pipeline real-time.

What does TensorRT actually do?

Conversion is more than a file rename. When TensorRT compiles a model for a specific Jetson, it does four things:

Layer fusion
Adjacent operations (e.g. conv + bias + ReLU) are fused into a single GPU kernel, eliminating intermediate memory writes.
Kernel autotuning
For each fused operation, TensorRT benchmarks several CUDA implementations on the actual hardware and picks the fastest one.
FP16 quantisation
Every weight is converted from 32-bit to 16-bit floating point, halving the memory bandwidth with negligible accuracy loss.
Engine serialisation
The optimised plan is serialised to a single binary file that loads in milliseconds, no recompilation on subsequent runs.

The result is the same model mathematically running with roughly half the memory bandwidth and a fraction of the kernel launch overhead, exactly what the inference pipeline needed to sustain 30 fps.

Three more compute-side issues

The TensorRT speedup got the per-frame budget under control, but three secondary issues stood between offline accuracy and live performance:

Sequential processing
Issue: voxel extraction and YOLO ran one after the other, doubling per-frame time.
Fix: parallelised into separate worker threads, running concurrently.
Buffer discontinuities
Issue: dropped frames created gaps in the temporal window the model relied on.
Fix: rolling FIFO feature buffer with strict frame ordering.
FPS mismatch
Issue: trained at a steady 30 fps but the live pipeline ran at variable rates.
Fix: TensorRT optimisation of both models gave consistent throughput.
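A sketch of the first two fixes together: a two-thread frame stage and a strictly ordered FIFO buffer. The stage callables stand in for the real YOLO and voxel workers, and the structure is an illustration of the approach, not the node's actual code:

```python
import threading
from collections import deque

class FeatureBuffer:
    """Rolling FIFO with strict frame ordering: late or duplicate frames
    are dropped rather than corrupting the model's temporal window."""
    def __init__(self, maxlen: int = 8):
        self._buf = deque(maxlen=maxlen)
        self._lock = threading.Lock()
        self._last_id = -1

    def push(self, frame_id: int, features) -> bool:
        with self._lock:
            if frame_id <= self._last_id:
                return False            # out-of-order frame: discard
            self._buf.append((frame_id, features))
            self._last_id = frame_id
            return True

def process_frame(frame, depth, run_yolo, extract_voxels):
    """Run the two extraction stages concurrently, so per-frame cost is
    max(~16 ms, ~10 ms) rather than their sum. run_yolo/extract_voxels
    are caller-supplied stand-ins for the real pipeline stages."""
    results = {}
    t_pose = threading.Thread(
        target=lambda: results.__setitem__("pose", run_yolo(frame)))
    t_vox = threading.Thread(
        target=lambda: results.__setitem__("voxels", extract_voxels(depth)))
    t_pose.start(); t_vox.start()
    t_pose.join(); t_vox.join()
    return results
```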

Final per-frame budget

With the TensorRT speedup and the three compute fixes in place, here is what a single frame actually costs end-to-end:

Component Latency Notes
YOLO Pose (TensorRT) ~16ms On GPU, single frame
Depth → voxel extraction ~10ms Background subtraction + voxelisation
YOLO + voxel in parallel ~16ms Separate threads, limited by slower
Model forward pass ~8ms ~1.75M params, d=192
Total per frame ~24ms 42fps theoretical, 30fps practical

The 24 ms-per-frame inference budget covers the CV pipeline only. The full end-to-end latency from camera capture to motor command is ~120 ms; the remaining ~96 ms covers RealSense capture buffering, ROS 2 message passing through punch_processor, the ±500 ms fusion window's earliest-match logic, and the motor command queue. This is verified against requirement RI‑2 (≤150 ms) in 5.3.4.7.

3. Output stability, cleaning up noisy predictions

A stable 30fps gets the model the inputs it expects, but the outputs still need work. Even on perfectly clean frames, the model's per-frame top class can flicker between two similar punches for a few frames at a time. Publishing every raw prediction would generate spurious events, so three filtering stages stand between the model and the rest of the system:

EMA smoothing, α = 0.35
An exponential moving average over per-class confidences damps out frame-to-frame jitter so a single noisy frame can't flip the prediction.
Hysteresis gate, 12% confidence margin
The new top class must beat the current top class by 12% confidence before the prediction is allowed to switch; this prevents two close classes from oscillating.
State machine, 2 frames to enter, 3 to hold, ≥0.78 sustain (block: 4)
A class must hold for at least 2 frames at ≥0.78 confidence before it is committed, and 3 more frames to stay committed. Block requires 4 consecutive frames because of its biomechanical similarity to uppercut onset.
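The three stages compose into a small filter. The sketch below uses the constants quoted above but simplifies the commit logic (the 3-frame hold-to-stay rule is omitted for brevity), and the class-index interface is an assumption:

```python
import numpy as np

ALPHA = 0.35        # EMA smoothing factor
MARGIN = 0.12       # hysteresis: challenger must lead by 12%
SUSTAIN = 0.78      # confidence floor to commit a class
ENTER_FRAMES = 2    # frames at >= SUSTAIN to commit (block needs 4)

class PredictionFilter:
    def __init__(self, n_classes: int, block_idx: int):
        self.ema = np.zeros(n_classes)
        self.block_idx = block_idx
        self.committed = None    # class index published downstream
        self.candidate = None
        self.streak = 0

    def update(self, confidences: np.ndarray):
        # 1. EMA smoothing: one noisy frame cannot flip the output.
        self.ema = ALPHA * confidences + (1 - ALPHA) * self.ema
        top = int(self.ema.argmax())
        # 2. Hysteresis gate: to dethrone the committed class, the
        #    challenger must lead it by MARGIN.
        if self.committed is not None and top != self.committed:
            if self.ema[top] - self.ema[self.committed] < MARGIN:
                return self.committed
        # 3. State machine: the candidate must sustain >= SUSTAIN
        #    confidence for ENTER_FRAMES frames (4 for block) to commit.
        if top != self.candidate or self.ema[top] < SUSTAIN:
            self.candidate, self.streak = top, 0
            return self.committed
        self.streak += 1
        need = 4 if top == self.block_idx else ENTER_FRAMES
        if self.streak >= need:
            self.committed = top
        return self.committed
```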

4. Resource sharing, coexisting with the LLM

Frame rate, compute, and output stability all assume the CV pipeline owns the GPU. In practice it shares the Jetson with the LLM coach, the touchscreen GUI, and the ROS messaging layer. Two mechanisms keep them out of each other's way:

Adaptive frame rate. The cv_node runs at the full 30 fps only during active training sessions. Outside of active rounds (idle, countdown, rest periods), it drops to ~6 fps, freeing the GPU for LLM inference exactly when the user is reading a coaching tip rather than throwing punches.

Shared-memory footprint. The Jetson Orin NX has 16 GB of unified memory shared between CPU and GPU, so every model on the Jetson is drawing from the same pool. Knowing the exact footprint of each component matters because anything that exceeds the pool starts paging to disk and the real-time loop collapses. The footprint below has been measured under live load:

Component Memory
Action model (TensorRT) ~200MB
YOLO Pose (TensorRT) ~150MB
Gemma 4 E2B LLM ~3.1GB
Frame buffers ~100MB
Total ~3.6GB / 16GB
Why 16 GB of RAM does not mean we can run any LLM

The Jetson has 16 GB of unified memory shared between CPU and GPU, with only ~3.6 GB used. But LLM speed is limited by memory bandwidth (how fast data moves), not capacity (how much fits). Every token generated requires reading the full model weights once:

[Diagram: capacity vs. bandwidth on the Jetson Orin NX (16 GB unified LPDDR5). Capacity, how much fits: ~3.6 GB used, ~12.4 GB free, so a 7B model (~4 GB) easily fits; capacity is not the problem. Bandwidth, the speed limit: on the 102 GB/s bus, Gemma 4 E2B (reading ~3 GB per token) tops out at ~34 tok/s with headroom left for CV, while a 7B model (reading ~4 GB per token) tops out at ~25 tok/s and saturates the entire bus.]

Model Size (Q4) Fits in 16 GB? Tokens/s (with CV) Fast enough for tips?
Gemma 4 E2B ~3.1 GB Yes ~16–20 tok/s ✓ Yes
7B–8B model ~4 GB Yes <15 tok/s ✕ Too slow
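The ceilings in the diagram come straight from one division; a quick sketch using the figures quoted above (the 102 GB/s peak and per-token read sizes):

```python
# Decode speed is bandwidth-bound: every generated token streams the
# full quantised weights through the memory bus once.
BANDWIDTH_GBS = 102.0   # Orin NX LPDDR5 theoretical peak, shared with CV

def max_tokens_per_sec(weights_gb: float) -> float:
    """Upper bound on generation speed, ignoring compute and CV load."""
    return BANDWIDTH_GBS / weights_gb

print(max_tokens_per_sec(3.0))  # Gemma 4 E2B: ~34 tok/s ceiling
print(max_tokens_per_sec(4.0))  # 7B @ Q4: ~25 tok/s ceiling, bus saturated
```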

The system originally ran Qwen2.5-3B-Instruct (~2 GB, text-only). Within months, Google released Gemma 4 E2B (April 2026), an edge-optimised model with a Per-Layer Embedding architecture that packs 5.1B total parameters into just 2.3B active parameters per inference. Despite being a more capable model with multimodal support (text + image), it fits in ~3.1 GB and runs at 16–20 tok/s on the same Jetson, fast enough for real-time coaching tips. The rapid pace of edge LLM development means this constraint continues to relax with each generation of models.

With frame rate, compute, output stability, and resource sharing all handled, the action model now runs reliably in real time, the foundation that the rest of this section builds on top of. Licensing implications of the tech stack (including YOLO's AGPL-3.0 licence) are addressed on the landing page.

5.3.2.3 AI Sparring Engine

RI-6

The 5 sparring styles and their Markov chain combo generation are introduced in 5.1 GUI. This section covers the ROS-side execution: how sparring_engine turns those style definitions into live robot behaviour. Five jobs run in a loop every round:

  1. Pace attacks at the right intensity (difficulty scaling)
  2. Adapt to the user's weak areas (weakness tracking)
  3. Stay safe when something else in the system goes wrong (engine safeguards)
  4. Read the user's response to each strike (defence detection)
  5. Reactive mode for unstructured practice (free training)

1. Pacing attacks, difficulty scaling

Parameter Easy Medium Hard
Attack interval 2.0s 1.2s 0.7s
Counter-punch probability 30% 50% 80%
Arm speed 8 rad/s 15 rad/s 25 rad/s
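As configuration this table is nothing more than a lookup keyed by difficulty; an illustrative rendering with assumed field names:

```python
# Illustrative difficulty config mirroring the table above; the engine's
# actual parameter names may differ.
DIFFICULTY = {
    "easy":   {"attack_interval_s": 2.0, "counter_prob": 0.30, "arm_speed_rad_s": 8},
    "medium": {"attack_interval_s": 1.2, "counter_prob": 0.50, "arm_speed_rad_s": 15},
    "hard":   {"attack_interval_s": 0.7, "counter_prob": 0.80, "arm_speed_rad_s": 25},
}
```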

2. Adapting to the user, weakness tracking

Difficulty levels alone make sparring harder; they don't make it smarter. The sparring_weakness_profile database table records, for every punch type, how often the user successfully blocked it versus how often they were hit. These stats persist across sessions, so returning to the robot the next day continues from where the user left off.

Example: A user blocks 90% of jabs but only 40% of hooks. The engine will throw more hooks and fewer jabs in subsequent rounds, forcing the user to practise defending the punch type they struggle with most.

Over time the robot stops feeding the user openings they already handle well and concentrates on the gaps, forcing continuous improvement instead of comfortable repetition.
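One plausible way to turn those block rates into attack selection is inverse-rate weighted sampling. The sketch below assumes the profile maps punch type to blocked/hit counts; it illustrates the idea rather than the engine's exact scheme:

```python
import random

def pick_punch(weakness: dict[str, dict[str, int]]) -> str:
    """Bias attack selection toward punches the user blocks least often.

    weakness maps punch type -> {"blocked": n, "hit": n}, as the
    sparring_weakness_profile table might record it (names assumed).
    """
    types, weights = [], []
    for punch, stats in weakness.items():
        total = stats["blocked"] + stats["hit"]
        block_rate = stats["blocked"] / total if total else 0.5
        types.append(punch)
        weights.append(1.0 - block_rate + 0.1)  # floor keeps every type in play
    return random.choices(types, weights=weights, k=1)[0]

# A user who blocks 90% of jabs but only 40% of hooks sees hooks roughly
# 3.5x as often (weights 0.2 vs 0.7):
profile = {"jab": {"blocked": 9, "hit": 1}, "hook": {"blocked": 4, "hit": 6}}
```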

3. Staying safe, engine safeguards

A relentless attack engine that doesn't know when to stop is dangerous, both to the user and to the motors. Four runtime safeguards keep the engine safe and well-behaved when something elsewhere in the system goes wrong:

Single strike at a time
A robot_busy flag blocks any new RobotCommand while the previous strike is still executing, no motor command pile-ups.
Session watchdog
Listens for the SessionState heartbeat (2 Hz) and deactivates itself if none arrives for 5 seconds, a crashed session_manager never leaves the robot free-swinging.
Block reaction
If the user successfully blocked the last strike, the engine forces a different punch type for the next attack, the AI never feeds the same opening twice in a row.
Idle surprise
If the user goes idle for more than 3 seconds, the engine fires the next attack at 60% of the normal interval, breaking up rest periods that creep into a round.
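Of the four safeguards, the watchdog is the one worth sketching, since it is pure timing logic. A minimal version, assuming a monotonic clock and the 2 Hz heartbeat described above (the real node would hang this off a ROS 2 subscription and timer):

```python
import time

HEARTBEAT_TIMEOUT_S = 5.0   # no SessionState for 5 s => deactivate

class SessionWatchdog:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, msg) -> None:    # SessionState arrives at 2 Hz
        self.last_heartbeat = time.monotonic()

    def engine_may_attack(self) -> bool:
        # A crashed session_manager silently stops the heartbeat; the
        # engine must notice and stand down rather than free-swing.
        return time.monotonic() - self.last_heartbeat < HEARTBEAT_TIMEOUT_S
```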

4. Reading the user's response, defence detection

The weakness profile from #2 only works if the engine actually knows whether the user defended a strike or got hit by it. Every time the robot throws a punch, a 500 ms detection window opens and the punch_processor classifies the user's response using a priority-based system: physical contact wins over visual cues, but visual cues catch the cases where the user dodged cleanly:

Priority Response Detection Method
1 Hit Arm IMU detects physical contact, the robot's punch landed
2 Block CV model detects block pose (confidence ≥ 0.3)
3 Slip Large displacement, user's bounding box moved ≥40px laterally or ≥0.15m in depth
4 Dodge Moderate displacement, ≥20px laterally or ≥0.08m in depth
5 Unknown No detectable response within the 500ms window
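The table collapses into a short priority cascade. Thresholds below mirror the table; the argument names are illustrative:

```python
def classify_defence(imu_hit: bool, block_conf, dx_px: float, dz_m: float) -> str:
    """Priority cascade over the 500 ms post-strike window."""
    if imu_hit:                                  # 1. physical contact wins
        return "hit"
    if block_conf is not None and block_conf >= 0.3:
        return "block"                           # 2. CV sees a block pose
    if dx_px >= 40 or dz_m >= 0.15:
        return "slip"                            # 3. large displacement
    if dx_px >= 20 or dz_m >= 0.08:
        return "dodge"                           # 4. moderate displacement
    return "unknown"                             # 5. nothing in the window
```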

Defence results feed into the sparring analytics (defence rate, breakdown by type) and the weakness tracking system, which uses them to target the user's weakest areas in future sessions.

5. Reactive mode, free training

Sparring mode is structured: the robot drives, the user defends. Free training flips the loop: the user drives, the robot reacts. Whenever the user hits a pad, the robot fires back a contextually appropriate counter-punch instead of following a scripted sequence. This is the simplest engine in the section because there's no Markov chain, no weakness profile, and no difficulty scaling, just a pad-to-counter mapping:

User Hits Robot Responds With
Centre pad Jab or Cross
Left pad Left Hook or Left Uppercut
Right pad Right Hook or Right Uppercut
Head pad Jab or Cross

A 300ms cooldown prevents rapid-fire counters, and the robot returns to guard position after 5 seconds of inactivity.
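The whole engine therefore fits in a lookup plus a cooldown check; a sketch with the mapping and timings from above (punch names and pad keys are illustrative spellings):

```python
import random
import time

COUNTERS = {
    "centre_pad": ["jab", "cross"],
    "left_pad":   ["left_hook", "left_uppercut"],
    "right_pad":  ["right_hook", "right_uppercut"],
    "head_pad":   ["jab", "cross"],
}
COOLDOWN_S = 0.3   # suppress rapid-fire counters

class FreeTraining:
    def __init__(self):
        self._last = 0.0

    def on_pad_hit(self, pad: str):
        """Return a counter-punch for this pad hit, or None during cooldown."""
        now = time.monotonic()
        if now - self._last < COOLDOWN_S:
            return None
        self._last = now
        return random.choice(COUNTERS[pad])
```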

5.3.2.4 LLM Coach

RI-7

A local AI coach runs entirely on the Jetson, no cloud, no internet. Building one that's actually useful inside a real-time boxing system means answering four questions in order: why local at all, which model fits, how to make it sound like a coach, and what happens when it fails. The four subsections below tackle each in turn.

1. Why local?, the constraint that shapes everything else

Cloud-based AI coaching would add network latency and require stable internet, neither of which is guaranteed in a gym environment. Running the LLM on-device addresses three problems at once:

Latency
Tips return in ~0.5–2 s with no network round-trip.
Reliability
Coaching keeps working when the gym WiFi doesn't.
Privacy
Training data never leaves the robot.

Committing to local raises the obvious follow-up: which model is small enough to fit alongside the CV pipeline on the Jetson but capable enough to actually coach?

2. Which model fits?, selection & deployment

Model Gemma 4 E2B (Google, April 2026), edge-optimised with Per-Layer Embeddings, 5.1B total / 2.3B active parameters, multimodal (text + image)
Quantisation GGUF Q4_K_M (~3.1GB), fits alongside CV models in shared GPU memory
Runtime llama-cpp-python with full GPU offload
Response time 0.5–2 seconds per response
Context window 2,048 tokens
Generation limit 128 tokens for real-time tips; 100/256/512 tokens for chat (short/normal/detailed)

When does the coach actually speak?

Picking Gemma 4 E2B answers the “what runs” question, but a model loaded in memory is just an idle model. The engine wakes the LLM in three different situations, each with its own context payload and output style:

Mode When What It Does
Real-time tips Every ~18 seconds during active rounds 1–2 sentence coaching tips based on recent punches, combo progress, and session stats. Displayed on the GUI's CoachTipBar widget.
Post-session analysis After session completes 2–3 paragraph detailed analysis referencing specific metrics (punch count, power trends, weak areas). Can suggest specific drills via structured tags.
Interactive chat Anytime via phone dashboard Full conversation with the AI coach. Users can ask technique questions, request training plans, or get explanations of their stats. Word-by-word streaming display.

The system also includes personality modes to keep interactions engaging: an advice mode for motivational pep talks and encouragement, and a Singlish mode that responds in Singaporean colloquial English, adding local flavour to the coaching experience. These modes are selectable via the chat interface and demonstrate the flexibility of the system-prompt approach.

3. Sounding like a coach, prompt, parameters, and knowledge base

A general-purpose LLM doesn't know boxing out of the box and doesn't sound like a coach unless you tell it to. Three pieces of configuration shape its voice and its domain expertise: a system prompt that defines its persona, a tuned parameter set that balances quality against the real-time loop, and a curated knowledge base that grounds its advice in real boxing technique.

System prompt

A 24-line embedded system prompt instructs the model to act as “BoxBunny AI Coach” with knowledge of all 6 punch types and their codes, 5 boxing styles (European, Russian, American, Cuban, hybrid), and AIBA methodology (AIBA, 2015). The prompt enforces plain text output (no markdown formatting) and adjusts advice depth to the user's skill level (beginner / intermediate / advanced). Different system prompt keys are used for each of the three coaching modes above.

Inference parameters

Each parameter below was tuned during development to balance response quality, speed, and the real-time constraint that the LLM cannot block the rest of the system:

Parameter Value Why This Value
Temperature 0.7 Balances variety with coherence, lower values produced repetitive tips, higher values caused off-topic responses
Max tokens 128 Keeps real-time tips to 1–2 sentences, prevents mid-sentence cut-offs within the generation budget
Context window 2,048 Large enough to fit system prompt + session stats + knowledge base context without exceeding memory
GPU layers All (-1) Full GPU offload for fastest inference, CPU fallback too slow for real-time tips
Inference timeout 20s If inference hangs (GPU contention, model issue), the system serves a fallback tip instead of waiting indefinitely
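Wired into llama-cpp-python, those parameters look roughly like this; the model path and prompt contents are placeholders, and the timeout lives outside this call (see the fallback section below):

```python
from llama_cpp import Llama

SYSTEM_PROMPT = "You are BoxBunny AI Coach. Keep tips SHORT, plain text."  # abridged

llm = Llama(
    model_path="models/gemma-4-e2b-q4_k_m.gguf",  # illustrative path
    n_ctx=2048,        # system prompt + session stats + knowledge base
    n_gpu_layers=-1,   # full GPU offload; CPU fallback is too slow for tips
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Last 30 s: 12 jabs, 2 hooks, power fading."},
    ],
    temperature=0.7,   # variety without drifting off-topic
    max_tokens=128,    # one to two sentences per tip
)
print(out["choices"][0]["message"]["content"])
```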

A standalone PySide6 chat GUI (llm_chat_gui.py) was built for iterative prompt tuning. It supports streaming token display, model selection, configurable system prompts, and sentence completion detection, used extensively to refine coaching quality before deployment. The interface is shown below.

Boxing knowledge base

The system prompt and parameters tell the model how to talk; the knowledge base tells it what to talk about. 17 documents are injected into the context during coaching analysis so the LLM's advice is grounded in real boxing expertise rather than general-purpose hallucination:

Category Docs Content
Techniques 5 Jab, cross, hooks, uppercuts, stance & guard
Defence 3 Blocking, slipping, footwork
Combinations 3 Beginner to advanced progressions
Training plans 2 8-week beginner & intermediate programmes
Other 4 Conditioning, common mistakes, coaching, FAQ

4. What if it fails?, reliability and fallback

Everything above describes the coach when it works. The harder problem is what happens when it doesn't, the LLM hangs under GPU contention, the model file is missing, generation runs out of memory mid-response. Four layers of defence ensure the user always sees something useful instead of an error or a dead UI:

Response quality
128-token cap prevents mid-sentence cut-offs, an explicit "keep tips SHORT" instruction is in the system prompt, and markdown is stripped from outputs to remove formatting artefacts.
Timeout protection
A 20-second hard timeout runs on a background thread, if inference hangs, the coach fails over to a fallback tip instead of blocking the rest of the system.
Fallback tips
65 pre-written tips across 4 categories (technique, encouragement, correction, suggestion) are served when the LLM is unavailable, the model file is missing, or GPU memory runs out.
Auto-recovery
The system tracks consecutive failures and attempts a model reload after 3 failures in a row, transient GPU contention recovers itself without intervention.
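The timeout and fallback layers compose naturally. A sketch using a worker thread and a two-entry stand-in for the 65-tip library (the auto-recovery counter is omitted):

```python
import concurrent.futures
import random

# Stand-in for the 65-tip library, keyed by the four categories.
FALLBACK_TIPS = {
    "technique": ["Keep your guard up between combinations."],
    "encouragement": ["Strong round. Hold that pace."],
}

# Module-level pool: a hung inference thread must not block the caller.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def coach_tip(generate_fn, timeout_s: float = 20.0) -> str:
    """Run LLM generation under a hard timeout; on any failure (hang,
    missing model, OOM mid-generation) serve a pre-written tip instead.
    generate_fn is any zero-argument callable returning the tip text."""
    future = _pool.submit(generate_fn)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        category = random.choice(list(FALLBACK_TIPS))
        return random.choice(FALLBACK_TIPS[category])
```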

Together the four steps above (constraint, model selection, voice, and fallback) turn a generic edge LLM into a domain-specific coach that streams useful tips to the user during real training sessions, on a single Jetson, with no internet, and degrades gracefully when anything goes wrong.

5.3.2.5 Session Management

RI-5, RI-8

Every other node on this page, fusion, inference, sparring, LLM, needs to know whether a training session is currently active, resting, or idle. Without a single authority for that, each node would have to guess from its own data, drift out of sync with the others, and either keep running compute it doesn't need (wasting GPU) or miss data it should be capturing. The session_manager node is the single source of truth: it owns the lifecycle, broadcasts the current state at 2 Hz as a heartbeat, and every other node treats that heartbeat as the truth.

The 5 states and what each one triggers

Every training session, regardless of mode (techniques, sparring, free, performance tests), flows through the same 5 states. Each transition isn't just a label change: it triggers coordinated side effects across the rest of the system:

1. Idle
Default state between sessions. CV frame rate drops to ~6 fps to free GPU headroom for the LLM. imu_node switches to navigation mode (pad hits become menu navigation, not punch events). No data is recorded.
2. Countdown
3-second pre-round window. The user can adjust the robot's height via the phone or GUI before the round begins, the audio cue plays, and the sparring engine and drill manager arm themselves but don't fire yet.
3. Active
CV ramps to full 30 fps, imu_node switches to training mode, punch_processor starts emitting confirmed punches, the sparring engine starts attacking, and the LLM coach begins emitting tips every ~18 s. All session statistics accumulate.
4. Rest
Between rounds. CV drops back to ~6 fps, the sparring engine pauses, the LLM coach gets the recent round summary as context for its next batch of tips. Round counter advances and the cycle returns to active for the next round.
5. Complete
Final round finished. The full session summary is written to the per-user SQLite database (see Section 5.3.3), the LLM coach is asked for its post-session analysis, and the GUI transitions to the results page. After a brief moment the state machine returns to idle.

Configurable cycle & session summary

The default cycle runs 3 rounds of 180 s work + 60 s rest, but the round count, work duration, and rest duration are all configurable per session via the SessionConfig message. The same 5-state machine drives every training mode, so the GUI, the dashboard, and every ROS node only ever have to handle one lifecycle.
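A Python mirror of what the SessionConfig message plausibly carries, using the defaults above; the field names are assumptions, not the message definition:

```python
from dataclasses import dataclass

@dataclass
class SessionConfig:
    """Assumed mirror of the SessionConfig ROS message fields."""
    rounds: int = 3
    work_duration_s: int = 180
    rest_duration_s: int = 60
```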

When the session reaches the complete state, the session_manager writes a comprehensive summary to the user's database that powers every analytics view on the Dashboard.

This summary is what makes the per-user database useful. Without it, every analytics view, trend chart, and personal record on the dashboard would have nothing to read from.