5.3.2 Intelligent Behaviour & Control
All intelligent behaviour runs as a network of ROS 2 Humble nodes on the Jetson, communicating via custom message types and services. Together they form the decision-making layer between the CV perception pipeline (5.3.1) and the dashboard (5.3.3).
5.3.2.1 CV + IMU Sensor Fusion
The CV model on its own says "the user just threw a left hook". The IMU pad on its own says "something hit me hard". Neither is enough on its own: CV can hallucinate punches into thin air, and the IMU can't tell a hook from a cross. The punch_processor node fuses the two into confirmed punch events using three layers of logic: temporal matching, biomechanical filtering, and graceful fallback when one sensor fails.
1. Temporal matching, lining up CV and IMU in time
When an IMU pad impact exceeds 5.0 m/s² (after gravity calibration), the node queries the CV prediction ring buffer within a ±500 ms window centred on the impact timestamp. Frame-count voting selects the dominant prediction type within that window, and the result is emitted as a confirmed punch carrying both the CV confidence and the IMU force.
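The matching step can be sketched as follows. This is a minimal illustration of the ±500 ms window plus frame-count voting described above; the function name, buffer layout, and tuple format are illustrative, not the actual punch_processor internals:

```python
from collections import Counter

WINDOW_MS = 500  # +/-500 ms matching window around the IMU impact

def match_punch(impact_ts_ms, impact_force, cv_buffer):
    """Fuse one IMU impact with buffered CV predictions.

    cv_buffer: iterable of (timestamp_ms, punch_type, confidence).
    Returns (punch_type, cv_confidence, imu_force), or None if no CV
    frames fall inside the window.
    """
    in_window = [(t, p, c) for (t, p, c) in cv_buffer
                 if abs(t - impact_ts_ms) <= WINDOW_MS]
    if not in_window:
        return None
    # Frame-count voting: the punch type seen in the most frames wins.
    votes = Counter(p for (_, p, _) in in_window)
    punch_type, _ = votes.most_common(1)[0]
    # Carry the strongest CV confidence observed for the winning type.
    best_conf = max(c for (_, p, c) in in_window if p == punch_type)
    return (punch_type, best_conf, impact_force)
```

For example, an impact at t = 1000 ms with two buffered "jab" frames and one "cross" frame resolves to a confirmed jab carrying the jab's best confidence and the IMU force.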
2. Pad-constraint filtering, rejecting biomechanically impossible matches
Even with a clean temporal match, some CV predictions are physically impossible: a left hook cannot land on the right pad. Each pad therefore accepts only the punch types that could biomechanically reach it:
- Centre pad: jab and cross only
- Left pad: left hook and left uppercut only
- Right pad: right hook and right uppercut only
- Head pad: any punch type
If the primary CV prediction violates this constraint, secondary predictions are checked against the same rules with a minimum confidence threshold of 0.25; this is what catches the model's retraction-phase wobble that 5.3.4.8 Limitations refers to.
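The constraint table and the secondary-prediction fallback can be sketched together. Pad names, punch labels, and the function shape are illustrative; only the per-pad constraints and the 0.25 secondary floor come from the text:

```python
PAD_CONSTRAINTS = {
    "centre": {"jab", "cross"},
    "left":   {"left_hook", "left_uppercut"},
    "right":  {"right_hook", "right_uppercut"},
    "head":   {"jab", "cross", "left_hook", "right_hook",
               "left_uppercut", "right_uppercut"},  # head accepts any punch
}
SECONDARY_MIN_CONF = 0.25  # floor applied to non-primary predictions

def filter_by_pad(pad, predictions):
    """predictions: list of (punch_type, confidence), best first.
    Returns the first biomechanically possible prediction for this pad,
    applying the 0.25 confidence floor to secondary predictions."""
    allowed = PAD_CONSTRAINTS[pad]
    for rank, (punch, conf) in enumerate(predictions):
        if punch in allowed and (rank == 0 or conf >= SECONDARY_MIN_CONF):
            return punch, conf
    return None  # no plausible match; fall through to sensor fallbacks
```

A "left hook" primary on the right pad is rejected, and a lower-confidence "right hook" secondary at 0.4 is accepted in its place.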
3. Fallbacks, graceful degradation when one sensor fails
Both layers above assume CV and IMU are both producing usable data. When one of them fails (the camera obscured, a pad disconnected), the node still emits a best-effort punch event so downstream nodes never starve:
- CV-only mode: ≥3 consecutive frames at ≥0.6 confidence, emitted with a 0.3 penalty
- IMU-only mode: default 0.3 confidence, punch type inferred from pad location
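The two degraded modes can be sketched as below. The thresholds come from the list above; the pad-to-punch guess table and the choice to apply the 0.3 penalty as a subtraction from the weakest frame's confidence are assumptions for illustration:

```python
CV_ONLY_FRAMES = 3       # need >=3 consecutive agreeing frames
CV_ONLY_MIN_CONF = 0.6   # each at >=0.6 confidence
CV_ONLY_PENALTY = 0.3    # confidence penalty when the IMU is silent (assumed subtractive)
IMU_ONLY_CONF = 0.3      # default confidence for IMU-only events

# Hypothetical pad -> punch-type guess used in IMU-only mode.
PAD_DEFAULT_PUNCH = {"centre": "jab", "left": "left_hook",
                     "right": "right_hook", "head": "jab"}

def cv_only_event(recent_frames):
    """recent_frames: list of (punch_type, confidence), newest last."""
    tail = recent_frames[-CV_ONLY_FRAMES:]
    if len(tail) < CV_ONLY_FRAMES:
        return None
    punch = tail[-1][0]
    if all(p == punch and c >= CV_ONLY_MIN_CONF for p, c in tail):
        conf = min(c for _, c in tail) - CV_ONLY_PENALTY
        return punch, round(conf, 3)
    return None

def imu_only_event(pad):
    """No CV available: infer punch type from the pad that was hit."""
    return PAD_DEFAULT_PUNCH[pad], IMU_ONLY_CONF
```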
The pipeline diagram below ties everything together: CV predictions and IMU pad impacts both flow into the punch_processor's ring buffer, are matched within the ±500 ms window, pass through the pad-constraint filter, and finally emerge as a confirmed punch event published to the rest of the system.
5.3.2.2 Real-Time Inference Pipeline
Getting the action model from "works on a recorded clip" to "works live on the robot" required four things in sequence: a frame rate the model could trust, enough per-frame compute to actually hit it, output that didn't flicker between classes, and a resource budget that left room for the LLM coach to run on the same GPU. Each subsection below tackles one of those four problems, in order.
1. Frame rate, why 30fps matters
The action prediction model was trained on features extracted at a consistent 30fps. Its voxel temporal deltas use 2-frame (67ms) and 8-frame (267ms) lookbacks; if the actual frame rate drops below 30fps, those deltas span longer real-world intervals than the model learned, and it misclassifies punches. The action model's accuracy is therefore directly tied to maintaining 30fps: even a small drop degrades predictions significantly.
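The timing drift is easy to quantify. The following sketch simply converts the fixed 2-frame and 8-frame lookbacks into real-world milliseconds at a few capture rates:

```python
# Real-world time spanned by the model's fixed frame lookbacks.
# At 30 fps they match training (67 ms / 267 ms); below that they stretch.
for fps in (30, 24, 20):
    dt_ms = 1000 / fps  # time per frame
    print(f"{fps} fps: 2-frame delta = {2 * dt_ms:.0f} ms, "
          f"8-frame delta = {8 * dt_ms:.0f} ms")
```

At 20 fps the 8-frame lookback stretches from 267 ms to 400 ms, so the model is effectively shown a punch at the wrong playback speed.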
2. Per-frame compute, the YOLO bottleneck
Knowing 30fps was non-negotiable, the next question was whether the pipeline could actually hit it on the Jetson. The answer at first was no, and the bottleneck was a single component:
Problem:
the single biggest deployment challenge was YOLO Pose. In its default PyTorch
(.pt) format, YOLO alone consumed most of the 33 ms per-frame
budget, dragging the whole pipeline below 30 fps. Since YOLO provides the pose
features the action model depends on, the action model was being fed
features at inconsistent, lower frame rates: accurate offline, broken live.
Fix:
convert YOLO to a TensorRT engine (.engine format,
FP16 precision; half-precision 16-bit floats run roughly twice as
fast on GPU as the default 32-bit format with negligible accuracy loss). TensorRT
optimises the computation graph for the Jetson's GPU, fusing operations and
eliminating overhead. This brought YOLO inference down from
~33 ms to ~16 ms, comfortably under the frame budget.
Result: the action model became accurate and reliable in live deployment immediately, confirming the accuracy issue was never the model, only the frame rate feeding it.
Why does this speedup matter?
Both models run in series every frame: YOLO extracts the pose features the action model immediately consumes, and together they have to fit inside the 33 ms frame budget to hold 30 fps. Because the action model is small (~1.75M parameters) and was already running at ~8 ms, the entire budget effectively came down to how fast YOLO could run. That makes the TensorRT speedup the deciding factor between a pipeline that holds 30 fps and one that doesn't.
The chart below shows how that single conversion turns a frame that missed the 33 ms budget into one that fits with headroom to spare:
What does TensorRT actually do?
Conversion is more than a file rename. When TensorRT compiles a model for a specific Jetson, it does four things: fuses adjacent layers into single GPU kernels, auto-selects the fastest kernel implementations for that exact GPU, converts weights and activations to FP16, and pre-plans tensor memory so buffers are reused rather than reallocated.
The result is the same model mathematically, running with roughly half the memory bandwidth and a fraction of the kernel launch overhead, exactly what the inference pipeline needed to sustain 30 fps.
Three more compute-side issues
The TensorRT speedup got the per-frame budget under control, but three secondary issues stood between offline accuracy and live performance:
Final per-frame budget
With the TensorRT speedup and the three compute fixes in place, here is what a single frame actually costs end-to-end:
| Component | Latency | Notes |
|---|---|---|
| YOLO Pose (TensorRT) | ~16ms | On GPU, single frame |
| Depth → voxel extraction | ~10ms | Background subtraction + voxelisation |
| YOLO + voxel in parallel | ~16ms | Separate threads; limited by the slower of the two |
| Model forward pass | ~8ms | ~1.75M params, d=192 |
| Total per frame | ~24ms | 42fps theoretical, 30fps practical |
The 24ms-per-frame inference budget is the CV pipeline only. The full end-to-end latency
from camera capture to motor command is ~120ms, the remaining ~96ms covers
RealSense capture buffering, ROS 2 message passing through punch_processor,
the ±500ms fusion window's earliest-match logic, and the motor command queue.
This is verified against requirement RI‑2 (≤150ms) in
5.3.4.7.
3. Output stability, cleaning up noisy predictions
A stable 30fps gets the model the inputs it expects, but the outputs still need work. Even on perfectly clean frames, the model's per-frame top class can flicker between two similar punches for a few frames at a time. Publishing every raw prediction would generate spurious events, so three filtering stages stand between the model and the rest of the system:
- EMA smoothing, α = 0.35
- An exponential moving average over per-class confidences damps out frame-to-frame jitter so a single noisy frame can't flip the prediction.
- Hysteresis gate, 12% confidence margin
- The new top class must beat the current top class by 12% confidence before the prediction is allowed to switch; this prevents two close classes from oscillating.
- State machine, 2 frames to enter, 3 to hold, ≥0.78 sustain (block: 4)
- A class must hold for at least 2 frames at ≥0.78 confidence before it is committed, and 3 more frames to stay committed. Block requires 4 consecutive frames because of its biomechanical similarity to uppercut onset.
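The first two stages can be sketched together; this is a minimal illustration of EMA smoothing plus the hysteresis gate using the α = 0.35 and 12% values above (the 2/3-frame state machine is omitted for brevity, and the class structure is illustrative):

```python
EMA_ALPHA = 0.35          # smoothing factor from the text
HYSTERESIS_MARGIN = 0.12  # new class must beat the incumbent by 12%

class PredictionFilter:
    """EMA-smoothed per-class confidences with a hysteresis gate."""

    def __init__(self):
        self.ema = {}        # class -> smoothed confidence
        self.current = None  # committed top class

    def update(self, confidences):
        """confidences: dict of class -> raw per-frame confidence.
        Returns the (stable) committed class after this frame."""
        for cls, c in confidences.items():
            prev = self.ema.get(cls, c)  # seed EMA with first observation
            self.ema[cls] = EMA_ALPHA * c + (1 - EMA_ALPHA) * prev
        top = max(self.ema, key=self.ema.get)
        if self.current is None:
            self.current = top
        elif top != self.current:
            # Hysteresis: only switch on a clear margin over the incumbent.
            if self.ema[top] - self.ema[self.current] >= HYSTERESIS_MARGIN:
                self.current = top
        return self.current
```

A single noisy frame where "cross" narrowly outscores "jab" cannot flip the committed class: the EMA damps the spike and the 12% margin gate rejects the switch.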
4. Resource sharing, coexisting with the LLM
Frame rate, compute, and output stability all assume the CV pipeline owns the GPU. In practice it shares the Jetson with the LLM coach, the touchscreen GUI, and the ROS messaging layer. Two mechanisms keep them out of each other's way:
Adaptive frame rate. The cv_node runs at the full
30 fps only during active training sessions. Outside active rounds
(idle, countdown, rest periods) it drops to ~6 fps, freeing the
GPU for LLM inference exactly when the user is reading a coaching tip rather than
throwing punches.
Shared-memory footprint. The Jetson Orin NX has 16 GB of unified memory shared between CPU and GPU, so every model on the Jetson is drawing from the same pool. Knowing the exact footprint of each component matters because anything that exceeds the pool starts paging to disk and the real-time loop collapses. The footprint below has been measured under live load:
| Component | Memory |
|---|---|
| Action model (TensorRT) | ~200MB |
| YOLO Pose (TensorRT) | ~150MB |
| Gemma 4 E2B LLM | ~3.1GB |
| Frame buffers | ~100MB |
| Total | ~3.6GB / 16GB |
With only ~3.6 GB of the 16 GB pool in use, capacity is clearly not the constraint. LLM speed is instead limited by memory bandwidth (how fast data moves), not capacity (how much fits): every token generated requires reading the full model weights once.
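That bandwidth bound can be turned into a rough ceiling on decode speed. Assuming the Orin NX 16 GB's published ~102 GB/s memory bandwidth (an assumption, not measured on this unit) and a ~3.1 GB quantised model:

```python
# Rough upper bound for a memory-bandwidth-bound LLM: each generated
# token must stream the full quantised weights through memory once.
BANDWIDTH_GBS = 102.4   # Jetson Orin NX 16GB spec sheet value (assumption)
MODEL_GB = 3.1          # Gemma 4 E2B at Q4_K_M quantisation

max_tok_per_s = BANDWIDTH_GBS / MODEL_GB
print(f"theoretical ceiling: ~{max_tok_per_s:.0f} tok/s")
# The observed 16-20 tok/s sits under this ceiling because the CV
# pipeline shares the same bandwidth and kernel overheads are nonzero.
```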
The system originally ran Qwen2.5-3B-Instruct (~2 GB, text-only). Within months, Google released Gemma 4 E2B (April 2026), an edge-optimised model with a Per-Layer Embedding architecture that packs 5.1B total parameters into just 2.3B active parameters per inference. Despite being a more capable model with multimodal support (text + image), it fits in ~3.1 GB and runs at 16–20 tok/s on the same Jetson, fast enough for real-time coaching tips. The rapid pace of edge LLM development means this constraint continues to relax with each generation of models.
With frame rate, compute, output stability, and resource sharing all handled, the action model now runs reliably in real time, the foundation that the rest of this section builds on top of. Licensing implications of the tech stack (including YOLO's AGPL-3.0 licence) are addressed on the landing page.
5.3.2.3 AI Sparring Engine
The 5 sparring styles and their Markov chain combo generation are introduced in
5.1 GUI. This section covers the ROS-side execution:
how sparring_engine turns those style definitions into live robot behaviour.
Five jobs run in a loop every round:
- Pace attacks at the right intensity (difficulty scaling)
- Adapt to the user's weak areas (weakness tracking)
- Stay safe when something else in the system goes wrong (engine safeguards)
- Read the user's response to each strike (defence detection)
- Reactive mode for unstructured practice (free training)
1. Pacing attacks, difficulty scaling
| Parameter | Easy | Medium | Hard |
|---|---|---|---|
| Attack interval | 2.0s | 1.2s | 0.7s |
| Counter-punch probability | 30% | 50% | 80% |
| Arm speed | 8 rad/s | 15 rad/s | 25 rad/s |
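The table above maps naturally onto a small preset structure. The dictionary layout and helper below are illustrative, not the sparring_engine's actual data model; only the numbers come from the table:

```python
import random

# Difficulty presets from the scaling table, keyed by level.
DIFFICULTY = {
    "easy":   {"interval_s": 2.0, "counter_p": 0.30, "speed_rad_s": 8},
    "medium": {"interval_s": 1.2, "counter_p": 0.50, "speed_rad_s": 15},
    "hard":   {"interval_s": 0.7, "counter_p": 0.80, "speed_rad_s": 25},
}

def should_counter(level, rng=random):
    """Roll the counter-punch probability for the given difficulty."""
    return rng.random() < DIFFICULTY[level]["counter_p"]
```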
2. Adapting to the user, weakness tracking
Difficulty levels alone make sparring harder; they don't make it smarter.
The sparring_weakness_profile database table records, for every punch
type, how often the user successfully blocked it versus how often they were hit.
These stats persist across sessions, so returning to the robot the next day
continues from where the user left off.
Over time the robot stops feeding the user openings they already handle well and concentrates on the gaps, forcing continuous improvement instead of comfortable repetition.
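One way to realise that bias is weighted sampling over the block/hit stats. The weighting scheme below (hit rate plus a small floor so no punch disappears entirely) is an illustrative sketch, not the actual sparring_engine algorithm:

```python
import random

def weakness_weights(profile):
    """profile: dict punch_type -> (times_blocked, times_hit).
    Higher weight = the user defends this punch worse."""
    weights = {}
    for punch, (blocked, hit) in profile.items():
        total = blocked + hit
        hit_rate = hit / total if total else 0.5  # unseen punch -> neutral
        weights[punch] = 0.1 + hit_rate           # floor keeps some variety
    return weights

def pick_target_punch(profile, rng=random):
    """Sample the next attack, biased toward the user's weak spots."""
    w = weakness_weights(profile)
    punches = list(w)
    return rng.choices(punches, weights=[w[p] for p in punches])[0]
```

A user who blocks jabs 8/10 times but eats left hooks 9/10 times sees left hooks weighted more than three times as heavily as jabs.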
3. Staying safe, engine safeguards
A relentless attack engine that doesn't know when to stop is dangerous, both to the user and to the motors. Four runtime safeguards keep the engine safe and well-behaved when something elsewhere in the system goes wrong:
- robot_busy flag: blocks any new RobotCommand while the previous strike is still executing, so motor commands never pile up.
- SessionState heartbeat: the engine listens for the session_manager's heartbeat (2 Hz) and deactivates itself if none arrives for 5 seconds, so a crashed session_manager never leaves the robot free-swinging.
4. Reading the user's response, defence detection
The weakness profile from #2 only works if the engine actually knows whether the user
defended a strike or got hit by it. Every time the robot throws a punch, a
500 ms detection window opens and the punch_processor classifies
the user's response using a priority-based system: physical contact wins over
visual cues, but visual cues catch the cases where the user dodged cleanly:
| Priority | Response | Detection Method |
|---|---|---|
| 1 | Hit | Arm IMU detects physical contact, the robot's punch landed |
| 2 | Block | CV model detects block pose (confidence ≥ 0.3) |
| 3 | Slip | Large displacement, user's bounding box moved ≥40px laterally or ≥0.15m in depth |
| 4 | Dodge | Moderate displacement, ≥20px laterally or ≥0.08m in depth |
| 5 | Unknown | No detectable response within the 500ms window |
Defence results feed into the sparring analytics (defence rate, breakdown by type) and the weakness tracking system, which uses them to target the user's weakest areas in future sessions.
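The priority ordering in the table collapses into a simple first-match classifier. The function signature and argument names are illustrative; the thresholds are those from the table:

```python
def classify_defence(arm_impact, block_conf, dx_px, dz_m):
    """Priority-ordered defence classification inside the 500 ms window.

    arm_impact: robot's arm IMU registered physical contact (punch landed).
    block_conf: CV block-pose confidence.
    dx_px / dz_m: user's lateral (pixels) and depth (metres) displacement.
    """
    if arm_impact:
        return "hit"                       # 1: physical contact wins
    if block_conf >= 0.3:
        return "block"                     # 2: CV detects a block pose
    if dx_px >= 40 or dz_m >= 0.15:
        return "slip"                      # 3: large displacement
    if dx_px >= 20 or dz_m >= 0.08:
        return "dodge"                     # 4: moderate displacement
    return "unknown"                       # 5: no detectable response
```

Note that contact is checked first even if the user also moved: a punch that lands is a hit regardless of what the CV saw.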
5. Reactive mode, free training
Sparring mode is structured: the robot drives, the user defends. Free training flips the loop: the user drives, the robot reacts. Whenever the user hits a pad, the robot fires back a contextually appropriate counter-punch instead of following a scripted sequence. This is the simplest engine in the section because there is no Markov chain, no weakness profile, and no difficulty scaling, just a pad-to-counter mapping:
| User Hits | Robot Responds With |
|---|---|
| Centre pad | Jab or Cross |
| Left pad | Left Hook or Left Uppercut |
| Right pad | Right Hook or Right Uppercut |
| Head pad | Jab or Cross |
A 300ms cooldown prevents rapid-fire counters, and the robot returns to guard position after 5 seconds of inactivity.
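The whole reactive loop fits in a few lines. This sketch combines the pad-to-counter table with the 300 ms cooldown; the class shape and the injectable clock are illustrative conveniences:

```python
import random
import time

COUNTERS = {
    "centre": ("jab", "cross"),
    "left":   ("left_hook", "left_uppercut"),
    "right":  ("right_hook", "right_uppercut"),
    "head":   ("jab", "cross"),
}
COOLDOWN_S = 0.3  # 300 ms between counters

class FreeTrainingEngine:
    def __init__(self):
        self._last = -COOLDOWN_S  # allow an immediate first counter

    def on_pad_hit(self, pad, now=None, rng=random):
        """Return a counter-punch type, or None while cooling down."""
        now = time.monotonic() if now is None else now
        if now - self._last < COOLDOWN_S:
            return None
        self._last = now
        return rng.choice(COUNTERS[pad])
```

The guard-return-after-5-seconds behaviour would sit in the engine's main loop rather than in this event handler, so it is omitted here.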
5.3.2.4 LLM Coach
A local AI coach runs entirely on the Jetson, no cloud, no internet. Building one that's actually useful inside a real-time boxing system means answering four questions in order: why local at all, which model fits, how to make it sound like a coach, and what happens when it fails. The four subsections below tackle each in turn.
1. Why local? The constraint that shapes everything else
Cloud-based AI coaching would add network latency and require stable internet, neither of which is guaranteed in a gym environment. Running the LLM on-device addresses three problems at once:
Committing to local raises the obvious follow-up: which model is small enough to fit alongside the CV pipeline on the Jetson but capable enough to actually coach?
2. Which model fits? Selection & deployment
| Aspect | Details |
|---|---|
| Model | Gemma 4 E2B (Google, April 2026), edge-optimised with Per-Layer Embeddings, 5.1B total / 2.3B active parameters, multimodal (text + image) |
| Quantisation | GGUF Q4_K_M (~3.1GB), fits alongside CV models in shared GPU memory |
| Runtime | llama-cpp-python with full GPU offload |
| Response time | 0.5–2 seconds per response |
| Context window | 2,048 tokens |
| Generation limit | 128 tokens for real-time tips; 100/256/512 tokens for chat (short/normal/detailed) |
When does the coach actually speak?
Picking Gemma 4 E2B answers the “what runs” question, but a model loaded in memory is just an idle model. The engine wakes the LLM in three different situations, each with its own context payload and output style:
| Mode | When | What It Does |
|---|---|---|
| Real-time tips | Every ~18 seconds during active rounds | 1–2 sentence coaching tips based on recent punches, combo progress, and session stats. Displayed on the GUI's CoachTipBar widget. |
| Post-session analysis | After session completes | 2–3 paragraph detailed analysis referencing specific metrics (punch count, power trends, weak areas). Can suggest specific drills via structured tags. |
| Interactive chat | Anytime via phone dashboard | Full conversation with the AI coach. Users can ask technique questions, request training plans, or get explanations of their stats. Word-by-word streaming display. |
The system also includes personality modes to keep interactions engaging: an advice mode for motivational pep talks and encouragement, and a Singlish mode that responds in Singaporean colloquial English, adding local flavour to the coaching experience. These modes are selectable via the chat interface and demonstrate the flexibility of the system-prompt approach.
3. Sounding like a coach, prompt, parameters, and knowledge base
A general-purpose LLM doesn't know boxing out of the box and doesn't sound like a coach unless you tell it to. Three pieces of configuration shape its voice and its domain expertise: a system prompt that defines its persona, a tuned parameter set that balances quality against the real-time loop, and a curated knowledge base that grounds its advice in real boxing technique.
System prompt
A 24-line embedded system prompt instructs the model to act as “BoxBunny AI Coach” with knowledge of all 6 punch types and their codes, 5 boxing styles (European, Russian, American, Cuban, hybrid), and AIBA methodology (AIBA, 2015). The prompt enforces plain text output (no markdown formatting) and adjusts advice depth to the user's skill level (beginner / intermediate / advanced). Different system prompt keys are used for each of the three coaching modes above.
Inference parameters
Each parameter below was tuned during development to balance response quality, speed, and the real-time constraint that the LLM cannot block the rest of the system:
| Parameter | Value | Why This Value |
|---|---|---|
| Temperature | 0.7 | Balances variety with coherence; lower values produced repetitive tips, higher values caused off-topic responses |
| Max tokens | 128 | Keeps real-time tips to 1–2 sentences, prevents mid-sentence cut-offs within the generation budget |
| Context window | 2,048 | Large enough to fit system prompt + session stats + knowledge base context without exceeding memory |
| GPU layers | All (-1) | Full GPU offload for fastest inference, CPU fallback too slow for real-time tips |
| Inference timeout | 20s | If inference hangs (GPU contention, model issue), the system serves a fallback tip instead of waiting indefinitely |
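The table maps directly onto llama-cpp-python's configuration. The parameter dictionaries below use the documented `Llama` and `create_chat_completion` keywords; the model filename and the commented call are illustrative, not the project's actual files:

```python
# Load-time parameters (passed to llama_cpp.Llama).
LLM_PARAMS = dict(
    n_ctx=2048,       # context window from the table
    n_gpu_layers=-1,  # -1 = offload all layers to GPU
)

# Per-request generation parameters (passed to create_chat_completion).
GEN_PARAMS = dict(
    temperature=0.7,
    max_tokens=128,   # real-time tip budget; chat modes raise this
)

# Sketch of the call site (model path is illustrative):
# from llama_cpp import Llama
# llm = Llama(model_path="gemma-4-e2b.Q4_K_M.gguf", **LLM_PARAMS)
# out = llm.create_chat_completion(messages=[...], **GEN_PARAMS)
```

The 20 s inference timeout is not an llama-cpp-python parameter; it would be enforced by the caller around the generation call.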
A standalone PySide6 chat GUI (llm_chat_gui.py) was built for iterative
prompt tuning. It supports streaming token display, model selection, configurable
system prompts, and sentence completion detection, used extensively to refine
coaching quality before deployment. The interface is shown below.
Boxing knowledge base
The system prompt and parameters tell the model how to talk; the knowledge base tells it what to talk about. 17 documents are injected into the context during coaching analysis so the LLM's advice is grounded in real boxing expertise rather than general-purpose hallucination:
| Category | Docs | Content |
|---|---|---|
| Techniques | 5 | Jab, cross, hooks, uppercuts, stance & guard |
| Defence | 3 | Blocking, slipping, footwork |
| Combinations | 3 | Beginner to advanced progressions |
| Training plans | 2 | 8-week beginner & intermediate programmes |
| Other | 4 | Conditioning, common mistakes, coaching, FAQ |
4. What if it fails?, reliability and fallback
Everything above describes the coach when it works. The harder problem is what happens when it doesn't: the LLM hangs under GPU contention, the model file is missing, or generation runs out of memory mid-response. Four layers of defence ensure the user always sees something useful instead of an error or a dead UI:
Together, the four steps above (constraint, model selection, voice, and fallback) turn a generic edge LLM into a domain-specific coach that streams useful tips to the user during real training sessions, on a single Jetson, with no internet, and degrades gracefully when anything goes wrong.
5.3.2.5 Session Management
Every other node on this page, fusion, inference, sparring, LLM, needs to know
whether a training session is currently active, resting, or idle. Without a single
authority for that, each node would have to guess from its own data, drift out of
sync with the others, and either keep running compute it doesn't need (wasting GPU)
or miss data it should be capturing. The session_manager node is the
single source of truth: it owns the lifecycle, broadcasts the current state at 2 Hz
as a heartbeat, and every other node treats that heartbeat as the truth.
The 5 states and what each one triggers
Every training session, regardless of mode (techniques, sparring, free, performance tests), flows through the same 5 states. Each transition isn't just a label change; it triggers coordinated side effects across the rest of the system:
- Idle: imu_node switches to navigation mode (pad hits become menu navigation, not punch events). No data is recorded.
- Active round: imu_node switches to training mode, punch_processor starts emitting confirmed punches, the sparring engine starts attacking, and the LLM coach begins emitting tips every ~18 s. All session statistics accumulate.
Configurable cycle & session summary
The default cycle runs 3 rounds of 180 s work + 60 s rest, but
the round count, work duration, and rest duration are all configurable per session via
the SessionConfig message. The same 5-state machine drives every training
mode, so the GUI, the dashboard, and every ROS node only ever have to handle one
lifecycle.
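The default cycle's total length falls out of a little arithmetic. The helper below is illustrative (its field names only conceptually mirror the SessionConfig message), and it assumes no rest period follows the final round:

```python
def session_length_s(rounds=3, work_s=180, rest_s=60):
    """Total session time for the configurable work/rest cycle.
    Assumes the final round is not followed by a rest period."""
    return rounds * work_s + (rounds - 1) * rest_s

print(session_length_s())  # default 3 x 180 s work + 2 x 60 s rest
```

The default cycle therefore runs 660 s, i.e. 11 minutes of wall-clock time per session.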
When the session reaches the complete state, the
session_manager writes a comprehensive summary to the user's database that
powers every analytics view on the Dashboard:
- Punch metrics: total count, distribution by punch type, peak and average force, punches per minute
- Defence metrics: defence rate, breakdown by defence type (block / slip / dodge / hit), reaction times
- Movement metrics: max lateral and depth displacement, movement timeline, person direction changes
- Coaching context: the round-by-round LLM tips emitted during the session and the post-session analysis text
This summary is what makes the per-user database useful. Without it, every analytics view, trend chart, and personal record on the dashboard would have nothing to read from.