Appendix 7: Robot Intelligence
A. Detailed Feature Exploration, 9 Iterations
Section 5.3.1.2 summarises the development journey in a table. This appendix provides the full technical detail for each iteration: the architectures tested, the feature dimensions, and the precise failure modes that motivated each transition.
Feature Visualisation Comparison
Each video below shows the same boxing footage processed through the corresponding feature extraction method, illustrating the evolution from raw pixels to the final voxel+pose representation.
| # | Approach | Features & Architecture | Outcome |
|---|---|---|---|
| 1 | 2D Pose + LSTM/GCN | 17 COCO keypoints via YOLO-Pose, 12-channel skeleton features. Models: LSTM, AAGCN (~2.5M params), ST-GCN, ST-GCN++ | Pipeline validated. Training data angles did not match robot camera. |
| 2 | 3D Pose (depth lifting) | 2D keypoints lifted to 3D via depth maps with skeleton constraints. Models: AAGCN, MotionBERT, PoseC3D | ~67%. Glove occlusion blocks wrists at punch extension. |
| 3 | Raw RGBD + 3D CNNs | 4-channel input (RGB + depth), 112×112 crops. Models: Swin3D-B (~88M), MViT V2, R3D-18 (~33M) | 33–88M params, too large for edge. Batch size 2 on Colab. |
| 4 | RGBD + Segmentation + Causal Transformer | YOLO segmentation + frozen EfficientNet-B2 (512-dim). CausalActionTransformer: 6 layers, 8 heads, ~3M trainable / ~35M total. Progressive prediction loss at all timesteps. | Still pixel-based, still required Colab. |
| 5 | Z-Mask (depth grid) | 1,156 dims/frame: 12×12 depth grid, RGB grid, centre of mass, velocity derivatives. BoxingCAT: 1D conv + 4 causal transformer layers + future prediction head. | 2D grid lost 3D spatial info. Hooks vs uppercuts indistinguishable. |
| 6 | Voxel + Colour Tracking | 1,749 dims/frame: 12³ voxel delta + optical flow + glove colour tracking (18 dims). VoxelFlowModel (~1.2M): Conv3D + MLP branches, concatenation fusion. | ~96%. Required coloured gloves. HSV tracking broke when similar colours appeared in the frame (background props, clothing). Insight: voxels need hand info. |
| 7 | Voxel-Only Transformer | 55,296 dims/frame: 24³ grid × 4 channels (delta + directional gradients). CausalVoxelTransformerModel (~2M): Conv3D stem, 4-layer causal transformer. | 80.6%. 19/56 jabs → cross (34% L/R error). Insight: need pose for L/R. |
| 7b | Voxel Optimisation | Grid 24³ → 12³ (55,296 → 17,280 dims). Multi-scale deltas (67ms + 267ms). Channel_p90 normalisation. | Real-time capable. Established 12³ grid for all subsequent iterations. |
| 8 | Voxel + Pose Fusion | 3,480 dims/frame: 12³ × 2ch voxel (3,456) + 24-dim YOLO pose. Gradients removed (5× reduction). Confidence gating, pose dropout (30% clip + 20% joint). | 97.3%. Breakthrough: L/R confusion eliminated. This also unlocked glove-agnostic deployment: YOLO Pose locates wrists visually, so the model works with any boxing gloves (or even bare hands), without iteration 6's requirement for red/green colour markers. Full circle: pose returned. |
| 9 | Production (v11) | Pose expanded to 42 dims. Block annotation fix. Horizontal flip augmentation. TensorRT FP16, parallel threads, RollingFeatureBuffer. | 96.8%, Deployed. Correct live block detection. |
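The pose dropout listed for iteration 8 (30% clip + 20% joint) could look roughly like the sketch below. It is an illustration rather than the project's training code: the function name, the `(frames, joints, dims)` array layout, and the reading of "30% clip" as zeroing the pose for a whole clip with probability 0.3 are all assumptions.

```python
import numpy as np


def pose_dropout(pose, rng, p_clip=0.3, p_joint=0.2):
    """Return an augmented copy of `pose`, shaped (frames, joints, dims).

    Assumed behaviour: with probability `p_clip` the pose of the whole clip
    is zeroed; otherwise each joint is zeroed independently with probability
    `p_joint`, for every frame of the clip.
    """
    out = pose.copy()
    if rng.random() < p_clip:
        out[:] = 0.0  # clip-level dropout: the model must rely on voxels alone
        return out
    dropped = rng.random(pose.shape[1]) < p_joint  # one decision per joint
    out[:, dropped, :] = 0.0
    return out


rng = np.random.default_rng(0)
clip = np.ones((12, 7, 2), dtype=np.float32)  # hypothetical 12-frame clip, 7 joints
augmented = pose_dropout(clip, rng)
```

Dropping pose during training in this way is one plausible route to a model that degrades gracefully when YOLO loses a wrist or the whole skeleton at inference time.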
B. Full Unit Test Breakdown
146 unit tests across 3 test files, supported by 12 shared pytest fixtures in conftest.py.
Shared Test Fixtures
| Fixture | Provides |
|---|---|
| db_manager | Fresh database per test |
| sample_user | Test account (testuser) |
| sample_profiles | Beginner / intermediate / coach |
| sample_session | Training session, 87 punches |
| sample_sparring_session | Sparring, 210 punches |
| sample_punch_events | CV + IMU + confirmed punches |
| sample_defense_data | Defense evaluation scenarios |

Complete Test Suite
| Test Class | Tests | What It Verifies |
|---|---|---|
| test_punch_fusion.py: 44 tests (sensor fusion) | | |
| TestRingBuffer | 7 | Buffer eviction at max capacity, time-based expiry, pop_match retrieval |
| TestCVIMUFusion | 3 | Temporal matching within/outside ±500ms window, closest-first priority |
| TestPadConstraints | 10 | Each punch type only accepted on valid pads (jab→centre, left_hook→left, etc.) |
| TestReclassification | 8 | Secondary prediction fallback when primary violates pad constraint |
| TestExpiredEvents | 4 | CV and IMU events correctly discarded after timeout |
| TestDefenseClassification | 8 | Priority ordering: HIT > BLOCK > SLIP > DODGE > UNKNOWN |
| TestSessionStats | 4 | Punch accumulation, peak force tracking, defense statistics |
| test_gamification.py: 30 tests (progression system) | | |
| TestRankSystem | 10 | All 6 rank tiers (Novice→Elite), XP boundary conditions, rank transitions |
| TestSessionXP | 7 | Base XP per mode, completion bonus, difficulty multiplier, streak bonus, minimum floor |
| TestSessionScore | 5 | Perfect, zero, and partial score calculations, boundary edge cases |
| TestAchievements | 5 | Unlock conditions, duplicate prevention, achievement listing |
| TestLeaderboard | 3 | Ranking computation, tie-breaking rules, empty leaderboard handling |
| test_database.py: 45+ tests (data persistence) | | |
| TestUserManagement | 13 | Create, duplicate prevention, password verify, get/list users, coach account type |
| TestPatternLock | 5 | Set/verify pattern (SHA-256), wrong pattern rejection, overwrite |
| TestGuestSessions | 5 | Create guest token, claim by registered user, double-claim prevention |
| TestPresets | 7 | CRUD operations, invalid field rejection, use count tracking, favourites |
| TestTrainingSessions | 8 | Save/retrieve sessions, mode filtering, event logging, config persistence |
| TestGamification | 7 | XP addition, rank progression, personal records, streaks, achievement persistence |
Table: Complete breakdown of 146 unit tests across 3 test files and 18 test classes
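The behaviour exercised by TestRingBuffer and TestCVIMUFusion (eviction at capacity, time-based expiry, and closest-first matching inside a ±500 ms window) can be sketched as follows. Only the name `pop_match` and the window size come from the table above; the class layout, the `Event` type, and the default capacity and maximum age are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds
    label: str


class RingBuffer:
    """Bounded buffer of events, assumed to be pushed in timestamp order."""

    def __init__(self, max_size=32, max_age_s=2.0):
        self._items = deque(maxlen=max_size)  # oldest item is evicted when full
        self.max_age_s = max_age_s

    def __len__(self):
        return len(self._items)

    def push(self, event):
        self._items.append(event)

    def expire(self, now):
        """Discard events older than max_age_s relative to `now`."""
        while self._items and now - self._items[0].timestamp > self.max_age_s:
            self._items.popleft()

    def pop_match(self, timestamp, window_s=0.5):
        """Remove and return the event closest to `timestamp` within the window."""
        candidates = [e for e in self._items if abs(e.timestamp - timestamp) <= window_s]
        if not candidates:
            return None
        best = min(candidates, key=lambda e: abs(e.timestamp - timestamp))
        self._items.remove(best)
        return best


cv_events = RingBuffer()
cv_events.push(Event(1.00, "jab"))
cv_events.push(Event(1.30, "cross"))
match = cv_events.pop_match(1.25)  # an IMU impact at t = 1.25 s pairs with the cross
```

Removing the matched event on retrieval keeps a single CV detection from being fused with two separate IMU impacts.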
C. Pose Feature Breakdown (42 dimensions)
Section 5.3.1.1 references a 42-dimensional pose feature vector extracted from YOLO Pose. This appendix itemises every dimension. All coordinates are expressed relative to the shoulder midpoint and divided by shoulder width, which makes them invariant to the person's position and size in the frame. If shoulder width < 15 pixels, the entire pose vector is zeroed (invalid detection).
Static Features (26 dims)
| Indices | Dims | Content | Purpose |
|---|---|---|---|
| [0:14] | 14 | Joint coordinates (x, y for 7 joints) | Where each joint is, shoulder-normalised |
| [14:21] | 7 | Joint confidence scores | YOLO detection reliability, used for confidence gating |
| [21:23] | 2 | Arm extension ratios | Wrist-to-shoulder distance / shoulder width per arm |
| [23] | 1 | Shoulder rotation | Body orientation (horizontal shoulder span ratio) |
| [24:26] | 2 | Elbow angles | Hook vs jab: 0 = fully bent, 1 = fully straight (arccos/π normalised) |
Velocity Features (16 dims)
| Indices | Dims | Content | Purpose |
|---|---|---|---|
| [0:14] | 14 | Joint velocities (dx, dy per joint) | Which hand is moving, in what direction |
| [14:16] | 2 | Arm extension rate | Positive = extending (punch), negative = retracting |
Seven upper-body COCO joints are used: Nose, L/R Shoulder, L/R Elbow, L/R Wrist.
The first frame of each clip has zero velocity, since there is no previous frame.
The same YOLO model (yolo26n-pose.pt) is used for both training extraction and live inference, so training noise matches deployment noise.
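A few of the features above can be sketched in NumPy to make the definitions concrete. This is a hedged illustration, not the extraction code itself: the function names are hypothetical, and shoulder width is assumed to be the Euclidean distance between the two shoulder keypoints.

```python
import numpy as np

MIN_SHOULDER_WIDTH_PX = 15.0  # below this, the detection is treated as invalid


def normalise_joints(joints_px, l_shoulder, r_shoulder):
    """Map (7, 2) pixel coordinates to shoulder-normalised coordinates.

    Returns zeros when the shoulders are too close together, mirroring the
    rule that an invalid detection zeroes the pose vector.
    """
    width = float(np.linalg.norm(l_shoulder - r_shoulder))  # assumed Euclidean
    if width < MIN_SHOULDER_WIDTH_PX:
        return np.zeros_like(joints_px, dtype=float)
    midpoint = (l_shoulder + r_shoulder) / 2.0
    return (joints_px - midpoint) / width


def elbow_angle(shoulder, elbow, wrist):
    """Angle at the elbow scaled to [0, 1]: 0 = fully bent, 1 = fully straight."""
    u, v = shoulder - elbow, wrist - elbow
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return 0.0
    cos = np.clip(np.dot(u, v) / denom, -1.0, 1.0)
    return float(np.arccos(cos) / np.pi)


def arm_extension(shoulder, wrist, shoulder_width):
    """Wrist-to-shoulder distance divided by shoulder width."""
    return float(np.linalg.norm(wrist - shoulder) / shoulder_width)


def joint_velocities(coords):
    """Frame-to-frame differences for (T, 7, 2) coordinates; first frame is zero."""
    vel = np.zeros_like(coords, dtype=float)
    vel[1:] = coords[1:] - coords[:-1]
    return vel
```

A straight arm (shoulder, elbow, and wrist collinear) gives an elbow feature of 1, while a right-angle hook gives 0.5, which is the separation the elbow-angle dimensions are meant to expose.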
D. Integration Challenges & Solutions
During integration, four specific challenges were encountered that required architectural solutions beyond what the individual components were designed for.
Solution: Single-owner architecture. The cv_node exclusively owns the camera and publishes frames to ROS topics. All other consumers subscribe to these topics. This also ensures a consistent frame rate for the action model.
Solution: Three synchronisation mechanisms:
- Shared SQLite database: GUI writes session data via session_manager, dashboard reads via API queries
- JSON file polling: Dashboard writes commands to /tmp/boxbunny_gui_command.json, GUI polls every 100ms
- WebSocket state buffering: Dashboard backend buffers last known state per user, sends to phone on reconnect
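The JSON file polling mechanism can be sketched with the standard library alone. The atomic write (temporary file plus `os.replace`) and the consume-on-read behaviour are assumptions about how such a channel is usually made safe, and the command fields are hypothetical; only the polling approach and the 100 ms interval come from the list above.

```python
import json
import os
import tempfile
from pathlib import Path


def write_command(path, command):
    """Dashboard side: write atomically so the poller never sees a partial file."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(command, handle)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def poll_command(path):
    """GUI side: return the pending command and delete the file, or None."""
    path = Path(path)
    try:
        with path.open() as handle:
            command = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    path.unlink(missing_ok=True)  # consume so the command runs only once
    return command


with tempfile.TemporaryDirectory() as workdir:
    command_file = Path(workdir) / "gui_command.json"
    write_command(command_file, {"action": "start_session", "mode": "sparring"})
    received = poll_command(command_file)  # the GUI would call this every 100 ms
```

In the GUI, a 100 ms timer would call `poll_command` and dispatch whatever it returns; a file channel like this avoids a direct socket between the dashboard and GUI processes.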
Solution: Five mechanisms ensure complete, clean output:
- Short max_tokens (128): limits response length so the model finishes within budget
- System prompt instruction: explicitly says “keep tips SHORT (1–2 sentences)”
- Inference timeout (20s): background thread with hard timeout prevents hangs
- Markdown stripping: _clean_markdown() removes formatting artifacts
- Sentence completion: if a response doesn’t end with ., !, or ?, 8–32 additional tokens are generated to finish the thought
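Three of these mechanisms (timeout, markdown stripping, and sentence completion) can be combined as in the sketch below. The `generate(prompt, max_tokens)` callable stands in for the real LLM call and is hypothetical, as are the function names and the regular expressions; the actual `_clean_markdown()` may strip more or less than this.

```python
import re
import threading


def clean_markdown(text):
    """Strip common markdown markers (headings, bullets, emphasis, backticks)."""
    text = re.sub(r"^\s*(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"\*+|`+|__", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ends_sentence(text):
    return text.rstrip().endswith((".", "!", "?"))


def run_with_timeout(fn, timeout_s=20.0):
    """Run fn in a background thread; return None if it does not finish in time.

    The worker is abandoned, not killed, so it must be safe to let it run on.
    """
    result = {}

    def worker():
        result["value"] = fn()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return result.get("value")


def generate_tip(generate, prompt, max_tokens=128, extra_tokens=32, timeout_s=20.0):
    raw = run_with_timeout(lambda: generate(prompt, max_tokens), timeout_s)
    if raw is None:
        return None
    if not ends_sentence(clean_markdown(raw)):
        # Ask for a short continuation so the tip finishes its thought.
        extra = run_with_timeout(lambda: generate(prompt + raw, extra_tokens), timeout_s)
        if extra:
            raw = raw + extra
    return clean_markdown(raw)
```

Cleaning after concatenation, rather than before, lets emphasis markers that span the original text and its continuation be removed together.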
E. Standalone Deployment Package
A self-contained action_prediction/ package for running live inference on the Jetson without depending on the training pipeline. All models auto-convert on first run.
Conversion Flow
.pth → .onnx → .trt (Action model, TensorRT FP16)
|
.pt → .engine (YOLO Pose, Ultralytics TensorRT FP16)
Engines are cached next to source files. Subsequent runs load instantly.
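The convert-once, load-afterwards behaviour can be sketched as a small helper that looks for the converted file next to its source. The helper name and the stub converters are hypothetical; in the real package the conversion steps would be the ONNX export and TensorRT engine build, which are not reproduced here.

```python
import tempfile
from pathlib import Path


def ensure_converted(source, target_suffix, convert):
    """Return the converted file next to `source`, converting only if missing."""
    source = Path(source)
    target = source.with_suffix(target_suffix)
    if not target.exists():
        convert(source, target)  # slow path, taken on first run only
    return target


calls = []


def stub_convert(src, dst):
    """Stand-in for a real exporter: records the call and writes a placeholder."""
    calls.append((src.name, dst.name))
    dst.write_bytes(b"converted from " + src.name.encode())


with tempfile.TemporaryDirectory() as workdir:
    weights = Path(workdir) / "action_model.pth"  # hypothetical file name
    weights.write_bytes(b"weights")
    for _ in range(2):  # the second pass finds both cached files
        onnx_path = ensure_converted(weights, ".onnx", stub_convert)
        engine_path = ensure_converted(onnx_path, ".trt", stub_convert)
```

This sketch checks only for existence; a stricter variant could also compare modification times so that an updated .pth file triggers a rebuild.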
Inference Latency (Jetson Orin NX, TensorRT FP16)
| Stage | Latency |
|---|---|
| YOLO Pose | ~16 ms |
| Action model | ~8 ms |
| Depth to voxel | ~8–10 ms |
| Total per frame | ~24 ms (YOLO Pose + action model; ~42 fps theoretical, 30 fps practical) |
Known Jetson Issues
| Issue | Workaround |
|---|---|
| D435i IMU not working: hid_sensor_hub kernel module missing from Tegra kernel | Use --camera-pitch flag to set tilt manually |
| YOLO TensorRT overlay draws at wrong coordinates when using .engine | Visualisation-only; pose features are extracted correctly |
| numpy 2.x crashes Jetson torch wheel | Pin to numpy>=1.26,<2.0 |
F. ROS Message & Service Catalogue
The 5.3 landing page references 21 custom message types and 6 services. This section lists every definition with its key fields.
Custom Messages (21)
| # | Message | Key Fields |
|---|---|---|
| 1 | PadImpact | timestamp, pad (left/centre/right/head), level (light/medium/hard), accel_magnitude |
| 2 | ArmStrike | timestamp, arm (left/right), contact (bool) |
| 3 | ArmStrikeEvent | timestamp, arm (left/right), contact (bool) |
| 4 | IMUStatus | left/centre/right/head pad connected flags, left/right arm connected, is_simulator |
| 5 | PunchEvent | timestamp, pad, level, force_normalized (0.33/0.66/1.0), accel_magnitude |
| 6 | NavCommand | timestamp, command (prev/next/enter/back) |
| 7 | PunchDetection | timestamp, punch_type, confidence, raw_class, consecutive_frames |
| 8 | PoseEstimate | timestamp, keypoints (COCO-17 flattened), movement_delta |
| 9 | UserTracking | timestamp, bbox_centre_x/y, bbox_top_y, bbox_width/height, depth, lateral/depth displacement, user_detected |
| 10 | ConfirmedPunch | timestamp, punch_type, pad, level, force_normalized, cv_confidence, imu_confirmed, cv_confirmed, accel_magnitude |
| 11 | DefenseEvent | timestamp, arm, robot_punch_code, struck, defense_type (block/slip/dodge/unknown) |
| 12 | SessionPunchSummary | total_punches, punch/force/pad distribution JSON, defense_rate, defense_type_breakdown, movement metrics, session_duration, rounds_completed |
| 13 | SessionState | state (idle/countdown/active/rest/complete), mode, username |
| 14 | SessionConfig | mode, difficulty, combo_sequence (JSON), rounds, work/rest time, speed, style |
| 15 | RobotCommand | command_type (punch/set_speed), punch_code (1-6), speed (slow/medium/fast), source |
| 16 | HeightCommand | target_height_px, current_height_px, action (adjust/calibrate/manual_up/manual_down/stop) |
| 17 | RoundControl | action (start/stop) |
| 18 | DrillDefinition | drill_name, difficulty, combo_sequence (string[]), total_combos, target_speed |
| 19 | DrillEvent | timestamp, event_type (combo_started/completed/missed/partial), combo_index, accuracy, timing_score |
| 20 | DrillProgress | timestamp, combos_completed/remaining, overall_accuracy, current/best_streak |
| 21 | CoachTip | timestamp, tip_text, tip_type (technique/encouragement/correction/suggestion), trigger, priority (0-2) |
Services (6)
| Service | Request | Response |
|---|---|---|
| StartSession | mode, difficulty, config_json, username | success, session_id, message |
| EndSession | session_id | success, summary_json, message |
| StartDrill | drill_name, difficulty, rounds, work/rest time, speed | success, drill_id, message |
| SetImuMode | mode (navigation/training) | success, current_mode |
| CalibrateImuPunch | pad (left/centre/right/head/all) | success, message |
| GenerateLlm | prompt, context_json, system_prompt_key | success, response, generation_time_sec |
Constants
| Class | Values |
|---|---|
| PunchType | jab, cross, left_hook, right_hook, left_uppercut, right_uppercut, block, idle |
| PadLocation | left, centre, right, head |
| ImpactLevel | light (0.33), medium (0.66), hard (1.0) |
| SessionState | idle, countdown, active, rest, complete |
| TrainingMode | training, sparring, free, power, stamina, reaction |
| Difficulty | beginner, intermediate, advanced |
| MotorSpeed | slow (8.0 rad/s), medium (15.0), fast (25.0), max (30.0) |
| DefenseType | block, slip, dodge, hit, unknown |
| NavCommand | prev (left pad), next (right pad), enter (centre pad), back (head pad) |
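For readers who want to use these constants from Python, the numeric ones map naturally onto enums. The table does not say how the project represents them, so the class layout, member names, and the `PAD_TO_NAV` dictionary below are assumptions; only the values are taken from the table.

```python
from enum import Enum


class ImpactLevel(Enum):
    """Impact level with its normalised force value."""
    LIGHT = 0.33
    MEDIUM = 0.66
    HARD = 1.0


class MotorSpeed(Enum):
    """Motor speed presets in rad/s."""
    SLOW = 8.0
    MEDIUM = 15.0
    FAST = 25.0
    MAX = 30.0


# In navigation mode each pad acts as a menu button.
PAD_TO_NAV = {"left": "prev", "right": "next", "centre": "enter", "head": "back"}


def force_normalized(level_name):
    """Look up the normalised force for a level name such as 'medium'."""
    return ImpactLevel[level_name.upper()].value
```

For example, `force_normalized("hard")` yields 1.0, matching the `force_normalized` field values listed for PunchEvent.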
G. Sparring Styles & Training Modes
Section 5.3.2.3 references 5 adaptive AI sparring styles driven by Markov chain transition matrices. This appendix provides the full behavioural profile for each style.
AI Sparring Styles (5)
| Style | Profile | Behaviour |
|---|---|---|
| Boxer | Technical, adaptive, balanced | Mixes all punch types with balanced probabilities. Uses combinations rather than single punches. Moderate attack frequency with clean technique transitions. |
| Brawler | Aggressive, predictable, high-volume | Heavily weighted toward power punches (hooks and crosses). Short gaps between attacks. High repetition of the same punch type. Predictable but overwhelming. |
| Counter-Puncher | Reactive, patient | Low base attack frequency. Attack probability spikes immediately after detecting a user punch. Favours quick, precise counters (jabs and crosses). Long idle periods followed by sudden bursts. |
| Pressure | Relentless, overwhelming | Very high attack frequency. Minimal gaps between punches. Mixes all punch types with shorter wind-up. Designed to test defensive endurance. |
| Switch | Unpredictable, mixed strategy | Periodically transitions between the other four styles. Transition timing is randomised. Creates an unpredictable opponent requiring continuous adaptation. |
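A Markov chain style reduces to a row-stochastic transition matrix that is sampled once per decision step. The sketch below shows the mechanism for a Brawler-like profile; the reduced action set and every probability in the matrix are made up for illustration and are not the project's tuned values.

```python
import numpy as np

ACTIONS = ["idle", "jab", "cross", "left_hook", "right_hook"]

# Hypothetical Brawler matrix: row = current action, column = next action.
# Little idling, heavy weight on crosses and hooks, strong self-repetition.
BRAWLER = np.array([
    [0.10, 0.10, 0.30, 0.25, 0.25],  # from idle
    [0.10, 0.10, 0.30, 0.25, 0.25],  # from jab
    [0.10, 0.05, 0.45, 0.20, 0.20],  # from cross
    [0.10, 0.05, 0.20, 0.45, 0.20],  # from left_hook
    [0.10, 0.05, 0.20, 0.20, 0.45],  # from right_hook
])


def simulate(matrix, steps, rng, start="idle"):
    """Sample a sequence of actions by walking the chain for `steps` steps."""
    state = ACTIONS.index(start)
    sequence = []
    for _ in range(steps):
        state = int(rng.choice(len(ACTIONS), p=matrix[state]))
        sequence.append(ACTIONS[state])
    return sequence


rng = np.random.default_rng(42)
sequence = simulate(BRAWLER, steps=10, rng=rng)
```

Under this framing, the Switch style would amount to swapping the active matrix at randomised intervals, and the Counter-Puncher would select a more aggressive matrix for a short period after a user punch is detected.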
Combo Drill Library (50 combinations)
50 progressive combinations across three difficulty levels, defined in config/drills.yaml.
Performance Tests
| Test | Description | Key Metrics |
|---|---|---|
| Power | Measures maximum punch force via IMU accelerometer | peak_force, avg_force, punch_count |
| Stamina | 120-second sustained effort. Fatigue index = (punch rate last 30s) / (punch rate first 30s) | total_punches, punches_per_minute, fatigue_index |
| Reaction | 3 trials of visual stimulus using YOLO pose detection. Tiers: Lightning, Fast, Average, Developing | avg/best/worst reaction_ms, tier |
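The stamina metrics follow directly from a list of punch timestamps. The sketch below assumes timestamps in seconds measured from the start of the test; the function names and the choice to return `None` when no punches land in the first window are illustrative.

```python
def fatigue_index(punch_times_s, duration_s=120.0, window_s=30.0):
    """Punch rate in the last window divided by punch rate in the first window.

    Both windows have the same length, so the ratio of rates equals the
    ratio of counts. Values below 1 mean the user slowed down.
    """
    first = sum(1 for t in punch_times_s if 0.0 <= t < window_s)
    last = sum(1 for t in punch_times_s if duration_s - window_s <= t <= duration_s)
    if first == 0:
        return None  # undefined without punches in the first window
    return last / first


def punches_per_minute(punch_times_s, duration_s=120.0):
    return len(punch_times_s) / (duration_s / 60.0)


# Hypothetical session: 2 punches/s for the first 30 s, 1 punch/s for the last 30 s.
times = [i * 0.5 for i in range(60)] + [90.0 + i for i in range(30)]
index = fatigue_index(times)      # 30 / 60
rate = punches_per_minute(times)  # 90 punches over 2 minutes
```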
H. Integration Test Breakdown
28 integration tests in notebooks/scripts/test_integration.py, covering 7 categories that verify cross-module communication:
| Category | What It Verifies |
|---|---|
| Config loading & YAML validation | All configuration files parse correctly and contain required fields |
| Pad-location constraint mapping | Each punch type is only accepted on valid pads (e.g. jab/cross → centre pad only) |
| CV+IMU fusion algorithm | Temporal matching, ring buffer behaviour, and fusion window correctness |
| ROS message field verification | All 21 custom message types contain the expected fields and types |
| Motor protocol validation | Punch codes 1–6 map correctly to the expected motor commands |
| Reaction time detection logic | Depth-based proximity threshold triggers at the correct distances |
| Punch sequence file parsing | All combo drill YAML files parse correctly and contain valid punch sequences |
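The pad-location constraint, together with the secondary-prediction fallback tested in TestReclassification, can be sketched as a lookup plus a short loop. The jab, cross, and left_hook entries come from this appendix; the right_hook entry is assumed by symmetry, and uppercuts are left out because the text does not say which pads accept them.

```python
# Punch type -> pads on which it is accepted.
VALID_PADS = {
    "jab": {"centre"},
    "cross": {"centre"},
    "left_hook": {"left"},
    "right_hook": {"right"},  # assumption: mirrors left_hook
}


def resolve_punch(predictions, pad):
    """Return the first of the top two predictions that is valid for `pad`.

    `predictions` is a list of (punch_type, confidence) pairs, best first.
    Returns None when neither the primary nor the secondary prediction fits.
    """
    for punch_type, _confidence in predictions[:2]:
        if pad in VALID_PADS.get(punch_type, set()):
            return punch_type
    return None


# The CV model prefers "jab", but the impact landed on the left pad,
# so the secondary prediction is used instead.
resolved = resolve_punch([("jab", 0.70), ("left_hook", 0.25)], pad="left")
```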