
Appendix 7: Robot Intelligence

Table of Contents
A. Feature Exploration
B. Unit Tests
C. Pose Features
D. Integration Challenges
E. Deployment
F. ROS Catalogue
G. Sparring & Training
H. Integration Tests
I. Repositories

A. Detailed Feature Exploration, 9 Iterations

Section 5.3.1.2 summarises the development journey in a table. This appendix provides the full technical detail for each iteration: architectures tested, feature dimensions, and the precise failure modes that motivated each transition.

Feature Visualisation Comparison

Each video below shows the same boxing footage processed through the corresponding feature extraction method, illustrating the evolution from raw pixels to the final voxel+pose representation.

1. RGBD Baseline
2. 2D Skeleton
3. 3D Skeleton + RGBD
4. Voxel Delta
5. RGBD Bbox Crop
6. Segmentation + Depth
7. Z-Mask Grid
8. Colour Tracking
# | Approach | Features & Architecture | Outcome
1 | 2D Pose + LSTM/GCN | 17 COCO keypoints via YOLO-Pose, 12-channel skeleton features. Models: LSTM, AAGCN (~2.5M params), ST-GCN, ST-GCN++ | Pipeline validated. Training data angles did not match robot camera.
2 | 3D Pose (depth lifting) | 2D keypoints lifted to 3D via depth maps with skeleton constraints. Models: AAGCN, MotionBERT, PoseC3D | ~67%. Glove occlusion blocks wrists at punch extension.
3 | Raw RGBD + 3D CNNs | 4-channel input (RGB + depth), 112×112 crops. Models: Swin3D-B (~88M), MViT V2, R3D-18 (~33M) | 33–88M params, too large for edge. Batch size 2 on Colab.
4 | RGBD + Segmentation + Causal Transformer | YOLO segmentation + frozen EfficientNet-B2 (512-dim). CausalActionTransformer: 6 layers, 8 heads, ~3M trainable / ~35M total. Progressive prediction loss at all timesteps. | Still pixel-based, still required Colab.
5 | Z-Mask (depth grid) | 1,156 dims/frame: 12×12 depth grid, RGB grid, centre of mass, velocity derivatives. BoxingCAT: 1D conv + 4 causal transformer layers + future prediction head. | 2D grid lost 3D spatial info. Hooks vs uppercuts indistinguishable.
6 | Voxel + Colour Tracking | 1,749 dims/frame: 12³ voxel delta + optical flow + glove colour tracking (18 dims). VoxelFlowModel (~1.2M): Conv3D + MLP branches, concatenation fusion. | ~96%. Required coloured gloves; HSV tracking broke when similar colours appeared in the frame (background props, clothing). Insight: voxels need hand info.
7 | Voxel-Only Transformer | 55,296 dims/frame: 24³ grid × 4 channels (delta + directional gradients). CausalVoxelTransformerModel (~2M): Conv3D stem, 4-layer causal transformer. | 80.6%. 19/56 jabs → cross (34% L/R error). Insight: need pose for L/R.
7b | Voxel Optimisation | Grid 24³ → 12³ (55,296 → 17,280 dims). Multi-scale deltas (67 ms + 267 ms). Channel_p90 normalisation. | Real-time capable. Established the 12³ grid for all subsequent iterations.
8 | Voxel + Pose Fusion | 3,480 dims/frame: 12³ × 2-channel voxel (3,456) + 24-dim YOLO pose. Gradients removed (5× reduction). Confidence gating, pose dropout (30% clip + 20% joint). | 97.3% — breakthrough. L/R confusion eliminated. Crucially, this also unlocked glove-agnostic deployment: YOLO Pose locates wrists visually, so the model works with any boxing gloves (or even bare hands), without the iteration-6 constraint of red/green colour markers. Full circle: pose returned.
9 | Production (v11) | Pose expanded to 42 dims. Block annotation fix. Horizontal flip augmentation. TensorRT FP16, parallel threads, RollingFeatureBuffer. | 96.8% — deployed. Correct live block detection.
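As a minimal sketch of the voxel-delta representation used from iteration 6 onward (the 12³ grid size follows the table; function names and the point-cloud extent are invented for illustration):

```python
import numpy as np

def voxelise(points, grid=12, extent=1.0):
    """Bin 3D points (N, 3) in [-extent, extent]^3 into a grid^3 occupancy volume."""
    idx = ((points + extent) / (2 * extent) * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol

def voxel_delta(prev_vol, curr_vol):
    """Signed per-voxel change between consecutive frames: +1 where the body
    entered a cell, -1 where it left."""
    return curr_vol - prev_vol
```

The delta volume is what the downstream Conv3D stems consume; static background cancels out, leaving only motion.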

B. Full Unit Test Breakdown

146 unit tests across 3 test files, supported by 12 shared pytest fixtures in conftest.py.

Shared Test Fixtures

db_manager — Fresh database per test
sample_user — Test account (testuser)
sample_profiles — Beginner / intermediate / coach
sample_session — Training session, 87 punches
sample_sparring_session — Sparring, 210 punches
sample_punch_events — CV + IMU + confirmed punches
sample_defense_data — Defense evaluation scenarios
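A sketch of how fixtures like these might be wired up in conftest.py (the schema and the make_test_db helper are placeholders, not the project's actual code):

```python
import sqlite3
import pytest

def make_test_db(path):
    """Create a fresh SQLite database with a minimal placeholder schema."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
    return conn

@pytest.fixture
def db_manager(tmp_path):
    """Fresh database per test: a new SQLite file under pytest's tmp_path."""
    conn = make_test_db(tmp_path / "test.db")
    yield conn
    conn.close()

@pytest.fixture
def sample_user(db_manager):
    """Test account (testuser) inserted into the fresh database."""
    db_manager.execute("INSERT INTO users VALUES ('testuser')")
    return "testuser"
```

Because db_manager yields a connection bound to tmp_path, every test gets an isolated database with no cross-test state.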

Complete Test Suite

Test Class | Tests | What It Verifies

test_punch_fusion.py: 44 tests (sensor fusion)
TestRingBuffer | 7 | Buffer eviction at max capacity, time-based expiry, pop_match retrieval
TestCVIMUFusion | 3 | Temporal matching within/outside ±500 ms window, closest-first priority
TestPadConstraints | 10 | Each punch type only accepted on valid pads (jab→centre, left_hook→left, etc.)
TestReclassification | 8 | Secondary prediction fallback when primary violates pad constraint
TestExpiredEvents | 4 | CV and IMU events correctly discarded after timeout
TestDefenseClassification | 8 | Priority ordering: HIT > BLOCK > SLIP > DODGE > UNKNOWN
TestSessionStats | 4 | Punch accumulation, peak force tracking, defense statistics

test_gamification.py: 30 tests (progression system)
TestRankSystem | 10 | All 6 rank tiers (Novice→Elite), XP boundary conditions, rank transitions
TestSessionXP | 7 | Base XP per mode, completion bonus, difficulty multiplier, streak bonus, minimum floor
TestSessionScore | 5 | Perfect, zero, and partial score calculations, boundary edge cases
TestAchievements | 5 | Unlock conditions, duplicate prevention, achievement listing
TestLeaderboard | 3 | Ranking computation, tie-breaking rules, empty leaderboard handling

test_database.py: 45+ tests (data persistence)
TestUserManagement | 13 | Create, duplicate prevention, password verify, get/list users, coach account type
TestPatternLock | 5 | Set/verify pattern (SHA-256), wrong pattern rejection, overwrite
TestGuestSessions | 5 | Create guest token, claim by registered user, double-claim prevention
TestPresets | 7 | CRUD operations, invalid field rejection, use count tracking, favourites
TestTrainingSessions | 8 | Save/retrieve sessions, mode filtering, event logging, config persistence
TestGamification | 7 | XP addition, rank progression, personal records, streaks, achievement persistence

Table: Complete breakdown of 146 unit tests across 3 test files and 18 test classes
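The temporal-matching and pad-constraint logic exercised by TestCVIMUFusion and TestPadConstraints can be sketched as follows. Only the jab/cross→centre and left_hook→left mappings are stated above; the rest of the constraint table, the event format, and the function name are assumptions:

```python
# Pad-constraint table: jab/cross -> centre and left_hook -> left come from
# the tests above; the remaining rows are illustrative guesses.
VALID_PADS = {
    "jab": {"centre"},
    "cross": {"centre"},
    "left_hook": {"left"},
    "right_hook": {"right"},
    "left_uppercut": {"left"},
    "right_uppercut": {"right"},
}

FUSION_WINDOW_S = 0.5  # ±500 ms temporal matching window

def fuse(cv_event, imu_events):
    """Return the closest IMU event within the fusion window whose pad is
    valid for the CV-predicted punch type, else None."""
    candidates = [
        e for e in imu_events
        if abs(e["t"] - cv_event["t"]) <= FUSION_WINDOW_S
        and e["pad"] in VALID_PADS.get(cv_event["punch_type"], set())
    ]
    if not candidates:
        return None
    # Closest-first priority: smallest absolute time difference wins.
    return min(candidates, key=lambda e: abs(e["t"] - cv_event["t"]))
```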

C. Pose Feature Breakdown (42 dimensions)

Section 5.3.1.1 references a 42-dimensional pose feature vector extracted from YOLO Pose. This appendix itemises every dimension. All coordinates are normalised by shoulder midpoint and shoulder width to be person-invariant. If shoulder width < 15 pixels, the entire pose vector is zeroed (invalid detection).
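A minimal sketch of this normalisation (the joint ordering and function name are assumptions; only the 15-pixel threshold and the midpoint/width scheme come from the text):

```python
import numpy as np

SHOULDER_MIN_PX = 15  # below this width, the detection is treated as invalid

def normalise_pose(kpts_xy, l_shoulder=1, r_shoulder=2):
    """Shoulder-normalise (J, 2) pixel keypoints: subtract the shoulder
    midpoint, divide by shoulder width; zero everything if the width is
    implausible. Joint indices are assumed, not the project's actual order."""
    width = np.linalg.norm(kpts_xy[l_shoulder] - kpts_xy[r_shoulder])
    if width < SHOULDER_MIN_PX:
        return np.zeros_like(kpts_xy)  # invalid detection -> zero vector
    mid = (kpts_xy[l_shoulder] + kpts_xy[r_shoulder]) / 2
    return (kpts_xy - mid) / width
```

After this step the shoulders always sit at (±0.5, 0), which is what makes the features person- and distance-invariant.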

Static Features (26 dims)

Indices | Dims | Content | Purpose
[0:14] | 14 | Joint coordinates (x, y for 7 joints) | Where each joint is, shoulder-normalised
[14:21] | 7 | Joint confidence scores | YOLO detection reliability, used for confidence gating
[21:23] | 2 | Arm extension ratios | Wrist-to-shoulder distance / shoulder width per arm
[23] | 1 | Shoulder rotation | Body orientation (horizontal shoulder span ratio)
[24:26] | 2 | Elbow angles | Hook vs jab: 0 = fully bent, 1 = fully straight (arccos/π normalised)

Velocity Features (16 dims)

Indices | Dims | Content | Purpose
[0:14] | 14 | Joint velocities (dx, dy per joint) | Which hand is moving, in what direction
[14:16] | 2 | Arm extension rate | Positive = extending (punch), negative = retracting

The 7 upper-body COCO joints used are: Nose, L/R Shoulder, L/R Elbow, L/R Wrist. The first frame of each clip has zero velocity (no previous frame). The same YOLO model (yolo26n-pose.pt) is used for both training extraction and live inference, so training noise matches deployment noise.
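Two of the derived features above, sketched under the same assumptions (invented function names; the arccos/π elbow normalisation and first-frame zero velocity follow the text):

```python
import numpy as np

def velocity_features(prev_xy, curr_xy):
    """Per-joint (dx, dy) between consecutive frames, flattened to 14 dims
    for 7 joints; zeros when there is no previous frame (first clip frame)."""
    if prev_xy is None:
        return np.zeros(curr_xy.size)
    return (curr_xy - prev_xy).ravel()

def elbow_straightness(shoulder, elbow, wrist):
    """Elbow angle mapped to [0, 1]: 0 = fully bent, 1 = fully straight
    (angle at the elbow, arccos normalised by pi)."""
    u = shoulder - elbow
    v = wrist - elbow
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)
```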

D. Integration Challenges & Solutions

During integration, three specific challenges were encountered that required architectural solutions beyond what the individual components were designed for.

1. Camera ownership conflicts
Problem: Multiple components need camera access (CV model, gesture recognition, reaction test), but the RealSense only allows one process to open the camera.
Solution: Single-owner architecture. The cv_node exclusively owns the camera and publishes frames to ROS topics. All other consumers subscribe to these topics. This also ensures a consistent frame rate for the action model.
2. GUI-Dashboard state synchronisation
Problem: The desktop GUI and phone dashboard both need to show the same training state, but communicate through different channels (GUI uses ROS, dashboard uses HTTP/WebSocket).
Solution: Three synchronisation mechanisms:
  1. Shared SQLite database — GUI writes session data via session_manager, dashboard reads via API queries
  2. JSON file polling — Dashboard writes commands to /tmp/boxbunny_gui_command.json, GUI polls every 100ms
  3. WebSocket state buffering — Dashboard backend buffers last known state per user, sends to phone on reconnect
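Mechanism 2 could look roughly like this on the GUI side (the path and 100 ms interval are from the text; function names and the consume-on-read behaviour are assumptions):

```python
import json
import time
from pathlib import Path

COMMAND_FILE = Path("/tmp/boxbunny_gui_command.json")  # path from the text
POLL_INTERVAL_S = 0.1  # 100 ms polling interval

def read_command(path=COMMAND_FILE):
    """Read and consume one dashboard command, tolerating races where the
    file is missing or only half-written."""
    try:
        cmd = json.loads(path.read_text())
        path.unlink(missing_ok=True)  # consume so it is not re-executed
        return cmd
    except (FileNotFoundError, json.JSONDecodeError):
        return None

def poll_loop(handle, stop):
    """GUI-side loop: check for a dashboard command every POLL_INTERVAL_S."""
    while not stop():
        cmd = read_command()
        if cmd is not None:
            handle(cmd)
        time.sleep(POLL_INTERVAL_S)
```

File polling is the simplest of the three mechanisms, which is likely why it carries only low-rate command traffic rather than session state.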
3. LLM response reliability
Problem: The LLM can produce incomplete, cut-off, or markdown-cluttered responses under GPU contention.
Solution: Five mechanisms ensure complete, clean output:
  1. Short max_tokens (128) — limits response length so the model finishes within budget
  2. System prompt instruction — explicitly says “keep tips SHORT (1–2 sentences)”
  3. Inference timeout (20s) — background thread with hard timeout prevents hangs
  4. Markdown stripping — _clean_markdown() removes formatting artifacts
  5. Sentence completion — if a response doesn’t end with ., !, or ?, 8–32 additional tokens are generated to finish the thought
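Mechanisms 4 and 5 might be sketched as follows (the regexes and the generate_more callable are illustrative, not the project's actual _clean_markdown implementation):

```python
import re

def clean_markdown(text):
    """Strip common markdown artifacts: bold/italic markers, backticks,
    heading hashes, and leading bullet characters."""
    text = re.sub(r"[*_`#]+", "", text)
    text = re.sub(r"^\s*[-•]\s*", "", text, flags=re.MULTILINE)
    return text.strip()

def ensure_complete(text, generate_more, max_extra_tokens=32):
    """If the reply does not end in ., ! or ?, ask the model for a few more
    tokens to finish the sentence (generate_more is a stand-in callable)."""
    if text and text[-1] in ".!?":
        return text
    return (text + generate_more(max_extra_tokens)).strip()
```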

E. Standalone Deployment Package

A self-contained action_prediction/ package for running live inference on the Jetson without depending on the training pipeline. All models auto-convert on first run.

Conversion Flow

.pth → .onnx → .trt (Action model, TensorRT FP16) | .pt → .engine (YOLO Pose, Ultralytics TensorRT FP16)

Engines are cached next to source files. Subsequent runs load instantly.
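The first-run caching described here can be sketched as follows (build_engine stands in for the ONNX and TensorRT conversion steps, which are omitted; the helper name is invented):

```python
from pathlib import Path

def ensure_engine(pth_path, build_engine):
    """Return a cached .trt engine next to the source .pth, building it on
    the first run only. build_engine(src, dst) is a stand-in for the
    export pipeline."""
    src = Path(pth_path)
    engine = src.with_suffix(".trt")
    if not engine.exists():
        build_engine(src, engine)  # slow, happens once
    return engine  # subsequent runs hit this immediately
```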

Inference Latency (Jetson Orin NX, TensorRT FP16)

Stage | Latency
YOLO Pose | ~16 ms
Action model | ~8 ms
Depth to voxel | ~8–10 ms
Total per frame | ~24 ms (~42 fps theoretical, 30 fps practical)

Known Jetson Issues

Issue | Workaround
D435i IMU not working (hid_sensor_hub kernel module missing from Tegra kernel) | Use the --camera-pitch flag to set tilt manually
YOLO TensorRT overlay draws at wrong coordinates when using .engine | Visualisation-only; pose features are extracted correctly
numpy 2.x crashes the Jetson torch wheel | Pin to numpy>=1.26,<2.0

F. ROS Message & Service Catalogue

The 5.3 landing page references 21 custom message types and 6 services. This section lists every definition with its key fields.

Custom Messages (21)

# | Message | Key Fields
1 | PadImpact | timestamp, pad (left/centre/right/head), level (light/medium/hard), accel_magnitude
2 | ArmStrike | timestamp, arm (left/right), contact (bool)
3 | ArmStrikeEvent | timestamp, arm (left/right), contact (bool)
4 | IMUStatus | left/centre/right/head pad connected flags, left/right arm connected, is_simulator
5 | PunchEvent | timestamp, pad, level, force_normalized (0.33/0.66/1.0), accel_magnitude
6 | NavCommand | timestamp, command (prev/next/enter/back)
7 | PunchDetection | timestamp, punch_type, confidence, raw_class, consecutive_frames
8 | PoseEstimate | timestamp, keypoints (COCO-17 flattened), movement_delta
9 | UserTracking | timestamp, bbox_centre_x/y, bbox_top_y, bbox_width/height, depth, lateral/depth displacement, user_detected
10 | ConfirmedPunch | timestamp, punch_type, pad, level, force_normalized, cv_confidence, imu_confirmed, cv_confirmed, accel_magnitude
11 | DefenseEvent | timestamp, arm, robot_punch_code, struck, defense_type (block/slip/dodge/unknown)
12 | SessionPunchSummary | total_punches, punch/force/pad distribution JSON, defense_rate, defense_type_breakdown, movement metrics, session_duration, rounds_completed
13 | SessionState | state (idle/countdown/active/rest/complete), mode, username
14 | SessionConfig | mode, difficulty, combo_sequence (JSON), rounds, work/rest time, speed, style
15 | RobotCommand | command_type (punch/set_speed), punch_code (1–6), speed (slow/medium/fast), source
16 | HeightCommand | target_height_px, current_height_px, action (adjust/calibrate/manual_up/manual_down/stop)
17 | RoundControl | action (start/stop)
18 | DrillDefinition | drill_name, difficulty, combo_sequence (string[]), total_combos, target_speed
19 | DrillEvent | timestamp, event_type (combo_started/completed/missed/partial), combo_index, accuracy, timing_score
20 | DrillProgress | timestamp, combos_completed/remaining, overall_accuracy, current/best_streak
21 | CoachTip | timestamp, tip_text, tip_type (technique/encouragement/correction/suggestion), trigger, priority (0–2)
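As an illustration of the interface format, a definition such as PunchDetection might look like this on disk; the field types below are inferred from the table, not copied from the project's actual .msg file:

```
# PunchDetection.msg (sketch; field types are assumptions)
float64 timestamp
string punch_type          # one of the PunchType constants
float32 confidence
string raw_class
int32 consecutive_frames
```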

Services (6)

Service | Request | Response
StartSession | mode, difficulty, config_json, username | success, session_id, message
EndSession | session_id | success, summary_json, message
StartDrill | drill_name, difficulty, rounds, work/rest time, speed | success, drill_id, message
SetImuMode | mode (navigation/training) | success, current_mode
CalibrateImuPunch | pad (left/centre/right/head/all) | success, message
GenerateLlm | prompt, context_json, system_prompt_key | success, response, generation_time_sec

Constants

Class | Values
PunchType | jab, cross, left_hook, right_hook, left_uppercut, right_uppercut, block, idle
PadLocation | left, centre, right, head
ImpactLevel | light (0.33), medium (0.66), hard (1.0)
SessionState | idle, countdown, active, rest, complete
TrainingMode | training, sparring, free, power, stamina, reaction
Difficulty | beginner, intermediate, advanced
MotorSpeed | slow (8.0 rad/s), medium (15.0), fast (25.0), max (30.0)
DefenseType | block, slip, dodge, hit, unknown
NavCommand | prev (left pad), next (right pad), enter (centre pad), back (head pad)

G. Sparring Styles & Training Modes

Section 5.3.2.3 references 5 adaptive AI sparring styles driven by Markov chain transition matrices. This appendix provides the full behavioural profile for each style.

AI Sparring Styles (5)

Style Profile Behaviour
BoxerTechnical, adaptive, balancedMixes all punch types with balanced probabilities. Uses combinations rather than single punches. Moderate attack frequency with clean technique transitions.
BrawlerAggressive, predictable, high-volumeHeavily weighted toward power punches (hooks and crosses). Short gaps between attacks. High repetition of the same punch type. Predictable but overwhelming.
Counter-PuncherReactive, patientLow base attack frequency. Attack probability spikes immediately after detecting a user punch. Favours quick, precise counters (jabs and crosses). Long idle periods followed by sudden bursts.
PressureRelentless, overwhelmingVery high attack frequency. Minimal gaps between punches. Mixes all punch types with shorter wind-up. Designed to test defensive endurance.
SwitchUnpredictable, mixed strategyPeriodically transitions between the other four styles. Transition timing is randomised. Creates an unpredictable opponent requiring continuous adaptation.
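The Markov-chain drive behind these styles can be sketched as follows. The transition probabilities below are invented to illustrate the Brawler profile (power-punch heavy, high repetition); they are not the project's tuned values:

```python
import random

# Illustrative Brawler transition matrix: each row is the probability of the
# next action given the current one. Probabilities are made up for the sketch.
BRAWLER = {
    "idle":       {"idle": 0.2, "cross": 0.3, "left_hook": 0.25, "right_hook": 0.25},
    "cross":      {"cross": 0.4, "left_hook": 0.3, "right_hook": 0.2, "idle": 0.1},
    "left_hook":  {"left_hook": 0.4, "cross": 0.3, "right_hook": 0.2, "idle": 0.1},
    "right_hook": {"right_hook": 0.4, "cross": 0.3, "left_hook": 0.2, "idle": 0.1},
}

def next_action(state, matrix, rng=random):
    """Sample the robot's next action from the current state's transition row."""
    actions, probs = zip(*matrix[state].items())
    return rng.choices(actions, weights=probs)[0]
```

A Switch style would simply swap which matrix is passed in at randomised intervals.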

Combo Drill Library (50 combinations)

50 progressive combinations across three difficulty levels, defined in config/drills.yaml:

Beginner (15)
2–3 punches per combo. Fundamental jab, cross, and basic hook combinations (e.g. 1-2, 1-1-2, 1-2-3).
Intermediate (20)
3–5 punches per combo. All six punch types with mixed combinations and uppercut introductions.
Advanced (15)
4–8 punches per combo. Complex sequences requiring fluid transitions, directional changes, and rapid sequencing.
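A combo string like the "1-2-3" examples above might be parsed and validated as follows (the format and helper name are assumptions; the 1–6 code range matches RobotCommand):

```python
VALID_CODES = set(range(1, 7))  # punch codes 1-6, as in RobotCommand

def parse_combo(combo):
    """Parse a combo string such as '1-2-3' into punch codes, rejecting
    anything outside the 1-6 range."""
    codes = [int(tok) for tok in combo.split("-")]
    if not all(c in VALID_CODES for c in codes):
        raise ValueError(f"invalid punch code in combo: {combo}")
    return codes
```

This mirrors the integration-test category "Punch sequence file parsing" in Appendix H.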

Performance Tests

Test | Description | Key Metrics
Power | Measures maximum punch force via the IMU accelerometer | peak_force, avg_force, punch_count
Stamina | 120-second sustained effort. Fatigue index = (punch rate in last 30 s) / (punch rate in first 30 s) | total_punches, punches_per_minute, fatigue_index
Reaction | 3 trials of a visual stimulus using YOLO pose detection. Tiers: Lightning, Fast, Average, Developing | avg/best/worst reaction_ms, tier
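The fatigue index defined above, as a direct sketch (function name invented; the 120 s duration and 30 s windows come from the table):

```python
def fatigue_index(punch_times, duration=120.0, window=30.0):
    """Fatigue index = punch rate in the last `window` seconds divided by
    the rate in the first `window` seconds. 1.0 means no fade; below 1.0
    means the user slowed down."""
    first = sum(1 for t in punch_times if t < window)
    last = sum(1 for t in punch_times if t >= duration - window)
    return last / first if first else 0.0
```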

H. Integration Test Breakdown

28 integration tests in notebooks/scripts/test_integration.py, covering 7 categories that verify cross-module communication:

Category | What It Verifies
Config loading & YAML validation | All configuration files parse correctly and contain required fields
Pad-location constraint mapping | Each punch type is only accepted on valid pads (e.g. jab/cross → centre pad only)
CV+IMU fusion algorithm | Temporal matching, ring buffer behaviour, and fusion window correctness
ROS message field verification | All 21 custom message types contain the expected fields and types
Motor protocol validation | Punch codes 1–6 map correctly to the expected motor commands
Reaction time detection logic | Depth-based proximity threshold triggers at the correct distances
Punch sequence file parsing | All combo drill YAML files parse correctly and contain valid punch sequences

I. Repository Links

Action Recognition — CV model training, feature extraction, annotation tools, data pipeline
Robot Workspace — ROS 2 system, all nodes, GUI, phone dashboard, tests, deployment