
Appendix 7: Robot Intelligence

Table of Contents
A. Feature Exploration
B. Unit Tests
C. Pose Features
D. Integration Challenges
E. Deployment
F. ROS Catalogue
G. Sparring & Training
H. Integration Tests
I. Repositories

A. Detailed Feature Exploration, 9 Iterations

Section 5.3.1.2 summarises the development journey in a table. This appendix provides the full technical detail for each iteration: architectures tested, feature dimensions, and the precise failure modes that motivated each transition.

Feature Visualisation Comparison

Each video below shows the same boxing footage processed through the corresponding feature extraction method, illustrating the evolution from raw pixels to the final voxel+pose representation.

1. RGBD Baseline
2. 2D Skeleton
3. 3D Skeleton + RGBD
4. Voxel Delta
5. RGBD Bbox Crop
6. Segmentation + Depth
7. Z-Mask Grid
8. Colour Tracking
# | Approach | Features & Architecture | Outcome
1 | 2D Pose + LSTM/GCN | 17 COCO keypoints via YOLO-Pose, 12-channel skeleton features. Models: LSTM, AAGCN (~2.5M params), ST-GCN, ST-GCN++ | Pipeline validated. Training data angles did not match robot camera.
2 | 3D Pose (depth lifting) | 2D keypoints lifted to 3D via depth maps with skeleton constraints. Models: AAGCN, MotionBERT, PoseC3D | ~67%. Glove occlusion blocks wrists at punch extension.
3 | Raw RGBD + 3D CNNs | 4-channel input (RGB + depth), 112×112 crops. Models: Swin3D-B (~88M), MViT V2, R3D-18 (~33M) | 33–88M params, too large for edge. Batch size 2 on Colab.
4 | RGBD + Segmentation + Causal Transformer | YOLO segmentation + frozen EfficientNet-B2 (512-dim). CausalActionTransformer: 6 layers, 8 heads, ~3M trainable / ~35M total. Progressive prediction loss at all timesteps. | Still pixel-based, still required Colab.
5 | Z-Mask (depth grid) | 1,156 dims/frame: 12×12 depth grid, RGB grid, centre of mass, velocity derivatives. BoxingCAT: 1D conv + 4 causal transformer layers + future prediction head. | 2D grid lost 3D spatial info. Hooks vs uppercuts indistinguishable.
6 | Voxel + Colour Tracking | 1,749 dims/frame: 12³ voxel delta + optical flow + glove colour tracking (18 dims). VoxelFlowModel (~1.2M): Conv3D + MLP branches, concatenation fusion. | ~96%. Required coloured gloves; HSV tracking broke when similar colours appeared in the frame (background props, clothing). Insight: voxels need hand info.
7 | Voxel-Only Transformer | 55,296 dims/frame: 24³ grid × 4 channels (delta + directional gradients). CausalVoxelTransformerModel (~2M): Conv3D stem, 4-layer causal transformer. | 80.6%. 19/56 jabs → cross (34% L/R error). Insight: need pose for L/R.
7b | Voxel Optimisation | Grid 24³ → 12³ (55,296 → 17,280 dims). Multi-scale deltas (67 ms + 267 ms). Channel_p90 normalisation. | Real-time capable. Established the 12³ grid for all subsequent iterations.
8 | Voxel + Pose Fusion | 3,480 dims/frame: 12³ × 2-channel voxel (3,456) + 24-dim YOLO pose. Gradients removed (5× reduction). Confidence gating, pose dropout (30% clip + 20% joint). | 97.3% — breakthrough. L/R confusion eliminated. Crucially, this also unlocked glove-agnostic deployment: YOLO Pose locates wrists visually, so the model works with any boxing gloves (or even bare hands), without the iteration-6 constraint of red/green colour markers. Full circle: pose returned.
9 | Production (v11) | Pose expanded to 42 dims. Block annotation fix. Horizontal flip augmentation. TensorRT FP16, parallel threads, RollingFeatureBuffer. | 96.8% — deployed. Correct live block detection.
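As a minimal sketch of the voxel-delta representation used from iteration 6 onward (the 12³ grid size follows the table; function names and the point-cloud extent are invented for illustration):

```python
import numpy as np

def voxelise(points, grid=12, extent=1.0):
    """Bin 3D points (N, 3) in [-extent, extent]^3 into a grid^3 occupancy volume."""
    idx = ((points + extent) / (2 * extent) * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol

def voxel_delta(prev_vol, curr_vol):
    """Signed per-voxel change between consecutive frames: +1 where the body
    entered a cell, -1 where it left."""
    return curr_vol - prev_vol
```

The delta volume is what the downstream Conv3D stems consume; static background cancels out, leaving only motion.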

B. Full Unit Test Breakdown

146 unit tests across 3 test files, supported by 12 shared pytest fixtures in conftest.py.

Shared Test Fixtures

db_manager — Fresh database per test
sample_user — Test account (testuser)
sample_profiles — Beginner / intermediate / coach
sample_session — Training session, 87 punches
sample_sparring_session — Sparring, 210 punches
sample_punch_events — CV + IMU + confirmed punches
sample_defense_data — Defense evaluation scenarios
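A sketch of how fixtures like these might be wired up in conftest.py (the schema and the make_test_db helper are placeholders, not the project's actual code):

```python
import sqlite3
import pytest

def make_test_db(path):
    """Create a fresh SQLite database with a minimal placeholder schema."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
    return conn

@pytest.fixture
def db_manager(tmp_path):
    """Fresh database per test: a new SQLite file under pytest's tmp_path."""
    conn = make_test_db(tmp_path / "test.db")
    yield conn
    conn.close()

@pytest.fixture
def sample_user(db_manager):
    """Test account (testuser) inserted into the fresh database."""
    db_manager.execute("INSERT INTO users VALUES ('testuser')")
    return "testuser"
```

Because db_manager yields a connection bound to tmp_path, every test gets an isolated database with no cross-test state.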

Complete Test Suite

Test Class | Tests | What It Verifies

test_punch_fusion.py: 44 tests (sensor fusion)
TestRingBuffer | 7 | Buffer eviction at max capacity, time-based expiry, pop_match retrieval
TestCVIMUFusion | 3 | Temporal matching within/outside ±500 ms window, closest-first priority
TestPadConstraints | 10 | Each punch type only accepted on valid pads (jab→centre, left_hook→left, etc.)
TestReclassification | 8 | Secondary prediction fallback when primary violates pad constraint
TestExpiredEvents | 4 | CV and IMU events correctly discarded after timeout
TestDefenseClassification | 8 | Priority ordering: HIT > BLOCK > SLIP > DODGE > UNKNOWN
TestSessionStats | 4 | Punch accumulation, peak force tracking, defense statistics

test_gamification.py: 30 tests (progression system)
TestRankSystem | 10 | All 6 rank tiers (Novice→Elite), XP boundary conditions, rank transitions
TestSessionXP | 7 | Base XP per mode, completion bonus, difficulty multiplier, streak bonus, minimum floor
TestSessionScore | 5 | Perfect, zero, and partial score calculations, boundary edge cases
TestAchievements | 5 | Unlock conditions, duplicate prevention, achievement listing
TestLeaderboard | 3 | Ranking computation, tie-breaking rules, empty leaderboard handling

test_database.py: 45+ tests (data persistence)
TestUserManagement | 13 | Create, duplicate prevention, password verify, get/list users, coach account type
TestPatternLock | 5 | Set/verify pattern (SHA-256), wrong pattern rejection, overwrite
TestGuestSessions | 5 | Create guest token, claim by registered user, double-claim prevention
TestPresets | 7 | CRUD operations, invalid field rejection, use count tracking, favourites
TestTrainingSessions | 8 | Save/retrieve sessions, mode filtering, event logging, config persistence
TestGamification | 7 | XP addition, rank progression, personal records, streaks, achievement persistence

Table: Complete breakdown of 146 unit tests across 3 test files and 18 test classes
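The temporal-matching and pad-constraint logic exercised by TestCVIMUFusion and TestPadConstraints can be sketched as follows. Only the jab/cross→centre and left_hook→left mappings are stated above; the rest of the constraint table, the event format, and the function name are assumptions:

```python
# Pad-constraint table: jab/cross -> centre and left_hook -> left come from
# the tests above; the remaining rows are illustrative guesses.
VALID_PADS = {
    "jab": {"centre"},
    "cross": {"centre"},
    "left_hook": {"left"},
    "right_hook": {"right"},
    "left_uppercut": {"left"},
    "right_uppercut": {"right"},
}

FUSION_WINDOW_S = 0.5  # ±500 ms temporal matching window

def fuse(cv_event, imu_events):
    """Return the closest IMU event within the fusion window whose pad is
    valid for the CV-predicted punch type, else None."""
    candidates = [
        e for e in imu_events
        if abs(e["t"] - cv_event["t"]) <= FUSION_WINDOW_S
        and e["pad"] in VALID_PADS.get(cv_event["punch_type"], set())
    ]
    if not candidates:
        return None
    # Closest-first priority: smallest absolute time difference wins.
    return min(candidates, key=lambda e: abs(e["t"] - cv_event["t"]))
```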

C. Pose Feature Breakdown (42 dimensions)

Section 5.3.1.1 references a 42-dimensional pose feature vector extracted from YOLO Pose. This appendix itemises every dimension. All coordinates are normalised by shoulder midpoint and shoulder width to be person-invariant. If shoulder width < 15 pixels, the entire pose vector is zeroed (invalid detection).
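A minimal sketch of this normalisation (the joint ordering and function name are assumptions; only the 15-pixel threshold and the midpoint/width scheme come from the text):

```python
import numpy as np

SHOULDER_MIN_PX = 15  # below this width, the detection is treated as invalid

def normalise_pose(kpts_xy, l_shoulder=1, r_shoulder=2):
    """Shoulder-normalise (J, 2) pixel keypoints: subtract the shoulder
    midpoint, divide by shoulder width; zero everything if the width is
    implausible. Joint indices are assumed, not the project's actual order."""
    width = np.linalg.norm(kpts_xy[l_shoulder] - kpts_xy[r_shoulder])
    if width < SHOULDER_MIN_PX:
        return np.zeros_like(kpts_xy)  # invalid detection -> zero vector
    mid = (kpts_xy[l_shoulder] + kpts_xy[r_shoulder]) / 2
    return (kpts_xy - mid) / width
```

After this step the shoulders always sit at (±0.5, 0), which is what makes the features person- and distance-invariant.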

Static Features (26 dims)

Indices | Dims | Content | Purpose
[0:14] | 14 | Joint coordinates (x, y for 7 joints) | Where each joint is, shoulder-normalised
[14:21] | 7 | Joint confidence scores | YOLO detection reliability, used for confidence gating
[21:23] | 2 | Arm extension ratios | Wrist-to-shoulder distance / shoulder width per arm
[23] | 1 | Shoulder rotation | Body orientation (horizontal shoulder span ratio)
[24:26] | 2 | Elbow angles | Hook vs jab: 0 = fully bent, 1 = fully straight (arccos/π normalised)

Velocity Features (16 dims)

Indices | Dims | Content | Purpose
[0:14] | 14 | Joint velocities (dx, dy per joint) | Which hand is moving, in what direction
[14:16] | 2 | Arm extension rate | Positive = extending (punch), negative = retracting

The 7 upper-body COCO joints used are: Nose, L/R Shoulder, L/R Elbow, L/R Wrist. The first frame of each clip has zero velocity (no previous frame). The same YOLO model (yolo26n-pose.pt) is used for both training extraction and live inference, so training noise matches deployment noise.
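Two of the derived features above, sketched under the same assumptions (invented function names; the arccos/π elbow normalisation and first-frame zero velocity follow the text):

```python
import numpy as np

def velocity_features(prev_xy, curr_xy):
    """Per-joint (dx, dy) between consecutive frames, flattened to 14 dims
    for 7 joints; zeros when there is no previous frame (first clip frame)."""
    if prev_xy is None:
        return np.zeros(curr_xy.size)
    return (curr_xy - prev_xy).ravel()

def elbow_straightness(shoulder, elbow, wrist):
    """Elbow angle mapped to [0, 1]: 0 = fully bent, 1 = fully straight
    (angle at the elbow, arccos normalised by pi)."""
    u = shoulder - elbow
    v = wrist - elbow
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)
```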

D. Integration Challenges & Solutions

During integration, three specific challenges were encountered that required architectural solutions beyond what the individual components were designed for.

1. Camera ownership conflicts
Problem: Multiple components need camera access (CV model, gesture recognition, reaction test), but the RealSense only allows one process to open the camera.
Solution: Single-owner architecture. The cv_node exclusively owns the camera and publishes frames to ROS topics. All other consumers subscribe to these topics. This also ensures a consistent frame rate for the action model.
2. GUI-Dashboard state synchronisation
Problem: The desktop GUI and phone dashboard both need to show the same training state, but communicate through different channels (GUI uses ROS, dashboard uses HTTP/WebSocket).
Solution: Three synchronisation mechanisms:
  1. Shared SQLite database — GUI writes session data via session_manager, dashboard reads via API queries
  2. JSON file polling — Dashboard writes commands to /tmp/boxbunny_gui_command.json, GUI polls every 100ms
  3. WebSocket state buffering — Dashboard backend buffers last known state per user, sends to phone on reconnect
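Mechanism 2 could look roughly like this on the GUI side (the path and 100 ms interval are from the text; function names and the consume-on-read behaviour are assumptions):

```python
import json
import time
from pathlib import Path

COMMAND_FILE = Path("/tmp/boxbunny_gui_command.json")  # path from the text
POLL_INTERVAL_S = 0.1  # 100 ms polling interval

def read_command(path=COMMAND_FILE):
    """Read and consume one dashboard command, tolerating races where the
    file is missing or only half-written."""
    try:
        cmd = json.loads(path.read_text())
        path.unlink(missing_ok=True)  # consume so it is not re-executed
        return cmd
    except (FileNotFoundError, json.JSONDecodeError):
        return None

def poll_loop(handle, stop):
    """GUI-side loop: check for a dashboard command every POLL_INTERVAL_S."""
    while not stop():
        cmd = read_command()
        if cmd is not None:
            handle(cmd)
        time.sleep(POLL_INTERVAL_S)
```

File polling is the simplest of the three mechanisms, which is likely why it carries only low-rate command traffic rather than session state.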
3. LLM response reliability
Problem: The LLM can produce incomplete, cut-off, or markdown-cluttered responses under GPU contention.
Solution: Five mechanisms ensure complete, clean output:
  1. Short max_tokens (128) — limits response length so the model finishes within budget
  2. System prompt instruction — explicitly says “keep tips SHORT (1–2 sentences)”
  3. Inference timeout (20s) — background thread with hard timeout prevents hangs
  4. Markdown stripping — _clean_markdown() removes formatting artifacts
  5. Sentence completion — if a response doesn’t end with ., !, or ?, 8–32 additional tokens are generated to finish the thought
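Mechanisms 4 and 5 might be sketched as follows (the regexes and the generate_more callable are illustrative, not the project's actual _clean_markdown implementation):

```python
import re

def clean_markdown(text):
    """Strip common markdown artifacts: bold/italic markers, backticks,
    heading hashes, and leading bullet characters."""
    text = re.sub(r"[*_`#]+", "", text)
    text = re.sub(r"^\s*[-•]\s*", "", text, flags=re.MULTILINE)
    return text.strip()

def ensure_complete(text, generate_more, max_extra_tokens=32):
    """If the reply does not end in ., ! or ?, ask the model for a few more
    tokens to finish the sentence (generate_more is a stand-in callable)."""
    if text and text[-1] in ".!?":
        return text
    return (text + generate_more(max_extra_tokens)).strip()
```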

E. Standalone Deployment Package

A self-contained action_prediction/ package for running live inference on the Jetson without depending on the training pipeline. All models auto-convert on first run.

Conversion Flow

.pth → .onnx → .trt (Action model, TensorRT FP16) | .pt → .engine (YOLO Pose, Ultralytics TensorRT FP16)

Engines are cached next to source files. Subsequent runs load instantly.
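The first-run caching described here can be sketched as follows (build_engine stands in for the ONNX and TensorRT conversion steps, which are omitted; the helper name is invented):

```python
from pathlib import Path

def ensure_engine(pth_path, build_engine):
    """Return a cached .trt engine next to the source .pth, building it on
    the first run only. build_engine(src, dst) is a stand-in for the
    export pipeline."""
    src = Path(pth_path)
    engine = src.with_suffix(".trt")
    if not engine.exists():
        build_engine(src, engine)  # slow, happens once
    return engine  # subsequent runs hit this immediately
```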

Inference Latency (Jetson Orin NX, TensorRT FP16)

Stage | Latency
YOLO Pose | ~16 ms
Action model | ~8 ms
Depth to voxel | ~8–10 ms
Total per frame | ~24 ms (~42 fps theoretical, 30 fps practical)

Known Jetson Issues

Issue | Workaround
D435i IMU not working (hid_sensor_hub kernel module missing from Tegra kernel) | Use the --camera-pitch flag to set tilt manually
YOLO TensorRT overlay draws at wrong coordinates when using .engine | Visualisation-only; pose features are extracted correctly
numpy 2.x crashes the Jetson torch wheel | Pin to numpy>=1.26,<2.0

F. ROS Message & Service Catalogue

The 5.3 landing page references 21 custom message types and 6 services. This section lists every definition with its key fields.

Custom Messages (21)

# | Message | Key Fields
1 | PadImpact | timestamp, pad (left/centre/right/head), level (light/medium/hard), accel_magnitude
2 | ArmStrike | timestamp, arm (left/right), contact (bool)
3 | ArmStrikeEvent | timestamp, arm (left/right), contact (bool)
4 | IMUStatus | left/centre/right/head pad connected flags, left/right arm connected, is_simulator
5 | PunchEvent | timestamp, pad, level, force_normalized (0.33/0.66/1.0), accel_magnitude
6 | NavCommand | timestamp, command (prev/next/enter/back)
7 | PunchDetection | timestamp, punch_type, confidence, raw_class, consecutive_frames
8 | PoseEstimate | timestamp, keypoints (COCO-17 flattened), movement_delta
9 | UserTracking | timestamp, bbox_centre_x/y, bbox_top_y, bbox_width/height, depth, lateral/depth displacement, user_detected
10 | ConfirmedPunch | timestamp, punch_type, pad, level, force_normalized, cv_confidence, imu_confirmed, cv_confirmed, accel_magnitude
11 | DefenseEvent | timestamp, arm, robot_punch_code, struck, defense_type (block/slip/dodge/unknown)
12 | SessionPunchSummary | total_punches, punch/force/pad distribution JSON, defense_rate, defense_type_breakdown, movement metrics, session_duration, rounds_completed
13 | SessionState | state (idle/countdown/active/rest/complete), mode, username
14 | SessionConfig | mode, difficulty, combo_sequence (JSON), rounds, work/rest time, speed, style
15 | RobotCommand | command_type (punch/set_speed), punch_code (1–6), speed (slow/medium/fast), source
16 | HeightCommand | target_height_px, current_height_px, action (adjust/calibrate/manual_up/manual_down/stop)
17 | RoundControl | action (start/stop)
18 | DrillDefinition | drill_name, difficulty, combo_sequence (string[]), total_combos, target_speed
19 | DrillEvent | timestamp, event_type (combo_started/completed/missed/partial), combo_index, accuracy, timing_score
20 | DrillProgress | timestamp, combos_completed/remaining, overall_accuracy, current/best_streak
21 | CoachTip | timestamp, tip_text, tip_type (technique/encouragement/correction/suggestion), trigger, priority (0–2)
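As an illustration of the interface format, a definition such as PunchDetection might look like this on disk; the field types below are inferred from the table, not copied from the project's actual .msg file:

```
# PunchDetection.msg (sketch; field types are assumptions)
float64 timestamp
string punch_type          # one of the PunchType constants
float32 confidence
string raw_class
int32 consecutive_frames
```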

Services (6)

Service | Request | Response
StartSession | mode, difficulty, config_json, username | success, session_id, message
EndSession | session_id | success, summary_json, message
StartDrill | drill_name, difficulty, rounds, work/rest time, speed | success, drill_id, message
SetImuMode | mode (navigation/training) | success, current_mode
CalibrateImuPunch | pad (left/centre/right/head/all) | success, message
GenerateLlm | prompt, context_json, system_prompt_key | success, response, generation_time_sec

Constants

Class | Values
PunchType | jab, cross, left_hook, right_hook, left_uppercut, right_uppercut, block, idle
PadLocation | left, centre, right, head
ImpactLevel | light (0.33), medium (0.66), hard (1.0)
SessionState | idle, countdown, active, rest, complete
TrainingMode | training, sparring, free, power, stamina, reaction
Difficulty | beginner, intermediate, advanced
MotorSpeed | slow (8.0 rad/s), medium (15.0), fast (25.0), max (30.0)
DefenseType | block, slip, dodge, hit, unknown
NavCommand | prev (left pad), next (right pad), enter (centre pad), back (head pad)

G. Sparring Styles & Training Modes

Section 5.3.2.3 references 5 adaptive AI sparring styles driven by Markov chain transition matrices. This appendix provides the full behavioural profile for each style.

AI Sparring Styles (5)

Style Profile Behaviour
BoxerTechnical, adaptive, balancedMixes all punch types with balanced probabilities. Uses combinations rather than single punches. Moderate attack frequency with clean technique transitions.
BrawlerAggressive, predictable, high-volumeHeavily weighted toward power punches (hooks and crosses). Short gaps between attacks. High repetition of the same punch type. Predictable but overwhelming.
Counter-PuncherReactive, patientLow base attack frequency. Attack probability spikes immediately after detecting a user punch. Favours quick, precise counters (jabs and crosses). Long idle periods followed by sudden bursts.
PressureRelentless, overwhelmingVery high attack frequency. Minimal gaps between punches. Mixes all punch types with shorter wind-up. Designed to test defensive endurance.
SwitchUnpredictable, mixed strategyPeriodically transitions between the other four styles. Transition timing is randomised. Creates an unpredictable opponent requiring continuous adaptation.
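The Markov-chain drive behind these styles can be sketched as follows. The transition probabilities below are invented to illustrate the Brawler profile (power-punch heavy, high repetition); they are not the project's tuned values:

```python
import random

# Illustrative Brawler transition matrix: each row is the probability of the
# next action given the current one. Probabilities are made up for the sketch.
BRAWLER = {
    "idle":       {"idle": 0.2, "cross": 0.3, "left_hook": 0.25, "right_hook": 0.25},
    "cross":      {"cross": 0.4, "left_hook": 0.3, "right_hook": 0.2, "idle": 0.1},
    "left_hook":  {"left_hook": 0.4, "cross": 0.3, "right_hook": 0.2, "idle": 0.1},
    "right_hook": {"right_hook": 0.4, "cross": 0.3, "left_hook": 0.2, "idle": 0.1},
}

def next_action(state, matrix, rng=random):
    """Sample the robot's next action from the current state's transition row."""
    actions, probs = zip(*matrix[state].items())
    return rng.choices(actions, weights=probs)[0]
```

A Switch style would simply swap which matrix is passed in at randomised intervals.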

Combo Drill Library (50 combinations)

50 progressive combinations across three difficulty levels, defined in config/drills.yaml:

Beginner (15)
2–3 punches per combo. Fundamental jab, cross, and basic hook combinations (e.g. 1-2, 1-1-2, 1-2-3).
Intermediate (20)
3–5 punches per combo. All six punch types with mixed combinations and uppercut introductions.
Advanced (15)
4–8 punches per combo. Complex sequences requiring fluid transitions, directional changes, and rapid sequencing.
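A combo string like the "1-2-3" examples above might be parsed and validated as follows (the format and helper name are assumptions; the 1–6 code range matches RobotCommand):

```python
VALID_CODES = set(range(1, 7))  # punch codes 1-6, as in RobotCommand

def parse_combo(combo):
    """Parse a combo string such as '1-2-3' into punch codes, rejecting
    anything outside the 1-6 range."""
    codes = [int(tok) for tok in combo.split("-")]
    if not all(c in VALID_CODES for c in codes):
        raise ValueError(f"invalid punch code in combo: {combo}")
    return codes
```

This mirrors the integration-test category "Punch sequence file parsing" in Appendix H.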

Performance Tests

Test | Description | Key Metrics
Power | Measures maximum punch force via the IMU accelerometer | peak_force, avg_force, punch_count
Stamina | 120-second sustained effort. Fatigue index = (punch rate in last 30 s) / (punch rate in first 30 s) | total_punches, punches_per_minute, fatigue_index
Reaction | 3 trials of a visual stimulus using YOLO pose detection. Tiers: Lightning, Fast, Average, Developing | avg/best/worst reaction_ms, tier
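The fatigue index defined above, as a direct sketch (function name invented; the 120 s duration and 30 s windows come from the table):

```python
def fatigue_index(punch_times, duration=120.0, window=30.0):
    """Fatigue index = punch rate in the last `window` seconds divided by
    the rate in the first `window` seconds. 1.0 means no fade; below 1.0
    means the user slowed down."""
    first = sum(1 for t in punch_times if t < window)
    last = sum(1 for t in punch_times if t >= duration - window)
    return last / first if first else 0.0
```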

H. Integration Test Breakdown

28 integration tests in notebooks/scripts/test_integration.py, covering 7 categories that verify cross-module communication:

Category | What It Verifies
Config loading & YAML validation | All configuration files parse correctly and contain required fields
Pad-location constraint mapping | Each punch type is only accepted on valid pads (e.g. jab/cross → centre pad only)
CV+IMU fusion algorithm | Temporal matching, ring buffer behaviour, and fusion window correctness
ROS message field verification | All 21 custom message types contain the expected fields and types
Motor protocol validation | Punch codes 1–6 map correctly to the expected motor commands
Reaction time detection logic | Depth-based proximity threshold triggers at the correct distances
Punch sequence file parsing | All combo drill YAML files parse correctly and contain valid punch sequences

I. Repository Links

Action Recognition — CV model training, feature extraction, annotation tools, data pipeline
Robot Workspace — ROS 2 system, all nodes, GUI, phone dashboard, tests, deployment