
5.3.4 Testing & Validation

Verifies RI-1 to RI-8

The Robot Intelligence subsystem was validated through four activities: model performance and interpretability analysis, an early prototype deployment at the CDE AI Fair, comprehensive automated testing, and final system validation on the full robot with real hardware.

V-Model alignment. Following the Systems Engineering V-Model introduced in Section 3.2, this page corresponds to the right side of the V, where each Robot Intelligence requirement (RI-1 through RI-8) is closed against a concrete verification activity. Unit and integration tests verify the lower levels of the V (individual components and cross-module communication), the CDE AI Fair served as an integration-level test that surfaced real-world issues, and the full robot test with real camera, IMU, and motors closes the system-level verification. Section 5.3.4.7 Verification Summary ties every requirement to its result.

5.3.4.1 Model Validation & Interpretability

RI-3

This section presents how well the final model performs and provides visual evidence of what the model has learned internally. The goal is to verify that the model is making decisions for the right reasons, not just memorising patterns.

Overall Performance

Final Model (v11), Headline Results
96.8% validation accuracy on an unseen person
94.2% macro F1-score (class-balanced)
7 / 221 validation samples misclassified

Macro F1 averages F1 across all classes equally regardless of class size, which is a fairer measure than raw accuracy when classes are imbalanced. Model v10 reached 97.3% on the same split but had poorer live block detection, so v11 was selected for deployment after the block annotation fix.

Classification Results

The confusion matrix (left) shows where the model gets confused: each cell counts how often a true class (row) was predicted as another (column). The strong diagonal means the model is right most of the time, and the few off-diagonal cells cluster between biomechanically similar actions: left hook confused with left uppercut (both swing the left arm), and block confused with idle (both involve minimal movement).

The per-class accuracy chart (right) shows the hit rate per punch type. Most classes exceed 96%, with right hook and right uppercut at a perfect 100%. The notable exception is the block class, which is addressed in the callout below.
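Both charts are computed from the same underlying counts. A minimal sketch of that computation, with toy labels for illustration:

```python
import numpy as np

CLASSES = ["jab", "cross", "left_hook", "left_uppercut",
           "right_hook", "right_uppercut", "block", "idle"]

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predictions."""
    idx = {c: i for i, c in enumerate(classes)}
    m = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[idx[t], idx[p]] += 1
    return m

def per_class_accuracy(m):
    """Diagonal over row sums; classes with no samples come out as NaN."""
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.diag(m) / m.sum(axis=1)

# Toy sample: blocks sometimes mistaken for idle, as in the real matrix.
m = confusion_matrix(["jab", "jab", "block", "block", "block"],
                     ["jab", "jab", "block", "idle", "idle"])
print(per_class_accuracy(m)[CLASSES.index("jab")])  # 1.0
```

The per-class accuracy chart is simply this diagonal-over-row-sum ratio plotted per punch type.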

Block class, the hardest action to classify

Problem: block is the weakest class at 66.7%. The start of a block (arms rising from guard) looks nearly identical to the start of an uppercut (arms rising into the punch) for the first few frames.

Insight: the model only ever saw the onset of a block during training, so it had no way to disambiguate it from an uppercut without seeing the rest of the action.

Fix & live impact: the block annotation fix in 5.3.1 extended annotations to cover the full guard sequence (rise, hold, return). The validation headline number stayed in the same range, but live block reliability improved significantly; the gain came entirely from the annotation fix, not from any model change.

Training Convergence

The dashboard below tracks how the model improved during training. The loss (top-left) measures how wrong the model's predictions are; lower is better. The accuracy (top-right) shows the percentage of correct predictions on both training data (blue) and unseen validation data (pink). The gap between the two lines indicates overfitting; here the gap stays narrow, confirming the model generalises well rather than memorising training examples. The model reached its best performance at epoch 73 of 133.

What the Model Learned (t-SNE Visualisation)

The model's internal 384-dimensional representations are compressed to 2D using t-SNE (a dimensionality-reduction method that preserves local similarity). Each dot is a punch sample coloured by its true class; if the model has learned meaningful features, same-class dots should cluster together and different classes should separate.
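For reference, a sketch of the projection step using scikit-learn's TSNE on synthetic stand-in embeddings (the real inputs are the trained model's 384-dimensional features; the data here is illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the model's 384-d embeddings: two well-separated
# "classes" of 20 samples each (the real data comes from the trained network).
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 0.5, size=(20, 384)),
                        rng.normal(3.0, 0.5, size=(20, 384))])

# Compress 384-d points to 2-D; perplexity must stay below the sample count.
coords = TSNE(n_components=2, perplexity=10, init="pca",
              random_state=0).fit_transform(embeddings)
print(coords.shape)  # (40, 2)
```

Each output row is then plotted as one dot, coloured by the sample's true class.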

The plot below confirms exactly that: tight, well-separated clusters for each action class. The only clusters that partially overlap are left hook / left uppercut, the same pair that confuses the model in the confusion matrix above. The two visualisations agree.

Which Frames Matter Most

Gradient saliency measures how much the model's output changes with respect to each input, revealing which frames in the 12-frame window the model relied on most. The chart below shows a clear pattern: later frames are significantly more important than early ones. The most recent frame (index 11) scores highest, with frames 6–9 close behind, while the first few frames (indices 1–4) contribute the least.
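Gradient saliency is normally computed with autograd on the trained network; the sketch below uses a finite-difference stand-in on a toy scorer to show the shape of the measurement (the scorer, weights, and names are all illustrative):

```python
import numpy as np

def frame_saliency(score_fn, window, eps=1e-3):
    """Per-frame sensitivity of the score to a small perturbation of that frame.
    window: (frames, features) array; score_fn maps a window to a scalar."""
    base = score_fn(window)
    saliency = np.empty(window.shape[0])
    for i in range(window.shape[0]):
        bumped = window.copy()
        bumped[i] += eps                    # nudge every feature of frame i
        saliency[i] = abs(score_fn(bumped) - base) / eps
    return saliency

# Toy scorer that, like the trained model, leans on recent frames.
weights = np.linspace(0.1, 1.0, 12)[:, None]  # frame 0 (oldest) .. 11 (newest)
score = lambda w: float((w * weights).sum())

saliency = frame_saliency(score, np.random.default_rng(1).normal(size=(12, 4)))
print(int(saliency.argmax()))  # 11 -- the most recent frame dominates
```

The real chart is this per-frame sensitivity profile, computed with true gradients over the validation set.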

Importantly, this is not the same as "predicting after the punch ends". The 12-frame window is a rolling causal window that always covers the most recent ~400 ms of motion. As a punch unfolds, that window slides forward with it, so "the latest frame in the window" is always the present moment, not the end of the punch. A new prediction is emitted every frame while the punch is still in progress.

What this chart actually says is that the model makes its decisions based on the most recent ~200 ms of motion rather than the deeper history. That is exactly what action prediction needs: the model commits to a punch type as soon as the discriminative motion appears in the recent past, not after the punch retracts. The very early frames in the window mostly capture the preparation phase before the discriminative motion has begun, so they reasonably contribute less.
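The rolling causal window described above can be sketched as a fixed-length deque that emits one prediction per incoming frame (the frame values and classify stub are illustrative):

```python
from collections import deque

WINDOW = 12  # ~400 ms of motion at ~30 FPS

def stream_predictions(frames, classify):
    """Emit one prediction per incoming frame once the window has filled."""
    window = deque(maxlen=WINDOW)   # oldest frame falls out automatically
    predictions = []
    for frame in frames:
        window.append(frame)        # the newest frame is always the present
        if len(window) == WINDOW:
            predictions.append(classify(list(window)))
    return predictions

# Stub classifier that just reports the newest frame in its window.
preds = stream_predictions(range(20), classify=lambda w: w[-1])
print(preds)  # [11, 12, 13, 14, 15, 16, 17, 18, 19]
```

Because the newest frame is always the present moment, the high saliency on late frame indices means "the model trusts the most recent motion", not "the model waits for the punch to finish".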

5.3.4.2 CDE AI Fair, Real-World Deployment

RI-1 to RI-8

As introduced in the GUI testing section (5.1), BoxBunny was deployed at the "Robotics Meets AI Showcase" in late January 2025 for public interaction. From a Robot Intelligence perspective, the key context is that the Transformer-based action prediction model was not yet ready; instead, a simpler HSV (Hue-Saturation-Value, a colour space that separates colour from brightness) colour-based glove tracking system was implemented, requiring users to wear red and green gloves. The prototype ran 9 ROS 2 packages with 3 drill modes.
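The core of HSV glove tracking is a hue/saturation/value threshold followed by a centroid. A NumPy-only sketch of the idea (the prototype used OpenCV; the thresholds and helper name here are illustrative):

```python
import numpy as np

def glove_centroid(hsv, h_lo, h_hi, s_min=80, v_min=60):
    """Centroid (row, col) of pixels whose hue lies in [h_lo, h_hi]
    (OpenCV-style 0-179 hue). Returns None when nothing matches."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    if h_lo <= h_hi:
        hue_ok = (h >= h_lo) & (h <= h_hi)
    else:                                # red wraps around the hue circle
        hue_ok = (h >= h_lo) | (h <= h_hi)
    mask = hue_ok & (s >= s_min) & (v >= v_min)
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return float(rows.mean()), float(cols.mean())

# Synthetic HSV frame: a green patch (hue ~60) on a black background.
frame = np.zeros((100, 100, 3), dtype=np.uint8)
frame[40:60, 70:90] = (60, 200, 200)
print(glove_centroid(frame, 50, 70))  # (49.5, 79.5)
```

The fair exposed the weakness of this approach: any background object in the same hue band passes the threshold and drags the centroid, which is exactly the failure documented in the Findings below.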

Procedure

An informal field deployment with public visitors, not a controlled study. The setup is summarised in the five blocks below:

Setting
BoxBunny showcase booth at the "Robotics Meets AI Showcase", January 2025. Running live for the full duration of the fair.
Participants
Dozens of public visitors, varying ages, heights, and skill levels. A much wider range of body types and movement styles than any lab test could cover.
Task flow
30-second briefing → put on red/green tracking gloves → try the three drills (reaction, shadow sparring, defence) at their own pace.
Data collection
Live observation by the team: CV stream stability, user posture, and failure modes noted the moment they appeared. Captured as field notes and cross-referenced against software logs at the end of the day.
Outcome
6 distinct, reproducible failure modes that each drove a concrete architectural change in the final system, documented in the Findings section below.

Findings

Despite being a simpler prototype, the live deployment exposed real-world challenges that directly shaped the final system. Each issue below caused a visible failure in front of public users at the fair, and each one drove a concrete change to the final architecture.

What failed at the fair → What we built instead

1. ✕ Background clutter: people walking behind the user triggered false punch detections.
   ✓ Nearest-person tracking: the closest and largest person in frame is now prioritised.
2. ✕ Colour tracking failure: HSV glove tracking confused red/green gloves with similarly coloured background objects.
   ✓ Transformer action model: colour tracking replaced entirely with the deep model in 5.3.1.
3. ✕ Variable punching stances: fixed velocity thresholds for reaction time were unreliable across users with different stances.
   ✓ Depth-based proximity: reaction time is now measured by hand depth crossing a calibrated threshold, stance-independent.
4. ✕ LLM cold start: the first coaching prompt of every session paused for several seconds while the model loaded.
   ✓ Preload at GUI startup: the model is loaded when the GUI starts, so the first prompt is already warm.
5. ✕ Touchscreen input friction: typing on the 7″ touchscreen with boxing gloves was practically impossible.
   ✓ Phone dashboard: the phone dashboard (5.3.3) provides a natural mobile interface for chat and configuration.
6. ✕ Camera resource contention: multiple CV consumers each opening their own camera handle caused USB contention and dropped frames.
   ✓ Single camera owner: a single cv_node owns the camera and publishes to shared ROS topics.

Most notably, the colour-tracking failure was the moment that committed the project to building the Transformer-based action recognition model from scratch; the single biggest design decision in 5.3 traces directly back to a few hours at this fair.

For reference, the three diagrams below show the CDE Fair prototype's internals: the simpler ROS 2 node graph (9 packages versus the final system's 10), the HSV glove-tracking CV pipeline that the Transformer model later replaced, and the state flow of the reaction drill that was used live with public visitors.

5.3.4.3 Testing Infrastructure

RI-1 to RI-8

Testing followed the classic bottom-up pyramid: many small, fast unit tests at the base, fewer cross-module integration tests in the middle, and a small number of end-to-end system checks at the top. Each level catches a different class of bug:

UNIT | 146 tests, pytest (3 test files, 12 shared fixtures) | Individual components: sensor fusion logic, pad constraints, defence classification, database, user management, presets, XP, streaks. Many, fast, deterministic.
INTEGRATION | 28 tests, custom script | Cross-module communication: config loading, ROS message fields, fusion end-to-end, motor protocol.
SYSTEM | 12 sections, Jupyter notebook | Full operational workflow: hardware check, live CV, LLM, GUI, dashboard end-to-end. Few, slow, end-to-end.

Figure: Testing pyramid, 146 unit tests at the base, 28 integration tests in the middle, 12 end-to-end system sections at the top

5.3.4.4 Key Unit Tests

146 tests across 3 files with 12 shared fixtures. Key categories below (full breakdown in Appendix 7):

Test File | Tests | What It Covers
test_punch_fusion.py (Sensor Fusion) | 44 | Ring buffer (7), pad-constraint filtering (10), reclassification (8), defence classification (8), session stats (4), CV+IMU matching (3), event expiry (4)
test_gamification.py (Gamification) | 30 | Rank system (10), session XP (7), score calculation (5), achievements (5), leaderboard (3)
test_database.py (Database) | 72 | User management (13), pattern lock (5), guest sessions (5), presets (7), training sessions (8), gamification persistence (7), plus shared fixture and edge-case tests
python3 -m pytest tests/ -v
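To show the shape of these unit tests, a self-contained sketch in the style of the ring-buffer cases (the class and test names are illustrative stand-ins, not the project's actual code):

```python
from collections import deque

class EventRingBuffer:
    """Fixed-capacity buffer of recent punch events; oldest entries drop first."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)

    def push(self, event):
        self._buf.append(event)

    def latest(self, n):
        return list(self._buf)[-n:]

    def __len__(self):
        return len(self._buf)

def test_overflow_drops_oldest():
    buf = EventRingBuffer(capacity=3)
    for event in ["jab", "cross", "hook", "uppercut"]:
        buf.push(event)
    assert len(buf) == 3                                   # capacity respected
    assert buf.latest(3) == ["cross", "hook", "uppercut"]  # oldest event gone

test_overflow_drops_oldest()  # pytest would collect this automatically
print("ok")
```

Tests at this level are deterministic and hardware-free, which is what keeps the 146-test suite fast enough to run on every change.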

5.3.4.5 Integration Testing

28 cross-module tests verify that components work correctly together:

What It Tests | Why It Matters
Config loading & YAML validation | All config files parse correctly at startup
Pad-constraint mapping | Each punch type → correct pad, verified end-to-end
CV + IMU fusion | Temporal matching and reclassification work across modules
21 ROS message fields | Every custom message type has correct fields and types
Motor protocol | Punch codes 1–6 map correctly to motor commands
Reaction time logic | Detection pipeline triggers correctly on depth-based approach
python3 notebooks/scripts/test_integration.py
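As an illustration of the motor-protocol check, a sketch of a round-trip test over punch codes 1–6 (the mapping and command format here are assumptions for illustration; the real table lives in the project's config files):

```python
# Assumed punch-type -> code table for illustration only.
PUNCH_TO_CODE = {"jab": 1, "cross": 2, "left_hook": 3,
                 "right_hook": 4, "left_uppercut": 5, "right_uppercut": 6}

def motor_command(punch):
    """Translate a punch type into a motor command string (format assumed)."""
    code = PUNCH_TO_CODE.get(punch)
    if code is None:
        raise ValueError(f"unknown punch type: {punch!r}")
    return f"STRIKE:{code}"

# Integration-style round trip: every punch type maps to a valid code 1-6.
for punch, code in sorted(PUNCH_TO_CODE.items()):
    assert motor_command(punch) == f"STRIKE:{code}"
    assert 1 <= code <= 6
print("all 6 punch codes map cleanly")
```

The real integration test additionally checks the mapping against the loaded YAML config and the ROS message fields, so a config edit that breaks the table fails before it reaches hardware.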

5.3.4.6 System-Level Testing

Teensy Simulator

A GUI-based simulator emulating the Teensy 4.0 with 4 IMU sensors (4 pads), enabling full system testing without physical hardware. Supports manual mode (click to inject pad impacts with configurable force) and auto mode (predefined punch sequences for reproducible testing).

ros2 launch boxbunny_core boxbunny_dev.launch.py

The two screenshots below show two different things. On the left is the Teensy simulator window itself: click any pad or arm at any force level and the system reacts as if a real IMU registered the strike. On the right is the touchscreen GUI's preset overlay, which a user opens by hitting the head pad and navigates with left/right pad hits to browse and start a saved training preset hands-free. The video below the screenshots shows the head-pad open, left/right scroll, and centre-pad confirm flow in action.

Demo: head pad opens the preset overlay, left/right pads scroll between presets, centre pad confirms and starts the session

Hardware detection and dual-mode operation

A key design challenge was preventing the simulator from interfering with real hardware. Without awareness of whether the physical Teensy is connected, the simulator would publish its own IMU events alongside the real ones, causing double-counted punches, conflicting motor commands, and session state confusion.

The simulator solves this with a _teensy_connected flag. It subscribes to the real hardware topics (/robot/strike_detected, motor_feedback) and listens for incoming messages. When real messages are detected, the flag is set and the simulator stops publishing its own data, switching to a passive monitoring mode that allows it to run alongside real hardware for monitoring without interference. If no real messages arrive for 3 seconds, it reverts to active simulation. This gives the simulator two distinct roles:

No hardware connected (development)
Simulator actively publishes IMU events and motor commands, emulating what the real Teensy would send. This allows full end-to-end testing without physical hardware.
Hardware connected (production)
Simulator goes passive and only listens. It runs alongside real hardware for monitoring without interference, without sending duplicate commands.
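The _teensy_connected logic can be sketched as a presence watchdog with a 3-second timeout (class and method names are illustrative; the real node drives this from its ROS 2 subscriptions):

```python
import time

class TeensyPresenceWatchdog:
    """Go passive while real Teensy traffic is arriving; revert to active
    simulation after a quiet period (timeout mirrors the 3 s in the text)."""
    TIMEOUT_S = 3.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_real_msg = None

    def on_real_message(self):
        # Called whenever a message arrives on the real hardware topics.
        self._last_real_msg = self._clock()

    @property
    def teensy_connected(self):
        if self._last_real_msg is None:
            return False
        return (self._clock() - self._last_real_msg) < self.TIMEOUT_S

    def should_publish(self):
        return not self.teensy_connected  # simulate only while hardware is silent

# Deterministic walkthrough with a fake clock.
now = [0.0]
wd = TeensyPresenceWatchdog(clock=lambda: now[0])
print(wd.should_publish())   # True  -- no hardware seen yet
wd.on_real_message()
print(wd.should_publish())   # False -- real traffic: go passive
now[0] += 3.5
print(wd.should_publish())   # True  -- 3 s of silence: simulate again
```

Injecting the clock keeps the timeout behaviour unit-testable without real waiting, which matches the project's hardware-free testing philosophy.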

Full robot system validation

The final level of testing ran the complete system on the physical robot using boxbunny_full.launch.py with the real camera, 4 IMU pads, and motors. This is the only test tier that verifies end-to-end behaviour that cannot be tested in simulation alone.

CV + Fusion + Motors
CV pipeline detecting real punches, fusion engine confirming them against real IMU impacts, and the sparring engine commanding the physical arm to strike.
Yaw + Height tracking
Yaw motor continuously tracking a real user moving laterally, and height adjustment responding correctly to user commands via the phone dashboard.
Audio + GUI + Dashboard
Round bell sounds, countdown audio, coaching tip delivery, touchscreen responsiveness with gloved hands, and phone dashboard syncing live session data.
Session lifecycle
Full idle → countdown → active → rest → complete cycle, verifying that all nodes transition correctly and session data is written to the database.

During these sessions, the Teensy simulator ran in passive mode alongside the real hardware, monitoring the commands flowing through the system without interference. When the robot threw a jab, the corresponding strike command was visible in the ROS topic stream; when the yaw motor tracked the user, the yaw commands matched the user's position. Any mismatch between the robot's physical behaviour and the commands in the pipeline would indicate a routing issue.

Note: while the full robot test verified that all subsystems work together correctly, further rigorous user testing is needed to uncover edge-case bugs and fine-tune the experience. Response timings, audio levels, touchscreen usability between rounds, and coaching tip relevance across skill levels are all areas where extended real-user sessions would surface improvements that bench testing alone cannot catch. These are discussed further in Section 6.

Master Notebook (boxbunny_runner.ipynb)

A 12-section Jupyter notebook serving as the primary operational testing orchestrator:

# | Section | What It Does
1 | Build & Setup | colcon build, dependency check
2 | Unit Tests | Run the full 146-test pytest suite
3 | System Check | Hardware verification (camera, CUDA, models, database)
4 | Launch System | Start all ROS 2 nodes + Teensy simulator + GUI
5 | Stop System | Graceful shutdown
6 | GUI Test | Visual inspection of all 24 pages
7 | Phone Dashboard | Start server + public tunnel + QR code
8 | CV Model Live Test | Camera feed with pose skeleton + action label overlay
9 | LLM Coach Test | Interactive AI coach chat GUI
10 | Build Vue Frontend | Rebuild SPA after changes
11 | Sound Test | Play all 18 audio effects
12 | Demo Profiles | User cards + percentile rankings

Launch Configurations

Command | Mode | When to Use
boxbunny_dev.launch.py | Development | Uses Teensy simulator, no physical hardware needed
boxbunny_full.launch.py | Production | Full system with real camera, IMU, and motors
headless.launch.py | Headless | Processing nodes only, no GUI (for automated testing)

5.3.4.7 Verification Summary

RI-1 to RI-8
8 / 8
Robot Intelligence requirements
All requirements verified through dedicated tests, benchmarks, and live deployment.
Req | Criterion | Test Method | Result | Status
RI-1 | 8 actions at ≥30 FPS | TensorRT FP16 benchmark on Jetson | ~24 ms/frame ≈ 42 FPS theoretical, 30 FPS sustained | Pass
RI-2 | Latency ≤150 ms | End-to-end measurement (camera → motor) | ~120 ms (parallel CV + ROS pipeline) | Pass
RI-3 | Accuracy ≥90% on unseen person | Holdout validation (221 samples, unseen person) | 96.8% (v11 deployed); v10 scored 97.3% on validation, but v11 included the block annotation fix and performed better in live testing | Pass
RI-4 | CV+IMU fusion | 44 fusion unit tests + 28 integration tests + live sparring | All tests pass, >85% IMU confirmation rate | Pass
RI-5 | On-device inference | Jetson deployment (CV + LLM simultaneous) | All models fit in ~2.5 GB of 16 GB shared memory | Pass
RI-6 | Multiple AI styles | 5 Markov styles tested across 3 difficulty levels | All 5 styles functional with weakness tracking | Pass
RI-7 | Local LLM coaching | Live session tip generation + fallback test | Tips every ~18 s, 65 fallback tips operational | Pass
RI-8 | Person detection & user tracking | Teensy simulator robot status row (5.3.4.6); yaw and height commands verified during live sessions | Yaw tracks the user laterally; height data available for user-controlled adjustment via phone or GUI | Pass

Table: Requirements verification summary, all 8 requirements met

5.3.4.8 Limitations

All requirements were met, but several limitations remain that scope out the natural next direction for the system. The highlighted rows below are the principal limitations: the ones that most shape what the system can and cannot do today, and the ones carried forward into the overall project discussion in Section 6.2.

Limitation | Impact | Mitigation / Future Work
Annotation excludes retraction | Punch annotations end at full extension, so the model wobbles during the retraction phase; the IMU + pad-constraint filter in 5.3.2.1 currently absorbs these misclassifications | Add an explicit retraction class so predictions stay correct end-to-end and the IMU filter becomes redundant rather than load-bearing
CV model has a limited action vocabulary | The action prediction model classifies only 8 actions (jab, cross, left/right hook, left/right uppercut, block, idle): no slips, no head-vs-torso target distinction, no defensive footwork. Downstream coaching insight is bounded by what these 8 labels can express | Expand the CV model's class set with slips, head/body target variants, and defensive movements; richer labels feed every other part of the system (sparring AI, analytics, LLM coach) and unlock more dynamic feedback
Pose estimator not trained for boxing | YOLO Pose is pretrained on COCO, where hands and wrists are bare and clearly visible. It has no concept of a gloved hand and no training on this camera angle, so wrist keypoints are unreliable exactly when they matter most; the voxel branch currently compensates | Fine-tune a custom pose estimator on our own recorded boxing footage with manually annotated keypoints: gloved hands, this camera viewpoint, our pose set. A pose model trained for this context would lift the load off the voxel branch and improve accuracy further
Camera field of view | The RealSense D435i's field of view at the robot-mounted distance only captures the upper body. When the user steps in or extends a punch, hands can leave the frame, and the lower body (footwork, stance width, weight transfer) is never visible. These are critical coaching signals the CV pipeline currently cannot provide | Move to a wider field-of-view depth camera (or a slightly different mount angle) to capture the full body, unlocking lower-body analytics and full punch travel without giving up the simplicity of a single camera. Multi-camera was deliberately ruled out as impractical for a portable system
Manual camera tilt calibration | The D435i's built-in IMU requires the hid_sensor_hub kernel module, which isn't in the Tegra kernel, so the camera's tilt has to be set manually at launch via --camera-pitch | Mount a small external IMU on the camera bracket (e.g. an MPU6050 wired through the existing Teensy) to read the tilt automatically and feed it into the CV node

Highlighted rows are the principal limitations carried into the project-wide discussion in Section 6.2. The two unhighlighted rows are smaller, more contained engineering opportunities that don't materially gate the system's current capability.