
5.3.4 Testing & Validation

Verifies RI-1 to RI-7

The Robot Intelligence subsystem was validated through model performance and interpretability, early prototype deployment at the CDE AI Fair, comprehensive automated testing, and final system validation on the full robot with real hardware.

V-Model alignment

Following the Systems Engineering V-Model introduced in Section 3.2, this page covers the right side of the V, closing each RI-1 to RI-7 requirement against a concrete verification activity at three levels, followed by a closure summary:

Component & cross-module. 146 unit + 28 integration tests (5.3.4.3–5.3.4.5).

Integration in the field. CDE AI Fair deployment (5.3.4.2) surfaced real-world issues on an earlier prototype.

System-level. Full robot test with real camera, IMU, and motors (5.3.4.6).

Closure. Section 5.3.4.7 ties every requirement to its result.

Validation methodology and its limits

Quantitative evidence. Model graded with hard numbers: 96.8% accuracy on an unseen person, 94.2% macro F1, ~120 ms end-to-end latency, 30 fps throughput, and 146 unit + 28 integration tests with reproducible pass/fail.

Observational evidence. Integrated behaviours (fusion, yaw tracking, reaction-time drill) validated by running the full system end-to-end and checking it matches the design. Evidence lives in the 5.3.4.5 video clips.

Out of scope. Product-grade validation (coaching-tip relevance, fusion precision against labelled ground truth, tracking angular error across users) requires a structured pilot study, flagged as future work in 5.3.4.8 and Section 6.3.

5.3.4.1 Model Validation & Interpretability

RI-3

This section presents how well the final model performs and provides visual evidence of what the model has learned internally. The goal is to verify that the model is making decisions for the right reasons, not just memorising patterns.

Overall Performance

Final Model (v11), Headline Results
96.8%
validation accuracy
on an unseen person
94.2%
macro F1-score
(class-balanced)
7 / 221
samples misclassified
on the validation set

Macro F1 averages F1 across all classes equally regardless of class size, which is a fairer measure than raw accuracy when classes are imbalanced. Model v10 reached 97.3% on the same split but had poorer live block detection, so v11 was selected for deployment after the block annotation fix.
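To make the difference between raw accuracy and macro F1 concrete, the sketch below computes both from a confusion matrix. The 3-class matrix is a made-up toy example, not the v11 confusion matrix, and the function names are hypothetical. It only illustrates how one small, poorly classified class pulls macro F1 down while accuracy stays high.

```python
from typing import List, Sequence


def per_class_f1(confusion: Sequence[Sequence[int]]) -> List[float]:
    """F1 per class from a square confusion matrix (rows = true, cols = predicted)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                         # true k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp    # predicted k, true other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(confusion: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


def accuracy(confusion: Sequence[Sequence[int]]) -> float:
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy data (ASSUMPTION, illustrative only): two large classes, one small weak class.
toy = [
    [90, 0, 0],
    [0, 90, 0],
    [4, 0, 2],
]

print(f"accuracy = {accuracy(toy):.3f}, macro F1 = {macro_f1(toy):.3f}")

# Headline figure from the text: 7 of 221 validation samples misclassified.
headline_accuracy_pct = 100 * (1 - 7 / 221)
```

On the toy matrix, accuracy stays near 98% while macro F1 falls to roughly 83%, because the weak class counts for a full third of the macro average. The last line reproduces the 96.8% headline from the 7 / 221 misclassification count.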

Classification Results

The confusion matrix (left) shows where the model gets confused: each cell counts how often a true class (row) was predicted as another (column). The strong diagonal means the model is right most of the time, and the few off-diagonal cells cluster between biomechanically similar actions: left hook confused with left uppercut (both swing the left arm), and block confused with idle (both involve minimal movement).

The per-class accuracy chart (right) shows the hit rate per punch type. Most classes exceed 96%, with right hook and right uppercut at a perfect 100%. The notable exception is the block class, which is addressed in the callout below.

Block class, the hardest action to classify

Problem: block is the weakest class at 66.7%. The start of a block (arms rising from guard) looks nearly identical to the start of an uppercut (arms rising into the punch) for the first few frames.

Insight: the model only ever saw the onset of a block during training, so it had no way to disambiguate it from an uppercut without seeing the rest of the action.

Fix & live impact: the block annotation fix in 5.3.1 extended annotations to cover the full guard sequence (rise, hold, return). The validation headline number stayed in the same range, but live block reliability improved significantly. The gain came entirely from the annotation, not the model.

Training Convergence

The dashboard below tracks how the model improved during training. The loss (top-left) measures how wrong the model's predictions are; lower is better. The accuracy (top-right) shows the percentage of correct predictions on both training data (blue) and unseen validation data (pink). A widening gap between the two lines would indicate overfitting; here the gap stays narrow, confirming the model generalises well rather than memorising training examples. The model reached its best performance at epoch 73 out of 133.

What the Model Learned (t-SNE Visualisation)

The model's internal 384-dimensional representations are compressed to 2D using t-SNE (a dimensionality-reduction method that preserves local similarity). Each dot is a punch sample coloured by its true class. If the model has learned meaningful features, same-class dots should cluster together and different classes should separate.

The plot below confirms exactly that: tight, well-separated clusters for each action class. The only clusters that partially overlap are left hook / left uppercut, the same pair that confuses the model in the confusion matrix above. The two visualisations agree.

Which Frames Matter Most

Gradient saliency measures how much the model's output changes with respect to each input, revealing which frames in the 12-frame window the model relied on most. The chart below shows a clear pattern: later frames are significantly more important than early ones. The most recent frame (index 11) scores highest, with frames 6–9 close behind, while the first few frames (indices 1–4) contribute the least.

Important: this is not the same as "predicting after the punch ends". The 12-frame window is a rolling causal window that always covers the most recent ~400 ms of motion. As a punch unfolds, that window slides forward with it, so “the latest frame in the window” is always the present moment, not the end of the punch. A new prediction is emitted every frame while the punch is still in progress.
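The rolling causal window can be sketched in a few lines. This is a minimal illustration, not the deployed inference loop: `classify` is a hypothetical stand-in for the action model, and the 30 fps rate is taken from the throughput figure quoted earlier on this page.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Tuple

WINDOW_FRAMES = 12  # window length stated in the text
FPS = 30            # throughput stated in the text


def window_span_ms(frames: int = WINDOW_FRAMES, fps: int = FPS) -> float:
    """Duration of motion covered by the window, in milliseconds."""
    return frames * 1000.0 / fps


def stream_predictions(
    frames: Iterable, classify: Callable[[List], object]
) -> Iterator[Tuple[int, object]]:
    """Yield (frame_index, prediction) for every frame once the window is full.

    The deque drops the oldest frame automatically, so the window always ends
    at the newest frame and never looks into the future (causal).
    """
    window = deque(maxlen=WINDOW_FRAMES)
    for i, frame in enumerate(frames):
        window.append(frame)
        if len(window) == WINDOW_FRAMES:
            yield i, classify(list(window))


# Hypothetical classifier: just report the oldest and newest frame it was given.
preds = list(stream_predictions(range(20), lambda w: (w[0], w[-1])))
print(window_span_ms(), preds[0], preds[-1])
```

With 20 input frames the first prediction appears at frame 11 (window 0 to 11) and one more follows on every later frame, each time with the newest frame as the last element of the window.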

What this chart actually says is that the model makes its decisions based on the most recent ~200 ms of motion rather than the deeper history. That is exactly what action prediction needs: the model commits to a punch type as soon as the discriminative motion appears in the recent past, not after the punch retracts. The very early frames in the window mostly capture the preparation phase before the discriminative motion has begun, so they reasonably contribute less.

Robustness to glove presence

The training dataset was recorded with users wearing boxing gloves. A natural concern is whether the model generalises to users without gloves, since a deployed system cannot assume every user will be gloved. The two clips below run live inference on the deployed model with and without gloves to test this directly.

Figure: Live inference with gloves (in-distribution)
Figure: Live inference without gloves (out-of-distribution)

Because the pipeline is built on pose estimation rather than colour-based glove tracking, inference still produces reasonable predictions on a bare-handed user despite that configuration never appearing in the training set. Accuracy does drop slightly without gloves (the glove's larger visible footprint evidently gave the model a useful cue during training), but the degradation is graceful rather than catastrophic. This validates the choice to move away from the HSV glove-tracking approach used at the CDE AI Fair: the pose-based model is robust to the thing that prototype was most brittle to. A straightforward expansion of the dataset with glove-less footage would almost certainly close the remaining gap.

Height-alignment dependency

Training data was collected with users filling the top half of the frame, so the two clips below compare live inference inside and outside that trained framing.

Figure: Aligned, predictions are accurate
Figure: Unaligned, accuracy drops sharply

Correct camera height is therefore not a comfort feature but a precondition for the CV model to operate inside its trained distribution, reinforcing the role of the adjustable height mechanism in 5.2. Real boxing involves opponents of different heights, so robustness to framing variation is carried forward as a limitation (5.3.4.8) and as future work (Section 6.3).

5.3.4.2 CDE AI Fair, Real-World Deployment

RI-1 to RI-7

As introduced in the GUI testing section (5.1), BoxBunny was deployed at the "Robotics Meets AI Showcase" in late January 2025 for public interaction. From a Robot Intelligence perspective, the key context is that the Transformer-based action prediction model was not yet ready. Instead, a simpler HSV (Hue-Saturation-Value, a colour space that separates colour from brightness) colour-based glove tracking system was implemented, requiring users to wear red and green gloves. The prototype ran 9 ROS 2 packages with 3 drill modes.

Procedure

An informal field deployment with public visitors, not a controlled study. The setup is summarised in the five blocks below:

Setting
BoxBunny showcase booth at the "Robotics Meets AI Showcase", January 2025. Running live for the full duration of the fair.
Participants
Dozens of public visitors, varying ages, heights, and skill levels. A much wider range of body types and movement styles than any lab test could cover.
Task flow
30-second briefing → put on red/green tracking gloves → try the three drills (reaction, shadow sparring, defence) at their own pace.
Data collection
Live observation by the team of CV stream stability, user posture, and failure modes the moment they appeared. Captured as field notes and cross-referenced against software logs at end of day.
Outcome
6 distinct, reproducible failure modes that each drove a concrete architectural change in the final system, documented in the Findings section below.

Findings

Despite being a simpler prototype, the live deployment exposed real-world challenges that directly shaped the final system. Each issue below caused a visible failure in front of public users at the fair, and each one drove a concrete change to the final architecture.

✕ What failed at the fair | ✓ What we built instead
1. Background clutter: People walking behind the user triggered false punch detections. | 1. Nearest-person tracking: The closest and largest person in frame is now prioritised.
2. Colour tracking failure: HSV glove tracking confused red/green gloves with similarly coloured background objects. | 2. Transformer action model: Colour tracking replaced entirely with the deep model in 5.3.1.
3. Variable punching stances: Fixed velocity thresholds for reaction time were unreliable across users with different stances. | 3. Depth-based proximity: Reaction time now measured by hand depth crossing a calibrated threshold, stance-independent.
4. LLM cold start: First coaching prompt of every session paused several seconds while the model loaded. | 4. Preload at GUI startup: The model is preloaded at GUI startup so the first prompt is already warm.
5. Touchscreen input friction: Typing on the 7″ touchscreen with boxing gloves was practically impossible. | 5. Phone dashboard: The phone dashboard (5.3.3) provides a natural mobile interface for chat and configuration.
6. Camera resource contention: Multiple CV consumers each opening their own camera handle caused USB contention and dropped frames. | 6. Single camera owner: A single cv_node owns the camera and publishes to shared ROS topics.
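The nearest-person rule (fix 1 above) can be illustrated with a short selection function. This is a sketch under stated assumptions: the `Detection` fields and the tie-breaking order (closest first, then largest bounding box) are hypothetical, since the page only says that the closest and largest person is prioritised.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Detection:
    """Hypothetical per-person detection record."""
    person_id: int
    depth_m: float       # distance from the camera, metres
    bbox_area_px: int    # bounding-box area, pixels


def select_target(detections: Sequence[Detection]) -> Optional[Detection]:
    """Pick the person to track: closest first, larger box breaks ties.

    ASSUMPTION: the real system may weight depth and size differently.
    """
    if not detections:
        return None
    return min(detections, key=lambda d: (d.depth_m, -d.bbox_area_px))


user = Detection(person_id=1, depth_m=1.2, bbox_area_px=90_000)
passer_by = Detection(person_id=2, depth_m=3.5, bbox_area_px=20_000)
target = select_target([passer_by, user])
print(target.person_id)
```

In this example the user at 1.2 m is chosen over the passer-by at 3.5 m, which is the behaviour that removes background-clutter false positives.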

Most notably, the colour-tracking failure was the moment that committed the project to building the Transformer-based action recognition model from scratch: the single biggest design decision in 5.3 traces directly back to a few hours at this fair.

For reference, the three diagrams below show the CDE Fair prototype's internals: the simpler ROS 2 node graph (9 packages versus the final system's 10), the HSV glove-tracking CV pipeline that the Transformer model later replaced, and the state flow of the reaction drill that was used live with public visitors.

5.3.4.3 Testing Infrastructure

RI-1 to RI-7

Testing followed the classic bottom-up pyramid: many small, fast unit tests at the base, fewer cross-module integration tests in the middle, and a small number of end-to-end system checks at the top. Each level catches a different class of bug:

SYSTEM (few, slow, end-to-end): 12 sections, Jupyter notebook. Full operational workflow: hardware check, live CV, LLM, GUI, dashboard end-to-end.
INTEGRATION: 28 tests, custom script. Cross-module communication: config loading, ROS message fields, fusion end-to-end, motor protocol.
UNIT (many, fast, deterministic): 146 tests, pytest, 12 shared fixtures, 3 test files. Individual components: sensor fusion logic, pad constraints, defence classification, database, user mgmt, presets, XP, streaks.

Figure: Testing pyramid, 146 unit tests at the base, 28 integration tests in the middle, 12 end-to-end system sections at the top

5.3.4.4 Key Unit Tests

146 tests across 3 files with 12 shared fixtures. Key categories below (full breakdown in Appendix 7):

Test File | Tests | What It Covers
Sensor Fusion (test_punch_fusion.py) | 44 | Ring buffer (7), pad-constraint filtering (10), reclassification (8), defence classification (8), session stats (4), CV+IMU matching (3), event expiry (4)
Gamification (test_gamification.py) | 30 | Rank system (10), session XP (7), score calculation (5), achievements (5), leaderboard (3)
Database (test_database.py) | 72 | User management (13), pattern lock (5), guest sessions (5), presets (7), training sessions (8), gamification persistence (7), plus shared fixture and edge-case tests
python3 -m pytest tests/ -v
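As an illustration of the style of test in the ring-buffer category, the sketch below shows two pytest-discoverable tests against a stand-in buffer. `RingBuffer` here is hypothetical and written only for this example; the real class, its API, and the actual test names in test_punch_fusion.py are not shown on this page.

```python
from collections import deque
from typing import Any, List


class RingBuffer:
    """Hypothetical fixed-capacity buffer standing in for the fusion buffer."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = deque(maxlen=capacity)

    def push(self, item: Any) -> None:
        # deque(maxlen=...) silently discards the oldest item when full.
        self._items.append(item)

    def items(self) -> List[Any]:
        return list(self._items)

    def __len__(self) -> int:
        return len(self._items)


def test_ring_buffer_drops_oldest_when_full() -> None:
    buf = RingBuffer(capacity=3)
    for value in (1, 2, 3, 4):
        buf.push(value)
    assert buf.items() == [2, 3, 4]
    assert len(buf) == 3


def test_ring_buffer_rejects_zero_capacity() -> None:
    try:
        RingBuffer(capacity=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for capacity=0")
```

Because the functions follow the `test_*` naming convention and use plain `assert`, pytest would collect them without any extra configuration.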

5.3.4.5 Integration Testing

28 cross-module tests verify that components work correctly together:

What It Tests | Why It Matters
Config loading & YAML validation | All config files parse correctly at startup
Pad-constraint mapping | Each punch type → correct pad verified end-to-end
CV + IMU fusion | Temporal matching and reclassification work across modules
21 ROS message fields | Every custom message type has correct fields and types
Motor protocol | Punch codes 1–6 map correctly to motor commands
Reaction time logic | Detection pipeline triggers correctly on depth-based approach
python3 notebooks/scripts/test_integration.py

CV + IMU fusion validation

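The clip below runs the full sensor-fusion pipeline from 5.3.2.1 live, with the Teensy simulator in hardware mode: the pads shown on-screen mirror real IMU impacts on the physical pads, not synthetic events. Screen layout: the right panel is the Teensy simulator (IMU), the top-left panel shows fusion-filtered punches, and the bottom-left panel shows raw CV predictions.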

A confirmed punch only emerges when a CV prediction and an IMU impact agree, which is exactly what the fusion rules in 5.3.2.1 specify.

Figure: CV + IMU fusion validation, right panel = Teensy simulator (IMU), top-left = fusion-filtered punches, bottom-left = raw CV predictions

This run is easy to quantify because the input is scripted: a fixed number of punches is thrown, and the fused output at the end of the clip shows exactly that count with no spurious additions. The IMU stream registers only the intended strikes on the pads: post-impact pad oscillation is correctly rejected as noise rather than counted as extra hits. Fusion therefore yields 100% agreement between the scripted input and the confirmed punch output for this run. The scope is a scripted bench test rather than a user study, so this is evidence of correct end-to-end behaviour rather than a precision/recall figure across a diverse user sample (see 5.3.4.8).
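The agreement rule (a punch is confirmed only when a CV prediction and an IMU impact agree) can be sketched as a one-to-one temporal match with a pad-constraint check. Everything numeric here is an assumption for illustration: the 0.2 s matching window, the pad names, and the punch-to-pad table are hypothetical and are not the values used in 5.3.2.1.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CvPrediction:
    t: float        # timestamp, seconds
    action: str


@dataclass(frozen=True)
class ImuImpact:
    t: float        # timestamp, seconds
    pad: str


# ASSUMPTION: illustrative pad constraints, not the project's real mapping.
VALID_PADS = {
    "jab": {"centre", "head"},
    "left_hook": {"left"},
    "right_hook": {"right"},
}
MATCH_WINDOW_S = 0.2  # ASSUMPTION: illustrative matching window


def fuse(
    cv_preds: Sequence[CvPrediction],
    impacts: Sequence[ImuImpact],
    window_s: float = MATCH_WINDOW_S,
) -> List[Tuple[str, str, float]]:
    """Return confirmed (action, pad, impact_time) tuples.

    Each CV prediction can confirm at most one impact, so a second impact
    shortly after the first is not counted again in this sketch.
    """
    confirmed = []
    used = set()
    for impact in sorted(impacts, key=lambda i: i.t):
        best = None
        for idx, pred in enumerate(cv_preds):
            if idx in used or abs(pred.t - impact.t) > window_s:
                continue
            if impact.pad not in VALID_PADS.get(pred.action, set()):
                continue  # pad constraint violated
            if best is None or abs(pred.t - impact.t) < abs(cv_preds[best].t - impact.t):
                best = idx
        if best is not None:
            used.add(best)
            confirmed.append((cv_preds[best].action, impact.pad, impact.t))
    return confirmed


cv = [CvPrediction(1.00, "jab"), CvPrediction(2.00, "left_hook"), CvPrediction(3.00, "jab")]
imu = [
    ImuImpact(1.05, "centre"),  # agrees with the jab at t=1.00
    ImuImpact(1.10, "centre"),  # repeat hit on the same pad, prediction already used
    ImuImpact(2.03, "right"),   # wrong pad for a left hook
    ImuImpact(5.00, "centre"),  # no CV prediction nearby
]
result = fuse(cv, imu)
print(result)
```

Only the first impact is confirmed in this example; the other three are rejected for a used prediction, a pad mismatch, and a missing CV prediction respectively.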

Reaction-time drill validation

The reaction-time drill needs to know when the user has actually committed to a punch, not just shifted their weight or moved their guard. Rather than building a separate detection pipeline for this, the same pose-estimation stream that drives the action-prediction model is reused to derive reaction time, keeping the system to a single perception pipeline. The clip below shows this in action and also serves as a sensitivity test: genuine punches trigger the drill, but small body movements, guard adjustments, and deliberate no-moves do not, confirming the threshold is tuned well enough to separate committed punch intent from incidental motion.

Figure: Reaction-time drill, pose pipeline reuse + sensitivity limit-testing (small movements correctly do not trigger)
Bonus in the same clip: head-pad preset shortcut. The clip also shows the reaction-time drill being launched hands-free through the head-pad shortcut: hitting the head pad pops open the preset window, left / right pads scroll through saved presets, centre pad confirms and starts the drill, and hitting the head pad again hides the window. This lets a gloved user start a session without touching the touchscreen. The same shortcut is documented in more depth under 5.3.4.6 Teensy Simulator below, which shows the underlying pad-input flow being driven by the simulator.
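The depth-threshold idea behind the reaction-time measurement can be sketched as follows. This is a simplified illustration, assuming the hand depth is sampled per frame and that a punch moves the hand closer to the camera (smaller depth). The threshold value, sample format, and function name are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def reaction_time_ms(
    samples: Iterable[Tuple[float, float]],
    stimulus_t: float,
    threshold_m: float,
) -> Optional[float]:
    """Milliseconds from the stimulus to the first threshold crossing.

    samples: (timestamp_s, hand_depth_m) pairs in time order.
    Returns None if the hand never comes within threshold_m after the
    stimulus, which is what a guard adjustment or a no-move should produce.
    """
    for t, depth in samples:
        if t >= stimulus_t and depth <= threshold_m:
            return (t - stimulus_t) * 1000.0
    return None


# ASSUMPTION: illustrative depths in metres; the calibrated threshold is 0.6 m here.
punch = [(0.0, 0.90), (0.1, 0.88), (0.2, 0.85), (0.3, 0.70), (0.4, 0.55)]
guard_shift = [(0.0, 0.90), (0.1, 0.88), (0.2, 0.85), (0.3, 0.86), (0.4, 0.89)]

print(reaction_time_ms(punch, stimulus_t=0.1, threshold_m=0.6))
print(reaction_time_ms(guard_shift, stimulus_t=0.1, threshold_m=0.6))
```

The committed punch crosses the threshold and yields a reaction time of about 300 ms, while the small guard movement never crosses it and returns `None`, mirroring the sensitivity behaviour shown in the clip.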

Person-tracking with yaw rotation

CV-based person tracking is integrated with the lower-mechanism yaw motor so that the robot can turn to face the user as they move around it. The clip below demonstrates the full loop: pose detection drives a yaw target, and the robot rotates smoothly to face the detected user. The rotation speed is deliberately tuned on the slow side: fast enough that a user pivoting around the robot to find a new angle is still faced by the robot naturally, but slow enough to remain safe for a user standing close in a boxing stance. This matches the safety stance taken in the lower-mechanism design (5.2).

Figure: Person-tracking integrated with yaw motor, safe rotation speed lets the robot face the user as they move
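One common way to get a deliberately slow, smooth rotation is to rate-limit each control step towards the yaw target. The sketch below shows that idea only; it is not the project's motor controller, the 20°/s limit is a made-up value, and angle wrap-around is ignored for brevity.

```python
def step_yaw(
    current_deg: float, target_deg: float, max_rate_deg_s: float, dt_s: float
) -> float:
    """Move the yaw angle towards the target, capped at max_rate_deg_s.

    ASSUMPTION: angles are treated as plain numbers (no wrap at +/-180).
    """
    max_step = max_rate_deg_s * dt_s
    error = target_deg - current_deg
    step = max(-max_step, min(max_step, error))  # clamp to the allowed step
    return current_deg + step


# ASSUMPTION: illustrative 20 deg/s limit, 10 Hz control loop.
yaw = 0.0
for _ in range(5):
    yaw = step_yaw(yaw, target_deg=30.0, max_rate_deg_s=20.0, dt_s=0.1)
print(yaw)
```

However far the user moves, the commanded angle changes by at most `max_rate_deg_s * dt_s` per step, so the speed cap holds regardless of how large the tracking error is.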

5.3.4.6 System-Level Testing

Teensy Simulator

A GUI-based simulator emulating the Teensy 4.0 with 4 IMU sensors (4 pads), enabling full system testing without physical hardware. Supports manual mode (click to inject pad impacts with configurable force) and auto mode (predefined punch sequences for reproducible testing).

ros2 launch boxbunny_core boxbunny_dev.launch.py

The two screenshots below show two different things. On the left is the Teensy simulator window itself: click any pad or arm at any force level and the system reacts as if a real IMU registered the strike. On the right is the touchscreen GUI's preset overlay, which a user opens by hitting the head pad and navigates with left/right pad hits to browse and start a saved training preset hands-free. The video below the screenshots shows the head-pad open, left/right scroll, and centre-pad confirm flow in action.

Demo, head pad opens the preset overlay, left/right pads scroll between presets, centre pad confirms and starts the session

Hardware detection and dual-mode operation

A key design challenge was preventing the simulator from interfering with real hardware. Without awareness of whether the physical Teensy is connected, the simulator would publish its own IMU events alongside the real ones, causing double-counted punches, conflicting motor commands, and session state confusion.

The simulator solves this with a _teensy_connected flag. It subscribes to the real hardware topics (/robot/strike_detected, motor_feedback) and listens for incoming messages. When real messages are detected, the flag is set and the simulator stops publishing its own data, switching to a passive mode in which it runs alongside real hardware purely for monitoring. If no real messages arrive for 3 seconds, it reverts to active simulation. This gives the simulator two distinct roles:

No hardware connected (development)
Simulator actively publishes IMU events and motor commands, emulating what the real Teensy would send. This allows full end-to-end testing without physical hardware.
Hardware connected (production)
Simulator goes passive and only listens. It runs alongside real hardware for monitoring, without interfering or sending duplicate commands.
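The active/passive switch described above amounts to a timestamp watchdog. The sketch below shows the logic in isolation, without ROS: the class and method names are hypothetical, and only the 3-second timeout comes from the text. In the real simulator, `on_real_message` would be called from the subscriptions to the hardware topics.

```python
import time
from typing import Callable, Optional


class HardwarePresence:
    """Tracks whether real Teensy messages have been seen recently."""

    TIMEOUT_S = 3.0  # from the text: revert to simulation after 3 s of silence

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_real_msg: Optional[float] = None

    def on_real_message(self) -> None:
        """Call from every subscription callback on a real hardware topic."""
        self._last_real_msg = self._clock()

    @property
    def teensy_connected(self) -> bool:
        if self._last_real_msg is None:
            return False
        return (self._clock() - self._last_real_msg) < self.TIMEOUT_S

    def should_publish_simulated(self) -> bool:
        """Active mode only when no real hardware has been heard recently."""
        return not self.teensy_connected


# Drive the watchdog with a fake clock so the example is deterministic.
now = [0.0]
presence = HardwarePresence(clock=lambda: now[0])
states = [presence.should_publish_simulated()]   # nothing heard yet: active
presence.on_real_message()
states.append(presence.should_publish_simulated())  # real message: passive
now[0] = 3.1
states.append(presence.should_publish_simulated())  # silent > 3 s: active again
print(states)
```

Injecting the clock keeps the timeout logic testable without sleeping, which is also how such a flag could be covered by a unit test.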

Full robot system validation

The final level of testing ran the complete system on the physical robot using boxbunny_full.launch.py with the real camera, 4 IMU pads, and motors. This is the only test tier that verifies end-to-end behaviour that cannot be tested in simulation alone.

CV + Fusion + Motors
CV pipeline detecting real punches, fusion engine confirming them against real IMU impacts, and the sparring engine commanding the physical arm to strike.
Yaw + Height tracking
Yaw motor continuously tracking a real user moving laterally, and height adjustment responding correctly to user commands via the phone dashboard.
Audio + GUI + Dashboard
Round bell sounds, countdown audio, coaching tip delivery, touchscreen responsiveness with gloved hands, and phone dashboard syncing live session data.
Session lifecycle
Full idle → countdown → active → rest → complete cycle, verifying that all nodes transition correctly and session data is written to the database.
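The session lifecycle checked in the last item can be expressed as a small transition table. This sketch encodes only the linear cycle named above; the real session manager may allow further transitions (for example additional rounds or aborts), which are not described on this page and are therefore left out.

```python
from typing import Dict, Sequence, Set

# Only the transitions stated in the text: idle -> countdown -> active -> rest -> complete.
TRANSITIONS: Dict[str, Set[str]] = {
    "idle": {"countdown"},
    "countdown": {"active"},
    "active": {"rest"},
    "rest": {"complete"},
    "complete": set(),
}


def advance(state: str, next_state: str) -> str:
    """Return next_state if the transition is allowed, otherwise raise."""
    if next_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state


def run_cycle(path: Sequence[str]) -> str:
    """Walk a sequence of states and return the final one."""
    state = path[0]
    for nxt in path[1:]:
        state = advance(state, nxt)
    return state


final_state = run_cycle(["idle", "countdown", "active", "rest", "complete"])
print(final_state)
```

A check of this kind makes a skipped or out-of-order state (for example idle straight to active) fail loudly instead of leaving nodes in inconsistent states.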

During these sessions, the Teensy simulator ran in passive mode alongside the real hardware, monitoring the commands flowing through the system without interference. When the robot threw a jab, the corresponding strike command was visible in the ROS topic stream; when the yaw motor tracked the user, the yaw commands matched the user's position. Any mismatch between the robot's physical behaviour and the commands in the pipeline would indicate a routing issue.

Note: While the full robot test verified that all subsystems work together correctly, further rigorous user testing is needed to uncover edge-case bugs and fine-tune the experience. Response timings, audio levels, touchscreen usability between rounds, and coaching tip relevance across skill levels are all areas where extended real-user sessions would surface improvements that bench testing alone cannot catch. These are discussed further in Section 6.

Master Notebook (boxbunny_runner.ipynb)

A 12-section Jupyter notebook serving as the primary operational testing orchestrator:

# | Section | What It Does
1 | Build & Setup | colcon build, dependency check
2 | Unit Tests | Run full 146-test pytest suite
3 | System Check | Hardware verification (camera, CUDA, models, database)
4 | Launch System | Start all ROS 2 nodes + Teensy simulator + GUI
5 | Stop System | Graceful shutdown
6 | GUI Test | Visual inspection of all 24 pages
7 | Phone Dashboard | Start server + public tunnel + QR code
8 | CV Model Live Test | Camera feed with pose skeleton + action label overlay
9 | LLM Coach Test | Interactive AI coach chat GUI
10 | Build Vue Frontend | Rebuild SPA after changes
11 | Sound Test | Play all 18 audio effects
12 | Demo Profiles | User cards + percentile rankings

Launch Configurations

Command | Mode | When to Use
boxbunny_dev.launch.py | Development | Uses Teensy simulator, no physical hardware needed
boxbunny_full.launch.py | Production | Full system with real camera, IMU, and motors
headless.launch.py | Headless | Processing nodes only, no GUI (for automated testing)

5.3.4.7 Verification Summary

RI-1 to RI-7
7 / 7
Robot Intelligence requirements
All requirements verified through dedicated tests, benchmarks, and live deployment.
Req | Criterion | Test Method | Result | Status
RI-1 | 8 actions at ≥30 FPS | TensorRT FP16 benchmark on Jetson | ~24 ms/frame = 42 fps theoretical, 30 fps sustained | Pass
RI-2 | Latency ≤150 ms | End-to-end measurement (camera → motor) | ~120 ms (parallel CV + ROS pipeline) | Pass
RI-3 | Accuracy ≥90% unseen person | Holdout validation (221 samples, unseen person) | 96.8% (v11 deployed). v10 scored 97.3% on validation but v11 was deployed as it included the block annotation fix and performed better in live testing | Pass
RI-4 | CV+IMU fusion | 44 fusion unit tests + 28 integration tests + live sparring | All tests pass, >85% IMU confirmation rate | Pass
RI-5 | On-device inference | Jetson deployment (CV + LLM simultaneous) | All models fit in ~2.5 GB of 16 GB shared memory | Pass
RI-6 | Multiple AI styles | 5 Markov styles tested across 3 difficulty levels | All 5 styles functional with weakness tracking | Pass
RI-7 | Local LLM coaching | Live session tip generation + fallback test | Tips every ~18 s, 65 fallback tips operational | Pass

Table: Requirements verification summary, all 7 requirements met
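The RI-1 and RI-2 margins follow from simple arithmetic on the reported figures, sketched here as a quick check. The function name is hypothetical; the input numbers are the ones in the table.

```python
def fps_from_frame_time(ms_per_frame: float) -> float:
    """Convert a per-frame inference time into a theoretical frame rate."""
    return 1000.0 / ms_per_frame


theoretical_fps = fps_from_frame_time(24.0)  # about 41.7, reported as 42 fps
fps_headroom = theoretical_fps - 30.0        # margin over the RI-1 target
latency_margin_ms = 150.0 - 120.0            # RI-2 limit minus measured latency

print(f"{theoretical_fps:.1f} fps theoretical, "
      f"{fps_headroom:.1f} fps headroom, {latency_margin_ms:.0f} ms latency margin")
```

The roughly 12 fps of theoretical headroom is what lets the sustained rate hold at the 30 fps target once the rest of the pipeline runs alongside inference.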

5.3.4.8 Limitations

All requirements were met, but the limitations below scope out the natural next direction for the system. Each is carried forward into the project-wide discussion in Section 6.2, with the corresponding future-work items in Section 6.3.

Limitation | Impact | Mitigation / Future Work
CV model training scope is narrow (8 classes, no retraction) | The action prediction model only labels 6 punch types, block, and idle, and its annotations stop at full extension. This caps what can be recognised (no slips, no head-vs-body targets, no footwork) and forces the IMU + pad-constraint filter in 5.3.2.1 to absorb retraction-phase wobble rather than the CV model being correct end-to-end | Add an explicit retraction class and extend the action set with slips, head/body targets, and defensive footwork. Richer labels feed every downstream module (sparring AI, analytics, LLM Coach)
Pose estimator not trained for boxing | YOLO Pose is pretrained on COCO (bare hands) and has no concept of gloved hands or this camera angle, so wrist keypoints become unreliable at punch extension. The voxel branch currently compensates for this | Fine-tune a custom pose estimator on our own gloved-hand footage. Scoped out of the current prototype because it requires a large volume of recorded footage and frame-by-frame keypoint annotation; natural next investment once the 8-class action set is extended
Camera field of view limits full-body analysis | The RealSense D435i's mounted field of view only captures the upper body. Footwork, stance width, and weight transfer, all critical coaching signals, are invisible to the CV pipeline | Move to a wider-angle depth camera (or adjusted mount angle) to capture the full body while keeping the single-camera portability of the current design
CV accuracy depends on user framing | Accuracy drops sharply when the user is not height-aligned to the camera (see 5.3.4.1). The height mechanism must be manually re-aligned per user, and sparring against differently-heighted opponents is not possible without re-aligning between rounds | Diversify training footage across camera pitches and mount offsets, and longer-term, estimate user height from CV and drive the height mechanism automatically, removing the manual alignment step

Validation gap (not a system limitation). Separately from the table above, the integrated-system behaviours validated observationally in 5.3.4.5 (fusion, yaw tracking, reaction-time drill) and the LLM Coach's post-session analysis have not yet been tested against a diverse user sample under controlled conditions. Coaching-tip relevance, fusion precision, tracking angular error, and reaction-time false-positive rates therefore remain qualitatively rather than quantitatively characterised. The proposed structured pilot to close this gap is in Section 6.3.