5.3.4 Testing & Validation
The Robot Intelligence subsystem was validated through model performance and interpretability, early prototype deployment at the CDE AI Fair, comprehensive automated testing, and final system validation on the full robot with real hardware.
Following the Systems Engineering V-Model introduced in Section 3.2, this page covers the right side of the V, closing each RI-1 to RI-7 requirement against a concrete verification activity at three levels:
- Component & cross-module. 146 unit + 28 integration tests (5.3.4.3–5.3.4.5).
- Integration in the field. CDE AI Fair deployment (5.3.4.2) surfaced real-world issues on an earlier prototype.
- System-level. Full robot test with real camera, IMU, and motors (5.3.4.6).
- Closure. Section 5.3.4.7 ties every requirement to its result.
Quantitative evidence. Model graded with hard numbers: 96.8% accuracy on an unseen person, 94.2% macro F1, ~120 ms end-to-end latency, 30 fps throughput, and 146 unit + 28 integration tests with reproducible pass/fail.
Observational evidence. Integrated behaviours (fusion, yaw tracking, reaction-time drill) validated by running the full system end-to-end and checking it matches the design. Evidence lives in the 5.3.4.5 video clips.
Out of scope. Product-grade validation (coaching-tip relevance, fusion precision against labelled ground truth, tracking angular error across users) requires a structured pilot study, flagged as future work in 5.3.4.8 and Section 6.3.
5.3.4.1 Model Validation & Interpretability
This section presents how well the final model performs and provides visual evidence of what the model has learned internally. The goal is to verify that the model is making decisions for the right reasons, not just memorising patterns.
Overall Performance
Headline metrics: 96.8% accuracy on an unseen person and 94.2% macro F1 (class-balanced), on the validation set.
Macro F1 averages the per-class F1 scores with equal weight regardless of class size, which makes it a fairer measure than raw accuracy when classes are imbalanced. Model v10 reached 97.3% accuracy on the same split but had poorer live block detection, so v11 was selected for deployment after the block annotation fix.
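To make the difference between accuracy and macro F1 concrete, here is a minimal sketch in plain Python. The toy labels are invented for illustration and are not taken from the project's validation set; they only show how a weak minority class pulls macro F1 down while raw accuracy stays high.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative toy data: 9 "idle" samples and 3 "block" samples.
# Every idle sample is correct, but 2 of the 3 blocks are predicted as idle.
y_true = ["idle"] * 9 + ["block"] * 3
y_pred = ["idle"] * 9 + ["block", "idle", "idle"]

toy_accuracy = accuracy(y_true, y_pred)   # 10 of 12 correct
toy_macro_f1 = macro_f1(y_true, y_pred)   # mean of idle F1 (0.9) and block F1 (0.5)
```

In this toy case accuracy is about 0.83 while macro F1 is 0.70, because the poorly recognised minority class carries the same weight as the majority class.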
Classification Results
The confusion matrix (left) shows where the model gets confused: each cell counts how often a true class (row) was predicted as another (column). The strong diagonal means the model is right most of the time, and the few off-diagonal cells cluster between biomechanically similar actions: left hook confused with left uppercut (both swing the left arm), and block confused with idle (both involve minimal movement).
The per-class accuracy chart (right) shows the hit rate per punch type. Most classes exceed 96%, with right hook and right uppercut at a perfect 100%. The notable exception is the block class, which is addressed in the callout below.
Problem: block is the weakest class at 66.7%. The start of a block (arms rising from guard) looks nearly identical to the start of an uppercut (arms rising into the punch) for the first few frames.
Insight: the model only ever saw the onset of a block during training, so it had no way to disambiguate it from an uppercut without seeing the rest of the action.
Fix & live impact: the block annotation fix in 5.3.1 extended annotations to cover the full guard sequence (rise, hold, return). The validation headline number stayed in the same range, but live block reliability improved significantly. The gain came entirely from the annotation, not the model.
Training Convergence
The dashboard below tracks how the model improved during training. The loss (top-left) measures how wrong the model's predictions are; lower is better. The accuracy (top-right) shows the percentage of correct predictions on both training data (blue) and unseen validation data (pink). A widening gap between the two lines would indicate overfitting. Here the gap stays narrow, which suggests the model generalises rather than memorising training examples. The model reached its best performance at epoch 73 out of 133.
What the Model Learned (t-SNE Visualisation)
The model's internal 384-dimensional representations are compressed to 2D using t-SNE (a dimensionality-reduction method that preserves local similarity). Each dot is a punch sample coloured by its true class. If the model has learned meaningful features, same-class dots should cluster together and different classes should separate.
The plot below confirms exactly that: tight, well-separated clusters for each action class. The only clusters that partially overlap are left hook / left uppercut, the same pair that confuses the model in the confusion matrix above. The two visualisations agree.
Which Frames Matter Most
Gradient saliency measures how much the model's output changes with respect to each input, revealing which frames in the 12-frame window the model relied on most. The chart below shows a clear pattern: later frames are significantly more important than early ones. The most recent frame (index 11) scores highest, with frames 6–9 close behind, while the first few frames (indices 1–4) contribute the least.
The chart shows that the model bases its decisions on the most recent ~200 ms of motion rather than the deeper history. That is exactly what action prediction needs: the model commits to a punch type as soon as the discriminative motion appears in the recent past, not after the punch retracts. The very early frames in the window mostly capture the preparation phase before the discriminative motion has begun, so they reasonably contribute less.
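The ~200 ms figure can be checked with simple arithmetic, assuming the 12-frame window is sampled at the 30 fps throughput quoted earlier on this page (an assumption; the window's sampling rate is not stated explicitly). Under that assumption the high-saliency frames 6 to 11 span six frame periods:

```python
WINDOW_FRAMES = 12   # window length stated in the text
ASSUMED_FPS = 30.0   # assumption: window sampled at the quoted 30 fps throughput


def span_ms(n_frames, fps=ASSUMED_FPS):
    """Time covered by n_frames consecutive frame periods, in milliseconds."""
    return n_frames * 1000.0 / fps


full_window_ms = span_ms(WINDOW_FRAMES)   # whole 12-frame window
recent_half_ms = span_ms(6)               # frames 6..11, the high-saliency half
```

Under these assumptions the full window covers 400 ms and the six most recent frames cover 200 ms, which is consistent with the "most recent ~200 ms" reading of the saliency chart.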
Robustness to glove presence
The training dataset was recorded with users wearing boxing gloves. A natural concern is whether the model generalises to users without gloves, since a deployed system cannot assume every user will be gloved. The two clips below run live inference on the deployed model with and without gloves to test this directly.
Because the pipeline is built on pose estimation rather than colour-based glove tracking, inference still produces reasonable predictions on a bare-handed user despite that configuration never appearing in the training set. Accuracy does drop slightly without gloves (the glove's larger visible footprint evidently gave the model a useful cue during training), but the degradation is graceful rather than catastrophic. This validates the choice to move away from the HSV glove-tracking approach used at the CDE AI Fair: the pose-based model is robust to the thing that prototype was most brittle to. Expanding the dataset with glove-less footage would likely close most of the remaining gap.
Height-alignment dependency
Training data was collected with users filling the top half of the frame, so the two clips below compare live inference inside and outside that trained framing.
Accuracy drops sharply when the user is outside the trained framing (see 5.3.4.8). Correct camera height is therefore not a comfort feature but a precondition for the CV model to operate inside its trained distribution, which reinforces the role of the adjustable height mechanism in 5.2. Real boxing involves opponents of different heights, so robustness to framing variation is carried forward as a limitation (5.3.4.8) and as future work (Section 6.3).
5.3.4.2 CDE AI Fair, Real-World Deployment
As introduced in the GUI testing section (5.1), BoxBunny was deployed at the "Robotics Meets AI Showcase" in late January 2025 for public interaction. From a Robot Intelligence perspective, the key context is that the Transformer-based action prediction model was not yet ready. Instead, a simpler HSV (Hue-Saturation-Value, a colour space that separates colour from brightness) colour-based glove tracking system was implemented, requiring users to wear red and green gloves. The prototype ran 9 ROS 2 packages with 3 drill modes.
Procedure
This was an informal field deployment with public visitors, not a controlled study. The setup is summarised in the five blocks below:
Findings
Despite being a simpler prototype, the live deployment exposed real-world challenges that directly shaped the final system. Each issue below caused a visible failure in front of public users at the fair, and each one drove a concrete change to the final architecture.
| ✕ What failed at the fair | ✓ What we built instead |
|---|---|
| 1. Background clutter: People walking behind the user triggered false punch detections. | 1. Nearest-person tracking: The closest and largest person in frame is now prioritised. |
| 2. Colour tracking failure: HSV glove tracking confused red/green gloves with similarly coloured background objects. | 2. Transformer action model: Colour tracking replaced entirely with the deep model in 5.3.1. |
| 3. Variable punching stances: Fixed velocity thresholds for reaction time were unreliable across users with different stances. | 3. Depth-based proximity: Reaction time now measured by hand depth crossing a calibrated threshold, stance-independent. |
| 4. LLM cold start: First coaching prompt of every session paused several seconds while the model loaded. | 4. Lazy load at GUI startup: The model is preloaded at GUI startup so the first prompt is already warm. |
| 5. Touchscreen input friction: Typing on the 7″ touchscreen with boxing gloves was practically impossible. | 5. Phone dashboard: The phone dashboard (5.3.3) provides a natural mobile interface for chat and configuration. |
| 6. Camera resource contention: Multiple CV consumers each opening their own camera handle caused USB contention and dropped frames. | 6. Single camera owner: A single cv_node owns the camera and publishes to shared ROS topics. |
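As an illustration of the nearest-person rule in row 1, the sketch below picks one target from several detections. It is a hedged example, not the project's implementation: the Detection type, the max_depth_m cut-off, and the "closest first, larger box as tie-break" ordering are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Detection:
    """Hypothetical person detection: bounding-box size in pixels plus depth."""
    width_px: float
    height_px: float
    depth_m: float

    @property
    def area_px(self) -> float:
        return self.width_px * self.height_px


def select_target(detections: Sequence[Detection],
                  max_depth_m: float = 3.0) -> Optional[Detection]:
    """Return the closest person in range; larger boxes win ties on depth."""
    in_range = [d for d in detections if 0.0 < d.depth_m <= max_depth_m]
    if not in_range:
        return None
    # Sort key: smaller depth first, then larger area (hence the negation).
    return min(in_range, key=lambda d: (d.depth_m, -d.area_px))


user = Detection(width_px=320, height_px=400, depth_m=1.2)
bystander = Detection(width_px=90, height_px=200, depth_m=2.6)
far_walker = Detection(width_px=40, height_px=110, depth_m=4.5)

target = select_target([bystander, far_walker, user])
```

With these example values the user at 1.2 m is selected, the bystander is ignored because they are farther away, and the walker beyond the assumed 3 m cut-off is never considered.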
Most notably, the colour-tracking failure was the moment that committed the project to building the Transformer-based action recognition model from scratch. The single biggest design decision in 5.3 traces directly back to a few hours at this fair.
For reference, the three diagrams below show the CDE Fair prototype's internals: the simpler ROS 2 node graph (9 packages versus the final system's 10), the HSV glove-tracking CV pipeline that the Transformer model later replaced, and the state flow of the reaction drill that was used live with public visitors.
5.3.4.3 Testing Infrastructure
Testing followed the classic bottom-up pyramid: many small, fast unit tests at the base, fewer cross-module integration tests in the middle, and a small number of end-to-end system checks at the top. Each level catches a different class of bug:
Figure: Testing pyramid, 146 unit tests at the base, 28 integration tests in the middle, 12 end-to-end system sections at the top
5.3.4.4 Key Unit Tests
146 tests across 3 files with 12 shared fixtures. Key categories below (full breakdown in Appendix 7):
| Test File | Tests | What It Covers |
|---|---|---|
| Sensor Fusion (test_punch_fusion.py) | 44 | Ring buffer (7), pad-constraint filtering (10), reclassification (8), defence classification (8), session stats (4), CV+IMU matching (3), event expiry (4) |
| Gamification (test_gamification.py) | 30 | Rank system (10), session XP (7), score calculation (5), achievements (5), leaderboard (3) |
| Database (test_database.py) | 72 | User management (13), pattern lock (5), guest sessions (5), presets (7), training sessions (8), gamification persistence (7), plus shared fixture and edge-case tests |
python3 -m pytest tests/ -v
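To give a flavour of what a test in the ring-buffer category might look like, here is a minimal pytest-style sketch. RingBuffer and both test functions are hypothetical stand-ins written for this example; they are not copied from test_punch_fusion.py.

```python
from collections import deque


class RingBuffer:
    """Hypothetical fixed-size buffer that keeps only the newest items."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) silently drops the oldest item once full.
        self._items = deque(maxlen=capacity)

    def push(self, item):
        self._items.append(item)

    def items(self):
        """Oldest-to-newest snapshot of the buffer contents."""
        return list(self._items)

    def __len__(self):
        return len(self._items)


def test_overwrites_oldest_when_full():
    buf = RingBuffer(capacity=3)
    for value in range(5):
        buf.push(value)
    assert len(buf) == 3
    assert buf.items() == [2, 3, 4]


def test_rejects_non_positive_capacity():
    try:
        RingBuffer(capacity=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for capacity=0")
```

Because the functions follow the test_ naming convention and use plain assert statements, pytest would collect them without any extra configuration.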
5.3.4.5 Integration Testing
28 cross-module tests verify that components work correctly together:
| What It Tests | Why It Matters |
|---|---|
| Config loading & YAML validation | All config files parse correctly at startup |
| Pad-constraint mapping | Each punch type → correct pad verified end-to-end |
| CV + IMU fusion | Temporal matching and reclassification work across modules |
| 21 ROS message fields | Every custom message type has correct fields and types |
| Motor protocol | Punch codes 1–6 map correctly to motor commands |
| Reaction time logic | Detection pipeline triggers correctly on depth-based approach |
python3 notebooks/scripts/test_integration.py
CV + IMU fusion validation
The clip below runs the full sensor-fusion pipeline from 5.3.2.1 live, with the Teensy simulator in hardware mode: the pads shown on-screen mirror real IMU impacts on the physical pads, not synthetic events. Screen layout:
- Right: Teensy GUI, pads light up from live IMU hits.
- Bottom-left: raw CV predictions, before fusion.
- Top-left: fusion-filtered punch stream, the output.
A confirmed punch only emerges when a CV prediction and an IMU impact agree, which is exactly what the fusion rules in 5.3.2.1 specify.
This run is easy to quantify because the input is scripted: a fixed number of punches is thrown, and the fused output at the end of the clip shows exactly that count with no spurious additions. The IMU stream registers only the intended strikes on the pads, and post-impact pad oscillation is correctly rejected as noise rather than counted as extra hits, so fusion yields 100% agreement between the scripted input and the confirmed punch output for this run. The scope is a scripted bench test rather than a user study, so this is evidence of correct end-to-end behaviour rather than a precision/recall figure across a diverse user sample (see 5.3.4.8).
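The agreement rule described above can be sketched as a small temporal matcher. This is an illustrative simplification under stated assumptions, not the fusion code from 5.3.2.1: the ALLOWED_PADS mapping, the 0.25 s window, and the rule that each CV prediction confirms at most one impact are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvPrediction:
    label: str   # predicted action, e.g. "jab"
    t: float     # timestamp in seconds


@dataclass(frozen=True)
class ImuImpact:
    pad: str     # pad that registered the hit
    t: float     # timestamp in seconds


# Illustrative pad constraints only; the real mapping is not given in the text.
ALLOWED_PADS = {
    "jab": {"centre", "head"},
    "left_hook": {"left", "head"},
    "right_hook": {"right", "head"},
}


def fuse(cv_preds, impacts, window_s=0.25):
    """Confirm a punch only when a CV label and an IMU impact agree in time and pad."""
    unused = sorted(cv_preds, key=lambda p: p.t)
    confirmed = []
    for hit in sorted(impacts, key=lambda h: h.t):
        matches = [
            p for p in unused
            if abs(p.t - hit.t) <= window_s
            and hit.pad in ALLOWED_PADS.get(p.label, set())
        ]
        if not matches:
            continue  # impact with no agreeing CV prediction is dropped
        best = min(matches, key=lambda p: abs(p.t - hit.t))
        unused.remove(best)  # each CV prediction confirms at most one impact
        confirmed.append((best.label, hit.pad, hit.t))
    return confirmed


cv_stream = [CvPrediction("jab", 1.00), CvPrediction("left_hook", 2.00),
             CvPrediction("right_hook", 3.00)]
imu_stream = [ImuImpact("centre", 1.05), ImuImpact("centre", 1.12),
              ImuImpact("left", 2.10), ImuImpact("right", 5.00)]

confirmed_punches = fuse(cv_stream, imu_stream)
```

In this example two punches are confirmed. The second centre-pad trigger at 1.12 s (standing in for pad oscillation) finds no unused CV prediction, the right hook at 3.00 s has no impact nearby, and the impact at 5.00 s has no CV prediction within the window, so none of these produce a confirmed punch.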
Reaction-time drill validation
The reaction-time drill needs to know when the user has actually committed to a punch, not just shifted their weight or moved their guard. Rather than building a separate detection pipeline for this, the same pose-estimation stream that drives the action-prediction model is reused to derive reaction time, keeping the system to a single perception pipeline. The clip below shows this in action and also serves as a sensitivity test: genuine punches trigger the drill, but small body movements, guard adjustments, and deliberate no-moves do not, confirming the threshold is tuned well enough to separate committed punch intent from incidental motion.
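A depth-threshold trigger of the kind described here could look like the sketch below. It is a minimal example under assumptions: the function name, the (timestamp, hand depth) sample format, and the 0.25 m trigger distance are hypothetical, and the real threshold is calibrated rather than fixed.

```python
def reaction_time_s(samples, cue_t, baseline_depth_m, trigger_delta_m=0.25):
    """Return seconds from the cue until the hand crosses the depth threshold.

    samples: iterable of (timestamp_s, hand_depth_m) in time order, where a
    smaller depth means the hand is closer to the camera.
    Returns None if the hand never gets close enough after the cue.
    """
    threshold_m = baseline_depth_m - trigger_delta_m
    for t, depth_m in samples:
        if t >= cue_t and depth_m <= threshold_m:
            return t - cue_t
    return None


# Committed punch: the hand moves from 1.00 m to 0.60 m after the cue at 0.1 s.
punch = [(0.0, 1.00), (0.1, 0.98), (0.2, 0.95), (0.3, 0.80), (0.4, 0.60)]
# Guard adjustment: the hand shifts a few centimetres but never commits.
guard_shift = [(0.0, 1.00), (0.1, 0.97), (0.2, 0.93), (0.3, 0.95), (0.4, 0.99)]

punch_rt = reaction_time_s(punch, cue_t=0.1, baseline_depth_m=1.0)
guard_rt = reaction_time_s(guard_shift, cue_t=0.1, baseline_depth_m=1.0)
```

With these example values the punch yields a reaction time of about 0.3 s, while the guard adjustment returns None because the hand never crosses the assumed threshold.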
Person-tracking with yaw rotation
CV-based person tracking is integrated with the lower-mechanism yaw motor so that the robot can turn to face the user as they move around it. The clip below demonstrates the full loop: pose detection drives a yaw target, and the robot rotates smoothly to face the detected user. The rotation speed is deliberately tuned on the slow side: fast enough that a user pivoting around the robot to find a new angle is still faced by the robot naturally, but slow enough to remain safe for a user standing close in a boxing stance. This matches the safety stance taken in the lower-mechanism design (5.2).
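One common way to get deliberately slow, smooth rotation is a rate-limited step towards the target with a small deadband. The sketch below shows that idea only; the function name, the 30 deg/s limit, and the 2 degree deadband are assumptions for illustration and are not the tuned values used on the robot.

```python
def yaw_step(current_deg, target_deg, dt_s,
             max_rate_deg_s=30.0, deadband_deg=2.0):
    """Move the yaw angle towards the target, limited to max_rate_deg_s.

    Errors inside the deadband are ignored so the robot does not jitter
    when the user is already roughly centred.
    """
    error = target_deg - current_deg
    if abs(error) <= deadband_deg:
        return current_deg
    max_step = max_rate_deg_s * dt_s
    # Clamp the step so a large error never produces a fast swing.
    step = max(-max_step, min(max_step, error))
    return current_deg + step


# A user pivots 45 degrees to the robot's left; the loop runs every 0.5 s.
yaw = 0.0
trace = []
for _ in range(4):
    yaw = yaw_step(yaw, target_deg=45.0, dt_s=0.5)
    trace.append(yaw)
```

With these assumed values the robot advances 15 degrees per update and reaches the 45 degree target on the third step, after which the deadband holds it still.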
5.3.4.6 System-Level Testing
Teensy Simulator
A GUI-based simulator emulates the Teensy 4.0 with 4 IMU sensors (4 pads), enabling full system testing without physical hardware. It supports manual mode (click to inject pad impacts with configurable force) and auto mode (predefined punch sequences for reproducible testing).
ros2 launch boxbunny_core boxbunny_dev.launch.py
The two screenshots below show two different things. On the left is the Teensy simulator window itself: click any pad or arm at any force level and the system reacts as if a real IMU registered the strike. On the right is the touchscreen GUI's preset overlay, which a user opens by hitting the head pad and navigates with left/right pad hits to browse and start a saved training preset hands-free. The video below the screenshots shows the head-pad open, left/right scroll, and centre-pad confirm flow in action.
Hardware detection and dual-mode operation
A key design challenge was preventing the simulator from interfering with real hardware. Without awareness of whether the physical Teensy is connected, the simulator would publish its own IMU events alongside the real ones, causing double-counted punches, conflicting motor commands, and session state confusion.
The simulator solves this with a _teensy_connected flag. It subscribes to the real hardware topics (/robot/strike_detected, motor_feedback) and listens for incoming messages. When real messages are detected, the flag is set and the simulator stops publishing its own data, switching to a passive monitoring mode that allows it to run alongside real hardware for monitoring without interference. If no real messages arrive for 3 seconds, it reverts to active simulation.
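This gives the simulator two distinct roles: active simulation when no physical Teensy is present, and passive monitoring when it runs alongside real hardware.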
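The flag-and-timeout behaviour described above can be sketched independently of ROS as a small presence monitor. This is an assumption-laden illustration rather than the simulator's code: the class and method names are hypothetical, and the clock is injectable only so the 3-second timeout can be exercised without waiting.

```python
import time


class HardwarePresenceMonitor:
    """Tracks whether real hardware messages have been seen recently."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_rx = None  # time of the most recent real hardware message

    def on_hardware_message(self):
        """Call from the subscriber callbacks for the real hardware topics."""
        self._last_rx = self._clock()

    @property
    def teensy_connected(self):
        if self._last_rx is None:
            return False
        return (self._clock() - self._last_rx) < self._timeout_s

    def should_publish_simulated(self):
        """Active simulation only while no real hardware is being heard."""
        return not self.teensy_connected


# Drive the monitor with a fake clock to show the mode changes.
now = [0.0]
monitor = HardwarePresenceMonitor(timeout_s=3.0, clock=lambda: now[0])

before_any_message = monitor.should_publish_simulated()   # active simulation
monitor.on_hardware_message()
right_after_message = monitor.should_publish_simulated()  # passive monitoring
now[0] = 2.9
still_within_timeout = monitor.should_publish_simulated()
now[0] = 3.0
after_timeout = monitor.should_publish_simulated()        # back to active
```

In a ROS 2 node, on_hardware_message would be called from the subscription callbacks and should_publish_simulated would gate every simulated publish, which prevents the double-counting described earlier in this section.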
Full robot system validation
The final level of testing ran the complete system on the physical robot using boxbunny_full.launch.py with the real camera, 4 IMU pads, and motors. This is the only test tier that verifies end-to-end behaviour that cannot be tested in simulation alone.
During these sessions, the Teensy simulator ran in passive mode alongside the real hardware, monitoring the commands flowing through the system without interference. When the robot threw a jab, the corresponding strike command was visible in the ROS topic stream; when the yaw motor tracked the user, the yaw commands matched the user's position. Any mismatch between the robot's physical behaviour and the commands in the pipeline would indicate a routing issue.
Note: While the full robot test verified that all subsystems work together correctly, further rigorous user testing is needed to uncover edge-case bugs and fine-tune the experience. Response timings, audio levels, touchscreen usability between rounds, and coaching tip relevance across skill levels are all areas where extended real-user sessions would surface improvements that bench testing alone cannot catch. These are discussed further in Section 6.
Master Notebook (boxbunny_runner.ipynb)
A 12-section Jupyter notebook serving as the primary operational testing orchestrator:
| # | Section | What It Does |
|---|---|---|
| 1 | Build & Setup | colcon build, dependency check |
| 2 | Unit Tests | Run full 146-test pytest suite |
| 3 | System Check | Hardware verification (camera, CUDA, models, database) |
| 4 | Launch System | Start all ROS 2 nodes + Teensy simulator + GUI |
| 5 | Stop System | Graceful shutdown |
| 6 | GUI Test | Visual inspection of all 24 pages |
| 7 | Phone Dashboard | Start server + public tunnel + QR code |
| 8 | CV Model Live Test | Camera feed with pose skeleton + action label overlay |
| 9 | LLM Coach Test | Interactive AI coach chat GUI |
| 10 | Build Vue Frontend | Rebuild SPA after changes |
| 11 | Sound Test | Play all 18 audio effects |
| 12 | Demo Profiles | User cards + percentile rankings |
Launch Configurations
| Command | Mode | When to Use |
|---|---|---|
| boxbunny_dev.launch.py | Development | Uses Teensy simulator, no physical hardware needed |
| boxbunny_full.launch.py | Production | Full system with real camera, IMU, and motors |
| headless.launch.py | Headless | Processing nodes only, no GUI (for automated testing) |
5.3.4.7 Verification Summary
| Req | Criterion | Test Method | Result | Status |
|---|---|---|---|---|
| RI-1 | 8 actions at ≥30 FPS | TensorRT FP16 benchmark on Jetson | ~24ms/frame = 42fps theoretical, 30fps sustained | Pass |
| RI-2 | Latency ≤150ms | End-to-end measurement (camera → motor) | ~120ms (parallel CV + ROS pipeline) | Pass |
| RI-3 | Accuracy ≥90% unseen person | Holdout validation (221 samples, unseen person) | 96.8% (v11 deployed). v10 scored 97.3% on validation but v11 was deployed as it included the block annotation fix and performed better in live testing | Pass |
| RI-4 | CV+IMU fusion | 44 fusion unit tests + 28 integration tests + live sparring | All tests pass, >85% IMU confirmation rate | Pass |
| RI-5 | On-device inference | Jetson deployment (CV + LLM simultaneous) | All models fit in ~2.5GB of 16GB shared memory | Pass |
| RI-6 | Multiple AI styles | 5 Markov styles tested across 3 difficulty levels | All 5 styles functional with weakness tracking | Pass |
| RI-7 | Local LLM coaching | Live session tip generation + fallback test | Tips every ~18s, 65 fallback tips operational | Pass |
Table: Requirements verification summary, all 7 requirements met
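The numeric rows of the table can be sanity-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the table; the variable names are illustrative, and the headroom values are derived here rather than reported by the source.

```python
def fps_from_frame_time(frame_ms):
    """Convert a per-frame processing time in milliseconds to frames per second."""
    return 1000.0 / frame_ms


# RI-1: ~24 ms per frame on the TensorRT FP16 benchmark.
theoretical_fps = fps_from_frame_time(24.0)   # about 41.7, quoted as 42
ri1_pass = theoretical_fps >= 30.0

# RI-2: ~120 ms end-to-end against a 150 ms budget.
latency_headroom_ms = 150.0 - 120.0
ri2_pass = latency_headroom_ms >= 0.0

# RI-3: 96.8% accuracy against a 90% criterion.
accuracy_margin_pts = 96.8 - 90.0
ri3_pass = accuracy_margin_pts >= 0.0

# RI-5: ~2.5 GB used out of 16 GB shared memory.
memory_fraction = 2.5 / 16.0                  # roughly 16% of shared memory
```

The 42 fps figure in the table is the rounded value of 1000 / 24, and the latency result leaves roughly 30 ms of headroom under the RI-2 limit.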
5.3.4.8 Limitations
All requirements were met, but the limitations below scope out the natural next direction for the system. Each is carried forward into the project-wide discussion in Section 6.2, with the corresponding future-work items in Section 6.3.
| Limitation | Impact | Mitigation / Future Work |
|---|---|---|
| CV model training scope is narrow (8 classes, no retraction) | The action prediction model only labels 6 punch types, block, and idle, and its annotations stop at full extension. This caps what can be recognised (no slips, no head-vs-body targets, no footwork) and forces the IMU + pad-constraint filter in 5.3.2.1 to absorb retraction-phase wobble rather than the CV model being correct end-to-end | Add an explicit retraction class and extend the action set with slips, head/body targets, and defensive footwork. Richer labels feed every downstream module (sparring AI, analytics, LLM Coach) |
| Pose estimator not trained for boxing | YOLO Pose is pretrained on COCO (bare hands) and has no concept of gloved hands or this camera angle, so wrist keypoints become unreliable at punch extension. The voxel branch currently compensates for this | Fine-tune a custom pose estimator on our own gloved-hand footage. Scoped out of the current prototype because it requires a large volume of recorded footage and frame-by-frame keypoint annotation; natural next investment once the 8-class action set is extended |
| Camera field of view limits full-body analysis | The RealSense D435i's mounted field of view only captures the upper body. Footwork, stance width, and weight transfer, all critical coaching signals, are invisible to the CV pipeline | Move to a wider-angle depth camera (or adjusted mount angle) to capture the full body while keeping the single-camera portability of the current design |
| CV accuracy depends on user framing | Accuracy drops sharply when the user is not height-aligned to the camera (see 5.3.4.1). The height mechanism must be manually re-aligned per user, and sparring against opponents of different heights is not possible without re-aligning between rounds | Diversify training footage across camera pitches and mount offsets. Longer-term, estimate user height from CV and drive the height mechanism automatically, removing the manual alignment step |
Validation gap (not a system limitation). Separately from the table above, the integrated-system behaviours validated observationally in 5.3.4.5 (fusion, yaw tracking, reaction-time drill) and the LLM Coach's post-session analysis have not yet been tested against a diverse user sample under controlled conditions. Coaching-tip relevance, fusion precision, tracking angular error, and reaction-time false-positive rates therefore remain qualitatively rather than quantitatively characterised. The proposed structured pilot to close this gap is in Section 6.3.