5.3.4 Testing & Validation
The Robot Intelligence subsystem was validated through model performance and interpretability, early prototype deployment at the CDE AI Fair, comprehensive automated testing, and final system validation on the full robot with real hardware.
V-Model alignment. Following the Systems Engineering V-Model introduced in Section 3.2, this page corresponds to the right side of the V, where each Robot Intelligence requirement (RI-1 through RI-8) is closed against a concrete verification activity. Unit and integration tests verify the lower levels of the V (individual components and cross-module communication), the CDE AI Fair served as an integration-level test that surfaced real-world issues, and the full robot test with real camera, IMU, and motors closes the system-level verification. Section 5.3.4.7 Verification Summary ties every requirement to its result.
5.3.4.1 Model Validation & Interpretability
This section presents how well the final model performs and provides visual evidence of what the model has learned internally. The goal is to verify that the model is making decisions for the right reasons, not just memorising patterns.
Overall Performance
Figure: headline metric cards showing overall accuracy on an unseen person, class-balanced macro F1, and the result on the validation set
Macro F1 averages F1 across all classes equally regardless of class size, which is a fairer measure than raw accuracy when classes are imbalanced. Model v10 reached 97.3% on the same split but had poorer live block detection, so v11 was selected for deployment after the block annotation fix.
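As a concrete illustration of why macro F1 is the fairer headline number, the sketch below computes both metrics on a toy imbalanced split. The labels are illustrative, not the project's data:

```python
def macro_f1(y_true, y_pred):
    """Average per-class F1, weighting every class equally regardless of size."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy imbalanced split: 9 'idle' samples plus 1 'block' the model misses entirely.
y_true = ["idle"] * 9 + ["block"]
y_pred = ["idle"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")                   # 0.90 looks healthy...
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")   # ...but 0.47 exposes the dead 'block' class
```

Raw accuracy hides the completely missed minority class; macro F1 halves, which is exactly the sensitivity needed when one class (here, block) is both rare and weak.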
Classification Results
The confusion matrix (left) shows where the model gets confused: each cell counts how often a true class (row) was predicted as another (column). The strong diagonal means the model is right most of the time, and the few off-diagonal cells cluster between biomechanically similar actions: left hook confused with left uppercut (both swing the left arm), and block confused with idle (both involve minimal movement).
The per-class accuracy chart (right) shows the hit rate per punch type. Most classes exceed 96%, with right hook and right uppercut at a perfect 100%. The notable exception is the block class, which is addressed in the callout below.
Problem: block is the weakest class at 66.7%. The start of a block (arms rising from guard) looks nearly identical to the start of an uppercut (arms rising into the punch) for the first few frames.
Insight: the model only ever saw the onset of a block during training, so it had no way to disambiguate it from an uppercut without seeing the rest of the action.
Fix & live impact: the block annotation fix in 5.3.1 extended annotations to cover the full guard sequence (rise, hold, return). The validation headline number stayed in the same range, but live block reliability improved significantly; the gain came entirely from the annotation, not the model.
Training Convergence
The dashboard below tracks how the model improved during training. The loss (top-left) measures how wrong the model's predictions are; lower is better. The accuracy (top-right) shows the percentage of correct predictions on both training data (blue) and unseen validation data (pink). The gap between the two lines indicates overfitting; here the gap stays narrow, confirming the model generalises well rather than memorising training examples. The model reached its best performance at epoch 73 out of 133.
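Picking the epoch-73 checkpoint and watching the train/validation gap follow a standard best-checkpoint pattern; a minimal sketch on synthetic accuracy curves (illustrative values, not the real training log):

```python
def best_epoch(val_acc):
    """Return (epoch, accuracy) of the best-validation checkpoint (1-indexed)."""
    best = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return best + 1, val_acc[best]

def max_gap(train_acc, val_acc):
    """Largest train-minus-validation accuracy gap: a quick overfitting signal."""
    return max(t - v for t, v in zip(train_acc, val_acc))

# Synthetic curves for illustration only
train = [0.60, 0.80, 0.90, 0.95, 0.97]
val   = [0.58, 0.78, 0.88, 0.93, 0.92]

epoch, acc = best_epoch(val)
print(epoch, acc, round(max_gap(train, val), 2))  # 4 0.93 0.05
```

A narrow maximum gap, as observed in the real training run, is what licenses the claim that the model generalises rather than memorises.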
What the Model Learned (t-SNE Visualisation)
The model's internal 384-dimensional representations are compressed to 2D using t-SNE (a dimensionality-reduction method that preserves local similarity). Each dot is a punch sample coloured by its true class; if the model has learned meaningful features, same-class dots should cluster together and different classes should separate.
The plot below confirms exactly that: tight, well-separated clusters for each action class. The only clusters that partially overlap are left hook / left uppercut: the same pair that confuses the model in the confusion matrix above. The two visualisations agree.
Which Frames Matter Most
Gradient saliency measures how much the model's output changes with respect to each input, revealing which frames in the 12-frame window the model relied on most. The chart below shows a clear pattern: later frames are significantly more important than early ones. The most recent frame (index 11) scores highest, with frames 6–9 close behind, while the first few frames (indices 1–4) contribute the least.
What this chart actually says is that the model makes its decisions based on the most recent ~200 ms of motion rather than the deeper history. That is exactly what action prediction needs: the model commits to a punch type as soon as the discriminative motion appears in the recent past, not after the punch retracts. The very early frames in the window mostly capture the preparation phase before the discriminative motion has begun, so they reasonably contribute less.
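Gradient saliency requires access to the trained network, but the same per-frame importance ranking can be illustrated with a model-agnostic finite-difference proxy: perturb one frame at a time and measure how much the output moves. A toy sketch with a hypothetical recency-biased scoring function standing in for the real Transformer:

```python
def frame_importance(window, score_fn, eps=1.0):
    """Finite-difference proxy for gradient saliency: nudge each frame
    and measure how far the model output moves."""
    base = score_fn(window)
    importances = []
    for i in range(len(window)):
        perturbed = list(window)
        perturbed[i] = window[i] + eps
        importances.append(abs(score_fn(perturbed) - base))
    return importances

# Hypothetical recency-biased scorer: later frames carry larger weights,
# mimicking the saliency pattern described in the text above.
def toy_score(window):
    return sum(weight * x for weight, x in zip(range(1, len(window) + 1), window))

window = [0.0] * 12  # one scalar feature per frame, 12-frame window
importances = frame_importance(window, toy_score)
print(importances)  # importance rises monotonically from frame 0 to frame 11
```

For the real model the perturbation would be applied to the pose/voxel features of each frame, and the output change measured on the predicted-class logit.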
5.3.4.2 CDE AI Fair, Real-World Deployment
As introduced in the GUI testing section (5.1), BoxBunny was deployed at the "Robotics Meets AI Showcase" in late January 2025 for public interaction. From a Robot Intelligence perspective, the key context is that the Transformer-based action prediction model was not yet ready; instead, a simpler HSV (Hue-Saturation-Value, a colour space that separates colour from brightness) colour-based glove tracking system was implemented, requiring users to wear red and green gloves. The prototype ran 9 ROS 2 packages with 3 drill modes.
Procedure
An informal field deployment with public visitors, not a controlled study. The setup is summarised in the five blocks below:
Findings
Despite being a simpler prototype, the live deployment exposed real-world challenges that directly shaped the final system. Each issue below caused a visible failure in front of public users at the fair, and each one drove a concrete change to the final architecture.
| ✕ What failed at the fair | ✓ What we built instead |
|---|---|
| **Background clutter.** People walking behind the user triggered false punch detections. | **Nearest-person tracking.** The closest and largest person in frame is now prioritised. |
| **Colour tracking failure.** HSV glove tracking confused red/green gloves with similarly coloured background objects. | **Transformer action model.** Colour tracking replaced entirely with the deep model in 5.3.1. |
| **Variable punching stances.** Fixed velocity thresholds for reaction time were unreliable across users with different stances. | **Depth-based proximity.** Reaction time is now measured by hand depth crossing a calibrated threshold, stance-independent. |
| **LLM cold start.** The first coaching prompt of every session paused for several seconds while the model loaded. | **Preload at GUI startup.** The model is preloaded at GUI startup so the first prompt is already warm. |
| **Touchscreen input friction.** Typing on the 7″ touchscreen with boxing gloves was practically impossible. | **Phone dashboard.** The phone dashboard (5.3.3) provides a natural mobile interface for chat and configuration. |
| **Camera resource contention.** Multiple CV consumers each opening their own camera handle caused USB contention and dropped frames. | **Single camera owner.** A single `cv_node` owns the camera and publishes to shared ROS topics. |
Most notably, the colour-tracking failure was the moment that committed the project to building the Transformer-based action recognition model from scratch; the single biggest design decision in 5.3 traces directly back to a few hours at this fair.
For reference, the three diagrams below show the CDE Fair prototype's internals: the simpler ROS 2 node graph (9 packages versus the final system's 10), the HSV glove-tracking CV pipeline that the Transformer model later replaced, and the state flow of the reaction drill that was used live with public visitors.
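The depth-based proximity fix adopted after the fair reduces reaction-time measurement to a threshold-crossing scan over per-frame hand depths. A minimal sketch with hypothetical depth values and threshold (not the calibrated ones used on the robot):

```python
def reaction_time_ms(depths_m, timestamps_ms, cue_ms, threshold_m):
    """Reaction time = first moment at or after the cue when the hand's depth
    drops below the calibrated proximity threshold (hand approaching the pad)."""
    for depth, t in zip(depths_m, timestamps_ms):
        if t >= cue_ms and depth < threshold_m:
            return t - cue_ms
    return None  # no strike detected in this window

# Hypothetical ~30 fps depth track: the hand starts ~1.2 m away and crosses
# a hypothetical 0.6 m threshold a few frames after the cue at t = 66 ms.
timestamps = [i * 33 for i in range(12)]
depths = [1.20, 1.20, 1.19, 1.10, 0.95, 0.80, 0.65, 0.55, 0.50, 0.50, 0.55, 0.70]
print(reaction_time_ms(depths, timestamps, cue_ms=66, threshold_m=0.6))  # 165
```

Because the measurement depends only on depth relative to a calibrated threshold, it is unaffected by how fast or from what stance the user punches, which is exactly what the fixed velocity thresholds at the fair failed to handle.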
5.3.4.3 Testing Infrastructure
Testing followed the classic bottom-up pyramid: many small, fast unit tests at the base, fewer cross-module integration tests in the middle, and a small number of end-to-end system checks at the top. Each level catches a different class of bug:
Figure: Testing pyramid with 146 unit tests at the base, 28 integration tests in the middle, and 12 end-to-end system checks at the top
5.3.4.4 Key Unit Tests
146 tests across 3 files with 12 shared fixtures. Key categories below (full breakdown in Appendix 7):
| Test File | Tests | What It Covers |
|---|---|---|
| Sensor Fusion (`test_punch_fusion.py`) | 44 | Ring buffer (7), pad-constraint filtering (10), reclassification (8), defence classification (8), session stats (4), CV+IMU matching (3), event expiry (4) |
| Gamification (`test_gamification.py`) | 30 | Rank system (10), session XP (7), score calculation (5), achievements (5), leaderboard (3) |
| Database (`test_database.py`) | 72 | User management (13), pattern lock (5), guest sessions (5), presets (7), training sessions (8), gamification persistence (7), plus shared fixture and edge-case tests |
```bash
python3 -m pytest tests/ -v
```
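The ring-buffer tests in `test_punch_fusion.py` cover bounded event storage; the sketch below shows the pattern such a test follows, using a hypothetical `RingBuffer` stand-in (the project's actual class and API may differ):

```python
from collections import deque

class RingBuffer:
    """Fixed-capacity event store: oldest events are evicted when full.
    Hypothetical stand-in for the fusion node's buffer."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)
    def push(self, event):
        self._buf.append(event)
    def __len__(self):
        return len(self._buf)
    def oldest(self):
        return self._buf[0]
    def newest(self):
        return self._buf[-1]

def test_ring_buffer_evicts_oldest():
    buf = RingBuffer(capacity=3)
    for event in ["jab", "cross", "hook", "uppercut"]:
        buf.push(event)
    assert len(buf) == 3
    assert buf.oldest() == "cross"   # "jab" was evicted
    assert buf.newest() == "uppercut"

test_ring_buffer_evicts_oldest()
print("ok")
```

Under pytest, the explicit call at the bottom is unnecessary; the runner discovers and executes any `test_*` function automatically.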
5.3.4.5 Integration Testing
28 cross-module tests verify that components work correctly together:
| What It Tests | Why It Matters |
|---|---|
| Config loading & YAML validation | All config files parse correctly at startup |
| Pad-constraint mapping | Each punch type → correct pad verified end-to-end |
| CV + IMU fusion | Temporal matching and reclassification work across modules |
| 21 ROS message fields | Every custom message type has correct fields and types |
| Motor protocol | Punch codes 1–6 map correctly to motor commands |
| Reaction time logic | Detection pipeline triggers correctly on depth-based approach |
```bash
python3 notebooks/scripts/test_integration.py
```
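The pad-constraint and motor-protocol rows both boil down to asserting an exhaustive, duplicate-free mapping from punch codes to commands. A minimal sketch with hypothetical values (the real mapping lives in the project's config):

```python
# Hypothetical punch-code -> motor-command table; treat these names as
# placeholders, not the project's real protocol values.
PUNCH_TO_MOTOR = {1: "JAB", 2: "CROSS", 3: "L_HOOK", 4: "R_HOOK", 5: "L_UPPER", 6: "R_UPPER"}

def validate_mapping(mapping, expected_codes):
    """Every expected code present exactly once, and no two codes share a command."""
    assert set(mapping) == set(expected_codes), "missing or extra punch codes"
    assert len(set(mapping.values())) == len(mapping), "duplicate motor commands"

validate_mapping(PUNCH_TO_MOTOR, range(1, 7))
print("mapping ok")
```

Checking the mapping exhaustively at test time, rather than trusting individual lookups, is what lets the "Punch codes 1–6 map correctly" row in the table above be verified in one place.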
5.3.4.6 System-Level Testing
Teensy Simulator
A GUI-based simulator emulating the Teensy 4.0 with 4 IMU sensors (4 pads), enabling full system testing without physical hardware. Supports manual mode (click to inject pad impacts with configurable force) and auto mode (predefined punch sequences for reproducible testing).
```bash
ros2 launch boxbunny_core boxbunny_dev.launch.py
```
The two screenshots below show two different things. On the left is the Teensy simulator window itself: click any pad or arm at any force level and the system reacts as if a real IMU registered the strike. On the right is the touchscreen GUI's preset overlay, which a user opens by hitting the head pad and navigates with left/right pad hits to browse and start a saved training preset hands-free. The video below the screenshots shows the head-pad open, left/right scroll, and centre-pad confirm flow in action.
Hardware detection and dual-mode operation
A key design challenge was preventing the simulator from interfering with real hardware. Without awareness of whether the physical Teensy is connected, the simulator would publish its own IMU events alongside the real ones, causing double-counted punches, conflicting motor commands, and session state confusion.
The simulator solves this with a `_teensy_connected` flag. It subscribes to the real hardware topics (`/robot/strike_detected`, `motor_feedback`) and listens for incoming messages. When real messages are detected, the flag is set and the simulator stops publishing its own data, switching to a passive monitoring mode that lets it run alongside real hardware without interference. If no real messages arrive for 3 seconds, it reverts to active simulation.
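Stripped of ROS specifics, the `_teensy_connected` logic is a watchdog timer. A simplified, hypothetical sketch of the active/passive switch (the real node's names and structure may differ):

```python
import time

class HardwareWatchdog:
    """Stay passive while real hardware messages arrive; fall back to
    active simulation after a silence timeout. Simplified, ROS-free sketch."""
    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self._timeout = timeout_s
        self._clock = clock
        self._last_real_msg = None

    def on_real_message(self):
        # Would be called from the strike/motor-feedback subscriber callbacks.
        self._last_real_msg = self._clock()

    @property
    def teensy_connected(self):
        if self._last_real_msg is None:
            return False
        return (self._clock() - self._last_real_msg) < self._timeout

# Deterministic fake clock for illustration
now = [0.0]
wd = HardwareWatchdog(timeout_s=3.0, clock=lambda: now[0])
assert not wd.teensy_connected   # no real messages yet -> active simulation
wd.on_real_message()
assert wd.teensy_connected       # real hardware detected -> passive mode
now[0] += 3.5                    # 3.5 s of silence on the hardware topics
assert not wd.teensy_connected   # reverts to active simulation
print("watchdog ok")
```

Injecting the clock as a parameter makes the timeout behaviour unit-testable without real sleeps, the same property that lets this logic be covered by the fast test tier.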
This gives the simulator two distinct roles:
Full robot system validation
The final level of testing ran the complete system on the physical robot using `boxbunny_full.launch.py` with the real camera, 4 IMU pads, and motors. This is the only test tier that verifies end-to-end behaviour that cannot be tested in simulation alone.
During these sessions, the Teensy simulator ran in passive mode alongside the real hardware, monitoring the commands flowing through the system without interference. When the robot threw a jab, the corresponding strike command was visible in the ROS topic stream; when the yaw motor tracked the user, the yaw commands matched the user's position. Any mismatch between the robot's physical behaviour and the commands in the pipeline would indicate a routing issue.
Note: while the full robot test verified that all subsystems work together correctly, further rigorous user testing is needed to uncover edge-case bugs and fine-tune the experience. Response timings, audio levels, touchscreen usability between rounds, and coaching tip relevance across skill levels are all areas where extended real-user sessions would surface improvements that bench testing alone cannot catch. These are discussed further in Section 6.
Master Notebook (boxbunny_runner.ipynb)
A 12-section Jupyter notebook serving as the primary operational testing orchestrator:
| # | Section | What It Does |
|---|---|---|
| 1 | Build & Setup | colcon build, dependency check |
| 2 | Unit Tests | Run full 146-test pytest suite |
| 3 | System Check | Hardware verification (camera, CUDA, models, database) |
| 4 | Launch System | Start all ROS 2 nodes + Teensy simulator + GUI |
| 5 | Stop System | Graceful shutdown |
| 6 | GUI Test | Visual inspection of all 24 pages |
| 7 | Phone Dashboard | Start server + public tunnel + QR code |
| 8 | CV Model Live Test | Camera feed with pose skeleton + action label overlay |
| 9 | LLM Coach Test | Interactive AI coach chat GUI |
| 10 | Build Vue Frontend | Rebuild SPA after changes |
| 11 | Sound Test | Play all 18 audio effects |
| 12 | Demo Profiles | User cards + percentile rankings |
Launch Configurations
| Command | Mode | When to Use |
|---|---|---|
| `boxbunny_dev.launch.py` | Development | Uses Teensy simulator, no physical hardware needed |
| `boxbunny_full.launch.py` | Production | Full system with real camera, IMU, and motors |
| `headless.launch.py` | Headless | Processing nodes only, no GUI (for automated testing) |
5.3.4.7 Verification Summary
| Req | Criterion | Test Method | Result | Status |
|---|---|---|---|---|
| RI-1 | 8 actions at ≥30 FPS | TensorRT FP16 benchmark on Jetson | ~24ms/frame = 42fps theoretical, 30fps sustained | Pass |
| RI-2 | Latency ≤150ms | End-to-end measurement (camera → motor) | ~120ms (parallel CV + ROS pipeline) | Pass |
| RI-3 | Accuracy ≥90% unseen person | Holdout validation (221 samples, unseen person) | 96.8% (v11 deployed) — v10 scored 97.3% on validation but v11 was deployed as it included the block annotation fix and performed better in live testing | Pass |
| RI-4 | CV+IMU fusion | 44 fusion unit tests + 28 integration tests + live sparring | All tests pass, >85% IMU confirmation rate | Pass |
| RI-5 | On-device inference | Jetson deployment (CV + LLM simultaneous) | All models fit in ~2.5GB of 16GB shared memory | Pass |
| RI-6 | Multiple AI styles | 5 Markov styles tested across 3 difficulty levels | All 5 styles functional with weakness tracking | Pass |
| RI-7 | Local LLM coaching | Live session tip generation + fallback test | Tips every ~18s, 65 fallback tips operational | Pass |
| RI-8 | Person detection & user tracking | Teensy simulator robot status row (5.3.4.6); yaw and height commands verified during live sessions | Yaw tracks user laterally, height data available for user-controlled adjustment via phone or GUI | Pass |
Table: Requirements verification summary, all 8 requirements met
5.3.4.8 Limitations
All requirements were met, but several limitations remain that mark out the natural next direction for the system. The rows highlighted below are the principal limitations: the ones that most shape what the system can and cannot do today, and the ones carried forward into the overall project discussion in Section 6.2.
| Limitation | Impact | Mitigation / Future Work |
|---|---|---|
| Annotation excludes retraction | Punch annotations end at full extension, so the model wobbles during the retraction phase; the IMU + pad-constraint filter in 5.3.2.1 currently absorbs these misclassifications | Add an explicit retraction class so predictions stay correct end-to-end and the IMU filter becomes redundant rather than load-bearing |
| CV model has a limited action vocabulary | The action prediction model only classifies 8 actions (jab, cross, left/right hook, left/right uppercut, block, idle): no slips, no head-vs-torso target distinction, no defensive footwork. Downstream coaching insight is bounded by what these 8 labels can express | Expand the CV model's class set with slips, head/body target variants, and defensive movements; richer labels feed every other part of the system (sparring AI, analytics, LLM coach) and unlock more dynamic feedback |
| Pose estimator not trained for boxing | YOLO Pose is pretrained on COCO, where hands and wrists are bare and clearly visible. It has no concept of what a gloved hand looks like and no training on this camera angle, so wrist keypoints are unreliable exactly when they matter most. The voxel branch currently compensates for this | Fine-tune a custom pose estimator on our own recorded boxing footage with manually annotated keypoints: gloved hands, this camera viewpoint, our pose set. A pose model actually trained for this context would lift the load off the voxel branch and improve accuracy further |
| Camera field of view | The RealSense D435i's field of view at the robot-mounted distance only captures the upper body. When the user steps in or extends a punch, hands can leave the frame, and the lower body (footwork, stance width, weight transfer) is never visible. These are critical coaching signals that the CV pipeline currently cannot provide | Move to a wider field-of-view depth camera (or a slightly different mount angle) to capture the full body, unlocking lower-body analytics and full punch travel without giving up the simplicity of a single camera. Multi-camera was deliberately ruled out as impractical for a portable system |
| Manual camera tilt calibration | The D435i's built-in IMU requires the `hid_sensor_hub` kernel module, which isn't in the Tegra kernel, so the camera's tilt has to be set manually at launch via `--camera-pitch` | Mount a small external IMU on the camera bracket (e.g. an MPU6050 wired through the existing Teensy) to read the tilt automatically and feed it into the CV node |
Highlighted rows are the principal limitations carried into the project-wide discussion in Section 6.2; the two unhighlighted rows are smaller, more contained engineering opportunities that don't materially gate the system's current capability.