The CV subsystem delivers three functions: the action prediction model (core), reaction time detection, and human tracking. The latter two consume processed frames from the action prediction pipeline rather than running independent perception stacks.
The Problem: Predicting Punches Before They Land
All three functions above depend on one thing: the CV model must know what the user is doing before the action finishes. Most action recognition research classifies actions after they are fully performed. This project requires action prediction: identifying the punch type while the motion is still in progress, early enough for the robot to counter-punch. A typical jab completes in under 300 ms.
This means the model must handle punches at very different speeds: a fast jab and a slow hook look nothing alike at any given frame. Three mechanisms address this:
- Depth-based motion detection: the voxel branch uses the depth camera to sense the hand moving forward in 3D, which is the earliest physical signal that a punch is starting, regardless of speed.
- Dual-speed temporal channels: the voxel branch reads motion at two timescales, one sensitive to a sudden punch onset and one that captures the full arc, so fast jabs and slow hooks are both resolved (details in 5.3.1.1 Voxel Branch).
- Causal Transformer + multi-window training: a causal Transformer (Vaswani et al., 2017) lets the attention mechanism dynamically weigh recent vs. earlier frames instead of relying on a fixed window. During training, the model sees windows of both 9 and 12 frames, forcing it to generalise across different punch durations.
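The causal constraint in the last bullet can be pictured as a lower-triangular attention mask: each frame may look at itself and earlier frames, never at later ones. The sketch below is illustrative only; the function name `causal_mask` is hypothetical, and the deployed model presumably builds its mask inside its deep-learning framework rather than with nested lists.

```python
def causal_mask(num_frames: int) -> list[list[bool]]:
    """mask[t][s] is True when frame t may attend to frame s (only s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


# The two training window lengths mentioned above.
for window in (9, 12):
    mask = causal_mask(window)
    # The first frame sees only itself; the last frame sees the whole window.
    print(window, sum(mask[0]), sum(mask[-1]))
```

Because the mask depends only on the window length, the same weights can be trained on 9-frame and 12-frame windows without any architectural change.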
Physical Constraints
The camera setup introduces additional challenges not found in typical action recognition datasets:
| Constraint | Impact |
|---|---|
| Single forward-facing camera | One Intel RealSense D435i (RGB + depth) mounted on the robot, facing the user. Budget constraints rule out multi-camera setups typical in sports analytics, so the system must work from a single viewpoint. |
| Punches toward camera | Unlike side-view datasets, punches travel directly toward the camera. This causes heavy depth foreshortening and makes some punches appear identical from this angle. |
| Glove occlusion | Boxing gloves block the wrist joints at the exact moment of punch extension, precisely when detection matters most for both pose estimation and depth sensing. |
| Partial body visible | The camera sees only the upper body (shoulders to head), not the full body or footwork. |
| Edge deployment | All inference must run on an NVIDIA Jetson Orin NX (16 GB) at 30fps, no cloud, no external GPU. |
The model classifies 8 actions: jab, cross, left hook, right hook, left uppercut, right uppercut, block, and idle (the neutral baseline that prevents false positives when the user is not punching).
5.3.1.1 Model Architecture
FusionVoxelPoseTransformerModel (~1.75M parameters)
The model combines two complementary ways of understanding a punch, then uses a Transformer to reason about how the motion unfolds over time.
| Component | Input → Output | Params | Role |
|---|---|---|---|
| Voxel Branch (Conv3D stem) | 12³×2 motion channels → 192-dim | 169,200 (9.7%) | Captures where the body moved in 3D space |
| Pose Branch (encoder) | 42 joint features → 64-dim | 6,016 (0.3%) | Captures which body part moved and how |
| Fusion projection | 256-dim (concatenated) → 192-dim | 49,728 (2.8%) | Merges the two branches into a unified representation |
| Transformer encoder | 12 frames (0.4 s) → 384-dim | 1,483,776 (84.9%) | Reasons over how the motion unfolds across time (4 layers, 8 heads, d=192) |
| Classifier head | 384-dim summary → 8 action classes | 38,504 (2.2%) | Final prediction: jab, cross, hooks, uppercuts, block, or idle |
| Total | – | 1,747,224 | Small enough to run at ~8 ms per frame on the Jetson |
Where the parameters live: ~85% sit in the Transformer encoder (the temporal reasoning core), with the rest split across the Voxel Conv3D stem, the fusion projection, the small Pose encoder, and the classifier head. The Pose branch in particular is tiny (~6K params) because it only has to project 42 hand-crafted joint features into the fusion space; most of the heavy lifting is done in the Transformer.
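As a quick arithmetic check, the per-component counts in the table can be summed and converted to shares. The dictionary keys are shorthand labels chosen for this sketch; the numbers are copied from the table.

```python
params = {
    "voxel_branch": 169_200,
    "pose_branch": 6_016,
    "fusion_projection": 49_728,
    "transformer_encoder": 1_483_776,
    "classifier_head": 38_504,
}

total = sum(params.values())
# Share of the total parameter budget per component, in percent.
shares = {name: round(100 * count / total, 1) for name, count in params.items()}

print(total)   # 1747224
print(shares)
```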
Voxel Branch, 3D Motion Sensing
A voxel is the 3D equivalent of a pixel. Just as a pixel is a tiny square that holds a colour value at one (x, y) location on a flat image, a voxel is a tiny cube that holds a value at one (x, y, z) location in 3D space.
Stack enough of them together and you get a 3D grid, the same way a 2D image is a grid of pixels. In this project, the depth camera turns the space in front of it into a 12×12×12 grid of voxels (each roughly 13 cm on a side), and each voxel records how much the user's body has moved through that little region of space between frames. That's how the model "sees" a punch in 3D.
This branch divides the space around the user into the 12×12×12 voxel grid described above, creating a 3D snapshot of where the body is at each moment. To detect motion, the model compares these snapshots across time using two channels:
- Fast channel (2-frame difference, ~67 ms): detects the sudden onset of a punch
- Slow channel (8-frame difference, ~267 ms): captures the full arc of the punch trajectory
Together, these two channels tell the model both when a punch started and what shape the motion followed: a hook sweeps horizontally while an uppercut rises vertically. A 3D convolutional neural network (Conv3D) compresses this spatial information into a compact 192-dimensional vector per frame.
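The two temporal channels can be sketched as per-voxel differences against frames 2 and 8 steps back. This is a minimal illustration, not the project's code: the function name `motion_channels`, the flattened-list grid layout, the signed difference, and the fallback to frame 0 at the start of a clip are all assumptions.

```python
FPS = 30
FAST_LAG, SLOW_LAG = 2, 8  # frame differences quoted in the text


def motion_channels(history, t):
    """Return (fast, slow) per-voxel differences for frame index t.

    `history` is a sequence of flattened occupancy grids (hypothetical layout).
    Lags that would reach before the first frame fall back to frame 0.
    """
    current = history[t]
    fast_ref = history[max(t - FAST_LAG, 0)]
    slow_ref = history[max(t - SLOW_LAG, 0)]
    fast = [c - r for c, r in zip(current, fast_ref)]
    slow = [c - r for c, r in zip(current, slow_ref)]
    return fast, slow


# Toy one-voxel "grid" whose occupancy rises by 1 every frame.
history = [[float(i)] for i in range(12)]
fast, slow = motion_channels(history, 11)
print(fast, slow)  # [2.0] [8.0]
print(round(1000 * FAST_LAG / FPS), round(1000 * SLOW_LAG / FPS))  # 67 267
```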
Grid resolution: 12³ voxels × 2 channels (3,456 dims), the lowest resolution that still captured punch motion with enough spatial detail for accurate classification while fitting within the Jetson's real-time budget.
Channel reduction: earlier iterations used 10 voxel channels (directional gradients dx/dy/dz + velocity magnitude), but once pose was added these became redundant. Pose already provides explicit directional information, so the voxel input was cut to just 2 delta channels with no accuracy loss.
Spatial conditioning: the grid is person-centred (anchored to the user's median position with exponential smoothing), depth-weighted so closer voxels carry a stronger signal, and background depth is removed using a 90th-percentile model from the first 30 frames so only the user's body motion remains.
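One way to read the background-removal step is as a per-pixel depth percentile taken over the calibration frames, with anything not clearly in front of that surface zeroed out. The sketch below makes several assumptions that the text does not confirm: the model is per-pixel, the percentile uses the nearest-rank definition, and a 10 cm margin (`margin_m`) separates foreground from background. All function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def build_background(depth_frames, q=90):
    """Per-pixel q-th percentile depth over the calibration frames."""
    num_pixels = len(depth_frames[0])
    return [percentile([f[i] for f in depth_frames], q) for i in range(num_pixels)]


def remove_background(frame, background, margin_m=0.10):
    """Keep only valid pixels that sit clearly in front of the background."""
    return [d if 0 < d < b - margin_m else 0.0 for d, b in zip(frame, background)]


# Toy two-pixel depth image: a wall 3 m away in all 30 calibration frames.
calibration = [[3.0, 3.0] for _ in range(30)]
background = build_background(calibration)
print(remove_background([1.2, 3.0], background))  # [1.2, 0.0]
```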
Pose Branch, Body Joint Tracking
A YOLO26 Nano Pose model (yolo26n-pose, Jocher et al., 2025) detects 7 upper-body joints (nose, shoulders, elbows, wrists) from the RGB camera image at ~16 ms per frame on the Jetson (see landing page for why YOLO26 was chosen). From these joint positions, 42 features are computed: normalised coordinates, confidence scores, arm extension ratios, shoulder rotation, elbow bend angles, and joint velocities. The same YOLO model is used during both training and live inference to prevent any mismatch between the two.
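Two of the listed features, elbow bend angle and arm extension ratio, can be derived from three 2D keypoints. The exact definitions used in the project are not given, so the formulas below are assumptions: the angle is measured at the elbow between the upper arm and forearm, and the extension ratio is the straight shoulder-to-wrist distance divided by the total arm length.

```python
import math


def elbow_angle_deg(shoulder, elbow, wrist):
    """Angle at the elbow in degrees (180 = fully straight arm)."""
    ax, ay = shoulder[0] - elbow[0], shoulder[1] - elbow[1]
    bx, by = wrist[0] - elbow[0], wrist[1] - elbow[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0:
        return 0.0  # degenerate case: coincident joints
    cos_angle = max(-1.0, min(1.0, (ax * bx + ay * by) / norm))
    return math.degrees(math.acos(cos_angle))


def extension_ratio(shoulder, elbow, wrist):
    """Shoulder-to-wrist reach divided by upper arm + forearm length."""
    reach = math.dist(shoulder, wrist)
    arm_length = math.dist(shoulder, elbow) + math.dist(elbow, wrist)
    return reach / arm_length if arm_length else 0.0


straight = ((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
bent = ((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
print(elbow_angle_deg(*straight), extension_ratio(*straight))
print(elbow_angle_deg(*bent), round(extension_ratio(*bent), 3))
```

A straight arm gives an angle near 180° and a ratio near 1, while a bent arm lowers both.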
The Pose Estimation Paradox
With both branches now defined, the natural question is why the model needs both at all, rather than relying on pose alone (the standard approach for body-movement recognition). Off-the-shelf YOLO Pose is trained on COCO, a dataset where hands are bare and clearly visible. Boxing gloves remove everything the model expects to see.
The skeleton breaks or jumps at exactly the wrong moment: during punch extension, when the discriminative motion is happening. Pose estimation alone peaked at ~67% accuracy in early iterations.
- Fix 1: Confidence gating. YOLO reports a confidence for each joint. When confidence drops (as it does during extension), the model automatically scales down the pose signal and leans on the voxel branch instead, so broken skeleton data never actively misleads the network.
- Fix 2: The key insight is that pose fails during extension, not preparation. In the first 100–200 ms before the arm extends, the joints are still visible and they reliably reveal which hand is moving, information voxels alone cannot provide. The model gets the critical left/right call early from pose, then smoothly transitions to voxel-only reasoning as the punch extends, which is precisely what the dual-branch fusion below is designed to exploit.
In practice the confidence gating above is reinforced during training by dropping the pose input entirely for 15% of clips and zeroing individual joints 10% of the time, so the network never becomes dependent on a pose signal it cannot trust at inference.
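The gating and the training-time pose dropout can be sketched as follows. This is one possible reading, not the project's implementation: scaling each joint's features by its own confidence, and applying the 10% joint dropout independently per joint per frame, are both assumptions, and the function names are hypothetical.

```python
import random


def gate_pose(joints, confidences):
    """Scale each joint's features by that joint's detection confidence."""
    return [[v * c for v in joint] for joint, c in zip(joints, confidences)]


def pose_dropout(clip, rng, clip_p=0.15, joint_p=0.10):
    """Training-only augmentation on a clip of frames of per-joint features.

    With probability clip_p the whole pose input is zeroed; otherwise each
    joint in each frame is zeroed independently with probability joint_p.
    """
    if rng.random() < clip_p:
        return [[[0.0] * len(joint) for joint in frame] for frame in clip]
    return [
        [[0.0] * len(joint) if rng.random() < joint_p else list(joint) for joint in frame]
        for frame in clip
    ]


rng = random.Random(42)  # fixed seed, mirroring the seed used for training
clip = [[[0.5, 0.5], [0.2, 0.8]] for _ in range(12)]  # 12 frames, 2 toy joints
augmented = pose_dropout(clip, rng)
print(len(augmented), len(augmented[0]))  # 12 2
print(gate_pose([[1.0, 2.0]], [0.5]))     # [[0.5, 1.0]]
```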
The diagram below shows how the two branches feed the fusion projection, then how the Transformer encoder reasons over a 12-frame window before the classifier head emits the final 8-class prediction.
5.3.1.2 Feature Exploration
Development Journey
The final architecture emerged through 9 major iterations, each addressing specific failure modes of the previous approach.
| # | Approach | Accuracy | Key Outcome |
|---|---|---|---|
| 1 | 2D Pose + LSTM | – | Validated pipeline. Training data angles did not match robot-mounted camera. |
| 2 | 3D Pose (RealSense depth lifting) | ~67% | Boxing gloves occlude wrist keypoints at the moment detection matters most. |
| 3 | Raw RGBD + 3D CNNs (Swin3D, 88M params) | 87.2% | Strong recognition accuracy, but model was far too large for edge hardware and required Colab to train. |
| 4 | RGBD + Segmentation + Causal Transformer | – | Still pixel-based, still required Colab for training. |
| 5 | Depth Feature Engineering (Z-Mask grid) | – | 2D depth grid lost 3D spatial info. Hooks vs uppercuts indistinguishable. |
| 6 | Voxel + Colour Tracking | ~96% | Strong offline accuracy, but inflated by a high camera mount that captured the whole body, an angle the deployed robot cannot replicate without raising the camera well above the chassis (enlarging the footprint and needing extra mechanical bracing for stability). Also required red/green gloves, and HSV tracking broke whenever similar colours appeared in the frame. |
| 7 | Voxel-Only Transformer (12³) | 80.6% | Left/right confusion: 19/56 jabs misclassified as cross (34% error). |
| 8 | Voxel + Pose Fusion | 97.3% | Breakthrough. Pose eliminated left/right confusion completely, and unlike iter 6, the model now works with any boxing gloves (or even bare hands), since YOLO Pose locates the wrist joint visually rather than tracking a specific colour. Full circle, pose returned. |
| 9 | Production (v11, Jetson) | 96.8% | Block annotation fix, TensorRT optimisation, parallel processing. Deployed. |
Table: Development iterations, each built on specific failure modes of the previous
How Three Phases Converged
The 9 iterations followed three phases of exploration. Each phase failed on its own, but taught something the final architecture needed:
- ✗ Iteration 1: wrong camera angle
- ✗ Iteration 2: glove occlusion blocks wrists
- ✗ Iteration 3: 88M params, too large for Jetson
- ✗ Iteration 4: still pixel-based, still required Colab
- ✗ Iteration 5: lost 3D spatial information
- ✗ Iteration 6: impractical coloured gloves
- ★ Insight: voxels need hand info
- ✗ Iteration 7: left/right confusion (34% jab error)
- ★ Insight: need pose back for L/R
- Iteration 8: pose solves left/right confusion, voxels capture 3D motion. Full circle. Works with any boxing gloves, no coloured gloves required, since YOLO finds the wrist visually.
- Iteration 9: TensorRT, block annotation fix, parallel threads
Feature Comparison
Over 9 iterations, different feature representations were explored to find the best way to capture boxing motion from a depth camera. The videos below show the same footage processed through each method, from raw pixels (1) through skeleton-based approaches (2–3), cropped and segmented inputs (4–5), depth-only grids (6–7), to colour-based glove tracking (8). Each approach revealed specific limitations that guided the next iteration, ultimately leading to the voxel + pose fusion architecture. Full technical details for each iteration are documented in Appendix 7.
Inference Results Across Iterations
The videos below show two key earlier iterations side by side. The RGBD model (left) is too heavy to run live on the Jetson, so inference is shown offline on recorded clips; the lighter voxel+colour model (right) runs live on-device.
Final Model Inference (Deployed v11)
Both participants were unseen during training. Person 1's looser form lowers confidence on ambiguous punches, while Person 2's cleaner technique lets the model commit to the prediction before the punch lands. Both clips also show the model wobbling during punch retraction (not in the training annotations), a limitation that 5.3.2's CV + IMU fusion cleans up via pad-constraint filtering.
5.3.1.3 Data Collection and Training
Recording Infrastructure
A custom recording tool captures from an Intel RealSense D435i mounted at ~1.5 m height, 1–2 m from the user. Footage is recorded at 60fps (960×540 RGB + 848×480 depth) and then downsampled to 30fps for training. Capturing at 60fps keeps each frame sharp by reducing motion blur during fast punches, while downsampling to 30fps gives the model enough temporal information to learn the motion without doubling the data and compute cost.
| Specification | Value |
|---|---|
| Depth technology | Active stereo IR with built-in projector (works in low light, struggles in direct sunlight) |
| Depth resolution / FPS | Up to 1280×720; up to 90fps (project records 848×480 @ 60fps, downsampled to 30fps for training) |
| RGB resolution / FPS | Up to 1920×1080 @ 30fps (project records 960×540 @ 60fps to match depth alignment, downsampled to 30fps) |
| Depth range / accuracy | ~0.3–3 m usable; ~±2% error at 2 m, comfortably within the 1–2 m boxing distance |
| Field of view (depth) | 87° × 58°, wide enough to capture the full upper body at 1 m |
| Built-in IMU | 6-axis (BMI055), not used in this project (Tegra kernel module unavailable on Jetson, see limitations); pad IMUs handle motion sensing instead |
| Interface | USB 3.0, single cable to Jetson |
Problem: early recordings silently dropped depth frames, ~15 fps actual versus 30 fps expected, because the laptop couldn't write float32 depth arrays to the external SSD fast enough. Every temporal voxel feature was being computed against corrupted timing.
Fix: a dedicated capture thread that runs independently of disk I/O, paired with an async writer thread feeding off a 300-frame queue buffer. Non-blocking enqueue plus frame-drop diagnostics catch any future regressions.
Cost: all earlier footage had to be re-collected and re-annotated from scratch.
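The capture/writer split described in the fix can be sketched with Python's standard `queue` and `threading` modules. This is a simplified illustration rather than the recording tool's code: `FrameRecorder` and `write_fn` are hypothetical names, and the capture side runs in the main thread here for brevity instead of a dedicated capture thread.

```python
import queue
import threading

QUEUE_SIZE = 300  # buffer length quoted in the text


class FrameRecorder:
    """Decouples frame capture from disk writes via a bounded queue."""

    def __init__(self, write_fn, maxsize=QUEUE_SIZE):
        self._queue = queue.Queue(maxsize=maxsize)
        self._write_fn = write_fn  # hypothetical disk-writing callback
        self.dropped = 0           # frame-drop diagnostic
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def submit(self, frame):
        """Called by the capture side; returns immediately even if the disk is slow."""
        try:
            self._queue.put_nowait(frame)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            frame = self._queue.get()
            if frame is None:  # sentinel: stop writing
                break
            self._write_fn(frame)

    def close(self):
        self._queue.put(None)  # blocking put, so the sentinel is never dropped
        self._writer.join()


written = []
recorder = FrameRecorder(written.append)
for frame_id in range(100):  # stands in for the camera capture loop
    recorder.submit(frame_id)
recorder.close()
print(len(written), recorder.dropped)  # 100 0
```

The design choice is that a slow disk shows up as a non-zero `dropped` counter instead of silently stretching the time between frames.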
The custom recording tool below wraps all of this, the dual-thread capture/writer pipeline, the diagnostics, and a few common camera presets, behind a single one-click interface so a recording session can start without any command-line setup.
Annotation
Annotation uses a custom timeline-based tool (annotate_punches_gui.py) with drag-to-resize segments and auto-save every 30 seconds. Punch segments span from the first preparatory cue (shoulder drop, hip rotation) to full extension; retraction is excluded because it looks similar across punch types.
Almost all of the dataset was annotated manually by me, thousands of segments across the project. Every time something changed (camera angle, recording setup, labelling scheme), days of footage had to be re-recorded and re-annotated from scratch. The annotation GUI (keyboard-driven labelling, drag-to-resize, segment shortcuts, auto-save) was what made this re-annotation loop survivable at the iteration pace the project demanded.
Problem: block annotations originally covered only the onset (arms rising into guard). This caused two live failures: the rising motion looked like an uppercut, and once the user held the guard the model had never seen that pose, so predictions drifted to idle.
Fix: annotate the full guard sequence (rise, hold, and return) so the model learns every phase of a block.
Result: validation accuracy stayed effectively the same (v10: 97.3%, v11: 96.8%), but live block detection improved dramatically. The gain came entirely from the annotation change, not the model. Block remains the lowest class at 66.7% validation accuracy because it is the smallest class, but its real-world reliability is significantly higher.
The screenshot below shows the annotation GUI in action: each coloured bar on the timeline is one labelled punch segment, with start and end frames draggable to refine the boundaries.
Dataset
The dataset uses a leave-one-person-out validation strategy: the model is trained on data from two people and validated on a completely unseen third person. This directly tests whether the model generalises to new users (the real deployment requirement) rather than memorising individual movement patterns.
| Split | Segments | Clips | Notes |
|---|---|---|---|
| Training | 808 | 7 | 2 people with consistent technique |
| Validation | 221 | 1 | Unseen person (leave-one-person-out) |
| Excluded | 221 | 3 | 1 person removed, inconsistent form (see below) |
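The split logic amounts to grouping clips by person and holding one person out entirely. In the sketch below, the person labels and the way the 7 training clips are divided between the two training subjects are invented for illustration; only the 7/1 clip counts come from the table.

```python
def leave_one_person_out(clips, held_out):
    """Split clips so that every clip from `held_out` goes to validation."""
    train = [c for c in clips if c["person"] != held_out]
    val = [c for c in clips if c["person"] == held_out]
    return train, val


# Hypothetical clip index: 7 training clips from two people, 1 clip from a third.
clips = (
    [{"person": "A", "clip": i} for i in range(4)]
    + [{"person": "B", "clip": i} for i in range(3)]
    + [{"person": "C", "clip": 0}]
)
train, val = leave_one_person_out(clips, held_out="C")
train_people = {c["person"] for c in train}
val_people = {c["person"] for c in val}
print(len(train), len(val), train_people.isdisjoint(val_people))  # 7 1 True
```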
8 action classes across all splits. Class imbalance (block is smallest) mitigated with class-balanced sampling (each epoch sees roughly equal numbers of every class) and focal loss (γ=1.5, a loss function that down-weights easy examples so the model focuses on rare/hard ones).
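For reference, focal loss for a single example is the cross-entropy term scaled by (1 − p_t)^γ, where p_t is the predicted probability of the true class. The sketch below is a plain-Python version with γ = 1.5 as stated above; it omits the label smoothing and class-balanced sampling used in the project, and the function name is hypothetical.

```python
import math


def focal_loss(logits, target, gamma=1.5):
    """Focal loss for one example: -(1 - p_t)**gamma * log(p_t)."""
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    p_t = exps[target] / sum(exps)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [4.0, 0.0, 0.0]   # confident and correct when target is class 0
hard = [0.0, 0.0, 0.0]   # uniform prediction
print(focal_loss(easy, 0) < focal_loss(hard, 0))                    # True
print(focal_loss(easy, 0, gamma=0.0) > focal_loss(easy, 0))         # True
```

Setting `gamma=0.0` recovers ordinary cross-entropy, which makes the down-weighting of easy examples easy to see by comparison.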
Small dataset, big constraint. Although more people contributed footage, only three of us had enough boxing experience to produce consistent technique, so the usable dataset came down to three subjects. With so few, each person's individual quirks became an outsized signal, the leave-one-person-out split left just a single unseen subject for validation, and rare classes like block ended up with only a handful of segments.
That the model still reaches 97.3% validation accuracy under these conditions is what makes the approach interesting, but it also points to clear headroom. A larger, more diverse dataset from experienced boxers with cleaner technique would unlock better generalisation, more reliable rare-class detection, and room for deeper architectures the current data cannot support.
Bottleneck: 6 different architectures were tested and all plateaued at around 92.8% validation accuracy. Hyperparameter tuning, augmentation strength, deeper transformers, none of it moved the needle.
Insight: the bottleneck wasn't the model. One contributor's inconsistent technique was poisoning training and validation alike. Removing their clips and re-validating on a different unseen person was the only thing that hadn't been tried.
Result: accuracy jumped from 92.8% to 97.3% (+4.5 percentage points), larger than any architecture change across all 6 models, and the single biggest accuracy gain in the entire project.
The chart below puts the +4.5-point data curation jump next to the deltas from every architecture experiment for direct comparison: the data fix is visibly larger than the entire architecture-search effort combined.
Training
The model was trained for 133 epochs with AdamW (LR 0.0005, weight decay 0.01), a warmup + cosine learning-rate schedule, and focal loss + label smoothing to handle class imbalance. The deployed checkpoint came from epoch 73, the best validation accuracy point. Horizontal flip (50%) was the key augmentation, exploiting boxing's left/right symmetry to effectively double the dataset; training used a fixed seed of 42 for reproducibility. Full hyperparameter table in Appendix 7.I.
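A warmup + cosine schedule of the kind mentioned above can be written as a single function of the epoch index. The base learning rate (0.0005) and epoch count (133) come from the text; the 5-epoch linear warmup and the decay to zero are assumptions made for this sketch, since the actual values are listed only in the appendix.

```python
import math


def lr_at_epoch(epoch, base_lr=5e-4, total_epochs=133, warmup_epochs=5, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay towards min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, end of warmup, the deployed epoch, and the final epoch.
for epoch in (0, 4, 73, 132):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```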
Architecture Search
Before finalising the configuration, 8 variants were tested to find the best balance of attention mode, model size, and augmentation strength. Key takeaways: heavy augmentation underfits (v1 86.0% vs v2 91.0%), smaller models underfit (v4 88.7%), and Transformer attention outperforms Conv1D temporal encoding (v8 90.9%). These architecture sweeps preceded the +4.5-point data-curation improvement that pushed the final model to 97.3%. Full 8-variant table in Appendix 7.I.
5.3.1.4 Ablation Study
| Mode | Validation Accuracy | What It Captures |
|---|---|---|
| Pose only | 44.1% | Joint positions but no spatial/3D motion context |
| Voxel only | 80.6% | 3D motion trajectories but cannot distinguish left from right |
| Both fused | 97.3% | Complete picture, spatial motion + joint-level discrimination |
The branches provide genuinely complementary information. Without pose, 34% of jabs were misclassified as cross because both produce identical "straight forward" voxel patterns; only wrist positions from the pose stream reveal which hand is punching. The voxel branch in turn captures 3D arc trajectories that distinguish hooks from uppercuts, which 2D pose alone cannot represent. Fusion eliminated left/right confusion completely, yielding a gain of 16.7 percentage points over voxel-only.
5.3.1.5 Secondary CV Functions
Two lighter-weight functions reuse the same per-frame YOLO pose output and depth bounding box that feed the action model, so they add negligible compute on top of what the action model already costs.
Human Tracking
What it publishes. Bounding-box centre, width, depth, and lateral/depth displacement, emitted every frame as a UserTracking message.
What uses it. The yaw motor rotates to face the user as they move laterally, and the slips / dodges classifier uses the same displacement signal for defence detection.
Height adjustment. Manual in the current system; using CV-derived user height to drive the height mechanism automatically is future work in Section 6.3.
Validation. Off-hardware via the Teensy simulator (5.3.4.6), on-hardware via the yaw-tracking video in 5.3.4.5 (safe rotation speed demonstrated).
Reaction Time Detection
How it works. When a stimulus fires, the system timestamps the moment the user's hand depth crosses a calibrated threshold toward the robot, giving millisecond-accurate reaction times.
Why depth-based. Depth-based proximity replaces the velocity thresholds that proved unreliable at the CDE Fair.
Validation. The reaction-time drill video in 5.3.4.5 shows end-to-end behaviour and sensitivity tuning that correctly ignores small body movements.
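The depth-threshold timing described above reduces to finding the first post-stimulus sample in which the hand is at or inside the threshold distance. The sketch below is a simplified illustration with hypothetical names and toy values; it timestamps at frame boundaries only, so its resolution is one frame interval, and it does not model the calibration or sensitivity tuning of the deployed system.

```python
FPS = 30


def reaction_time_ms(stimulus_t, samples, threshold_m):
    """Return ms from stimulus to the first sample at or inside the threshold.

    `samples` is a time-ordered list of (timestamp_s, hand_depth_m).
    Returns None if the hand never crosses the threshold after the stimulus.
    """
    for t, depth in samples:
        if t >= stimulus_t and depth <= threshold_m:
            return (t - stimulus_t) * 1000.0
    return None


# Toy trace: hand rests at 1.2 m, then moves toward the robot from frame 8 on.
depths = [1.2] * 8 + [1.0, 0.7, 0.5]
samples = [(i / FPS, d) for i, d in enumerate(depths)]
print(reaction_time_ms(0.0, samples, threshold_m=0.8))
```

In this toy trace the threshold is first crossed at frame 9, roughly 300 ms after the stimulus.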