The CV subsystem delivers three functions: the action prediction model (core), reaction time detection, and human tracking. The latter two consume processed frames from the action prediction pipeline rather than running independent perception stacks.
The Problem: Predicting Punches Before They Land
All three functions above depend on one thing: the CV model must know what the user is doing before the action finishes. Most action recognition research classifies actions after they are fully performed. This project requires action prediction: identifying the punch type while the motion is still in progress, early enough for the robot to counter-punch. A typical jab completes in under 300 ms.
This means the model must handle punches at very different speeds: a fast jab and a slow hook look nothing alike at any given frame. Three mechanisms address this:
- Depth-based motion detection — the voxel branch uses the depth camera to sense the hand moving forward in 3D, which is the earliest physical signal that a punch is starting, regardless of speed.
- Dual-speed temporal channels — the voxel grid compares snapshots at two timescales: a fast channel (2-frame / ~67 ms) that catches the sharp onset and a slow channel (8-frame / ~267 ms) that captures the full arc. Together they cover both quick jabs and slow hooks.
- Causal Transformer + multi-window training — a causal Transformer (Vaswani et al., 2017) lets the attention mechanism dynamically weigh recent vs. earlier frames instead of relying on a fixed window. During training, the model sees windows of both 9 and 12 frames, forcing it to generalise across different punch durations.
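The dual-speed temporal channels above reduce to simple frame differences over a rolling buffer of voxel occupancy grids. This is a minimal sketch, assuming per-frame 12³ occupancy arrays; the function name and the clamping of lags at the clip start are illustrative, not the project's actual code:

```python
import numpy as np

def motion_channels(occupancy, t, fast_lag=2, slow_lag=8):
    """Build the two temporal-difference channels for frame t.

    occupancy: (T, 12, 12, 12) array of per-frame voxel occupancy.
    Returns a (2, 12, 12, 12) array: [fast delta, slow delta].
    Lags are clamped at the clip start so early frames still yield output.
    """
    fast = np.abs(occupancy[t] - occupancy[max(t - fast_lag, 0)])
    slow = np.abs(occupancy[t] - occupancy[max(t - slow_lag, 0)])
    return np.stack([fast, slow])
```

At 30 fps, a 2-frame lag spans ~67 ms and an 8-frame lag ~267 ms, matching the fast and slow channels described above.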
Physical Constraints
The camera setup introduces additional challenges not found in typical action recognition datasets:
| Constraint | Description |
|---|---|
| Single forward-facing camera | One Intel RealSense D435i (RGB + depth) mounted on the robot, facing the user. No multi-camera triangulation; the system must work from a single viewpoint. |
| Punches toward camera | Unlike side-view datasets, punches travel directly toward the camera. This causes heavy depth foreshortening and makes some punches appear identical from this angle. |
| Glove occlusion | Boxing gloves block the wrist joints at the exact moment of punch extension, precisely when detection matters most for both pose estimation and depth sensing. |
| Partial body visible | The camera sees only the upper body (shoulders to head), not the full body or footwork. |
| Edge deployment | All inference must run on an NVIDIA Jetson Orin NX (16 GB) at 30 fps; no cloud, no external GPU. |
The Pose Estimation Paradox
The natural approach to recognising body movements is pose estimation. But off-the-shelf models like YOLO Pose are trained on COCO (a dataset where hands are bare and clearly visible). Boxing gloves remove everything the model expects to see:
The skeleton breaks or jumps at exactly the wrong moment: during punch extension, when the discriminative motion is happening. Pose estimation alone peaked at ~67% accuracy in early iterations.
- Fix 1: Confidence gating
- YOLO reports a confidence for each joint. When confidence drops (as it does during extension), the model automatically scales down the pose signal and leans on the voxel branch instead. Without this, broken skeleton data would actively mislead the network.
- Fix 2: Pose fails during extension, not preparation (the key insight)
- In the first 100–200 ms before the arm extends, the joints are still visible and they reliably reveal which hand is moving, information voxels alone cannot provide. The model gets the critical left/right call early from pose, then smoothly transitions to voxel-only reasoning as the punch extends. This is the core idea behind the dual-branch fusion architecture.
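Confidence gating (Fix 1) can be sketched as a per-joint scaling of the pose features. The `floor` cutoff and the hard-zeroing below it are illustrative assumptions, not the model's exact gating rule:

```python
import numpy as np

def gate_pose_features(pose_feats, joint_conf, floor=0.2):
    """Scale pose features by per-joint detection confidence.

    pose_feats: (J, F) per-joint feature rows; joint_conf: (J,) in [0, 1].
    When a joint's confidence collapses (e.g. a gloved wrist at full
    extension), its features shrink toward zero, so downstream fusion
    leans on the voxel branch instead of a broken skeleton.
    """
    gate = np.clip(joint_conf, 0.0, 1.0)
    gate = np.where(gate < floor, 0.0, gate)  # hard-zero hopeless joints
    return pose_feats * gate[:, None]
```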
The model classifies 8 actions: jab, cross, left hook, right hook, left uppercut, right uppercut, block, and idle (the neutral baseline that prevents false positives when the user is not punching).
5.3.1.1 Model Architecture
FusionVoxelPoseTransformerModel (~1.75M parameters)
The model combines two complementary ways of understanding a punch, then uses a Transformer to reason about how the motion unfolds over time.
| Component | Input → Output | Params | Role |
|---|---|---|---|
| Voxel Branch (Conv3D stem) | 12³×2 motion channels → 192-dim | 169,200 (9.7%) | Captures where the body moved in 3D space |
| Pose Branch (encoder) | 42 joint features → 64-dim | 6,016 (0.3%) | Captures which body part moved and how |
| Fusion projection | 256-dim (concatenated) → 192-dim | 49,728 (2.8%) | Merges the two branches into a unified representation |
| Transformer encoder | 12 frames (0.4 s) → 384-dim | 1,483,776 (84.9%) | Reasons over how the motion unfolds across time (4 layers, 8 heads, d=192) |
| Classifier head | 384-dim summary → 8 action classes | 38,504 (2.2%) | Final prediction: jab, cross, hooks, uppercuts, block, or idle |
| Total | – | 1,747,224 | Small enough to run at ~8 ms per frame on the Jetson |
Where the parameters live: ~85% sit in the Transformer encoder (the temporal reasoning core), with the rest split across the Voxel Conv3D stem, the small Pose encoder, and the classifier head. The Pose branch in particular is tiny (~6K params) because it only has to project 42 hand-crafted joint features into the fusion space; most of the heavy lifting is done in the Transformer.
Voxel Branch: 3D Motion Sensing
A voxel is the 3D equivalent of a pixel. Just as a pixel is a tiny square that holds a colour value at one (x, y) location on a flat image, a voxel is a tiny cube that holds a value at one (x, y, z) location in 3D space.
Stack enough of them together and you get a 3D grid, the same way a 2D image is a grid of pixels. In this project, the depth camera turns the space in front of it into a 12×12×12 grid of voxels (each roughly 13 cm on a side), and each voxel records how much the user's body has moved through that little region of space between frames. That's how the model "sees" a punch in 3D.
With that out of the way: this branch divides the space around the user into the 12×12×12 voxel grid described above, creating a 3D snapshot of where the body is at each moment. To detect motion, the model compares these snapshots across time using two channels:
- Fast channel (2-frame difference, ~67 ms): detects the sudden onset of a punch
- Slow channel (8-frame difference, ~267 ms): captures the full arc of the punch trajectory
Together, these two channels tell the model both when a punch started and what shape the motion followed: a hook sweeps horizontally while an uppercut rises vertically. A 3D convolutional neural network (Conv3D) compresses this spatial information into a compact 192-dimensional vector per frame.
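Turning depth points into a voxel snapshot is a straightforward quantisation. A minimal sketch, assuming the user's body points are already extracted from the depth frame; the grid origin handling and function name are illustrative:

```python
import numpy as np

GRID = 12    # 12 x 12 x 12 voxels
CELL = 0.13  # ~13 cm per voxel side, per the grid description above

def voxelise(points, origin):
    """Quantise 3D points (metres) into a 12^3 occupancy grid.

    points: (N, 3) user body points from the depth camera.
    origin: (3,) corner of the grid volume in camera coordinates.
    Points falling outside the grid are discarded.
    """
    grid = np.zeros((GRID, GRID, GRID), dtype=np.float32)
    idx = np.floor((points - origin) / CELL).astype(int)
    keep = np.all((idx >= 0) & (idx < GRID), axis=1)
    ix, iy, iz = idx[keep].T
    grid[ix, iy, iz] = 1.0
    return grid
```

Differencing consecutive grids then yields the fast and slow motion channels described above.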
Grid resolution: 12³ (3,456 dims), the lowest resolution that still captured punch motion with enough spatial detail for accurate classification while fitting within the Jetson's real-time budget.
Channel reduction: earlier iterations used 10 voxel channels (directional gradients dx/dy/dz + velocity magnitude), but once pose was added these became redundant; pose already provides explicit directional information, so the voxel representation was cut to just 2 delta channels with no accuracy loss.
Spatial conditioning: the grid is person-centred (anchored to the user's median position with exponential smoothing), depth-weighted so closer voxels carry a stronger signal, and background depth is removed using a 90th-percentile model built from the first 30 frames so only the user's body motion remains.
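The person-centred anchoring and background removal just described can be sketched as follows. The smoothing factor `alpha` and depth `margin` are illustrative values, not the project's calibrated constants:

```python
import numpy as np

class GridAnchor:
    """Person-centred grid anchor with exponential smoothing.

    Tracks the user's median 3D position so the voxel grid follows
    the body without jittering frame to frame.
    """
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.centre = None

    def update(self, body_points):
        med = np.median(body_points, axis=0)
        if self.centre is None:
            self.centre = med
        else:
            self.centre = (1 - self.alpha) * self.centre + self.alpha * med
        return self.centre

def background_mask(calib_frames, live_depth, margin=0.05):
    """Keep only pixels clearly in front of the static background.

    calib_frames: (30, H, W) depth frames from the first second; their
    90th-percentile depth per pixel models the background, as above.
    """
    bg = np.percentile(calib_frames, 90, axis=0)
    return live_depth < (bg - margin)
```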
Pose Branch: Body Joint Tracking
A YOLO26 Nano Pose model (yolo26n-pose, Jocher et al., 2025) detects 7 upper-body joints (nose, shoulders, elbows, wrists) from the RGB camera image at ~16 ms per frame on the Jetson (see the landing page for why YOLO26 was chosen). From these joint positions, 42 features are computed: normalised coordinates, confidence scores, arm extension ratios, shoulder rotation, elbow bend angles, and joint velocities. The same YOLO model is used during both training and live inference to prevent any mismatch between the two.
As described in the Pose Estimation Paradox above, the pose signal is scaled by YOLO's joint confidence (confidence gating) so the model automatically falls back to voxel-only when pose becomes unreliable. This is reinforced during training by dropping the pose input entirely for 15% of clips and zeroing individual joints 10% of the time.
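The training-time reinforcement (dropping pose for 15% of clips, zeroing individual joints 10% of the time) can be sketched like this. The 7-joints-by-6-features layout of the 42-dim vector is an assumption about feature ordering, used here only for illustration:

```python
import numpy as np

def augment_pose(pose_clip, rng, p_clip=0.15, p_joint=0.10, n_joints=7):
    """Pose-dropout augmentation that reinforces confidence gating.

    pose_clip: (T, 42) pose features for one clip. With probability
    p_clip the whole pose stream is zeroed; otherwise each joint's
    feature block is independently zeroed with probability p_joint,
    forcing the model to fall back on the voxel branch.
    """
    pose_clip = pose_clip.copy()
    if rng.random() < p_clip:
        return np.zeros_like(pose_clip)
    feats_per_joint = pose_clip.shape[1] // n_joints
    for j in range(n_joints):
        if rng.random() < p_joint:
            pose_clip[:, j * feats_per_joint:(j + 1) * feats_per_joint] = 0.0
    return pose_clip
```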
The diagram below shows how the two branches feed the fusion projection, then how the Transformer encoder reasons over a 12-frame window before the classifier head emits the final 8-class prediction.
5.3.1.2 Feature Exploration
Development Journey
The final architecture emerged through 9 major iterations, each addressing specific failure modes of the previous approach.
| # | Approach | Accuracy | Key Outcome |
|---|---|---|---|
| 1 | 2D Pose + LSTM | – | Validated pipeline. Training data angles did not match robot-mounted camera. |
| 2 | 3D Pose (RealSense depth lifting) | ~67% | Boxing gloves occlude wrist keypoints at the moment detection matters most. |
| 3 | Raw RGBD + 3D CNNs (Swin3D, 88M params) | 87.2% | Strong recognition accuracy, but model was far too large for edge hardware and required Colab to train. |
| 4 | RGBD + Segmentation + Causal Transformer | – | Still pixel-based, still required Colab for training. |
| 5 | Depth Feature Engineering (Z-Mask grid) | – | 2D depth grid lost 3D spatial info. Hooks vs uppercuts indistinguishable. |
| 6 | Voxel + Colour Tracking | ~96% | Strong offline accuracy, but inflated by a high camera mount that captured the whole body, an angle the deployed robot cannot replicate without raising the camera well above the chassis (enlarging the footprint and needing extra mechanical bracing for stability). Also required red/green gloves, and HSV tracking broke whenever similar colours appeared in the frame. |
| 7 | Voxel-Only Transformer (12³) | 80.6% | Left/right confusion: 19/56 jabs misclassified as cross (34% error). |
| 8 | Voxel + Pose Fusion | 97.3% | Breakthrough. Pose eliminated left/right confusion completely, and unlike iter 6, the model now works with any boxing gloves (or even bare hands), since YOLO Pose locates the wrist joint visually rather than tracking a specific colour. Full circle, pose returned. |
| 9 | Production (v11, Jetson) | 96.8% | Block annotation fix, TensorRT optimisation, parallel processing. Deployed. |
Table: Development iterations, each built on specific failure modes of the previous
How Three Phases Converged
The 9 iterations followed three phases of exploration. Each phase failed on its own, but taught something the final architecture needed:
- Pose-based (iterations 1–2): ✗ wrong camera angle; ✗ glove occlusion blocks wrists.
- Pixel-based (iterations 3–4): ✗ 88M params, too large for Jetson; ✗ still pixel-based, still required Colab.
- Depth/voxel (iterations 5–7): ✗ lost 3D spatial information; ✗ impractical coloured gloves; ✗ left/right confusion (34% jab error). ★ Insights: voxels need hand info, so pose had to return for the left/right call.
- Fusion (iterations 8–9): pose solves left/right confusion, voxels capture 3D motion, full circle. Works with any boxing gloves (no coloured gloves required), since YOLO finds the wrist visually. Production hardening: TensorRT, block annotation fix, parallel threads.
Feature Comparison
Over 9 iterations, different feature representations were explored to find the best way to capture boxing motion from a depth camera. The videos below show the same footage processed through each method, from raw pixels (1) through skeleton-based approaches (2–3), cropped and segmented inputs (4–5), depth-only grids (6–7), to colour-based glove tracking (8). Each approach revealed specific limitations that guided the next iteration, ultimately leading to the voxel + pose fusion architecture. Full technical details for each iteration are documented in Appendix 7.
Inference Results Across Iterations
The videos below show two key earlier iterations side by side. They are doing different tasks (see the recognition vs. prediction distinction at the top of this page), so their accuracies are not directly comparable.
Final Model Inference (Deployed v11)
Both participants were unseen during training. Person 1's looser form lowers confidence on ambiguous punches, while Person 2's cleaner technique lets the model commit to the prediction before the punch lands. Both clips also show the model wobbling during punch retraction (not in the training annotations), a limitation that 5.3.2's CV + IMU fusion cleans up via pad-constraint filtering.
5.3.1.3 Data Collection and Training
Recording Infrastructure
A custom recording tool captures from an Intel RealSense D435i mounted at ~1.5 m height, 1–2 m from the user. Footage is recorded at 60 fps (960×540 RGB + 848×480 depth) and then downsampled to 30 fps for training. Capturing at 60 fps keeps each frame sharp by reducing motion blur during fast punches, while downsampling to 30 fps gives the model enough temporal information to learn the motion without doubling the data and compute cost: the best of both.
| Camera spec | Detail |
|---|---|
| Depth technology | Active stereo IR with built-in projector (works in low light, struggles in direct sunlight) |
| Depth resolution / FPS | Up to 1280×720 @ 90fps (project records 848×480 @ 60fps, downsampled to 30fps for training) |
| RGB resolution / FPS | Up to 1920×1080 @ 30fps (project records 960×540 @ 60fps to match depth alignment, downsampled to 30fps) |
| Depth range / accuracy | ~0.3–3m usable; ~±2% error at 2m, comfortably within the 1–2m boxing distance |
| Field of view (depth) | 87° × 58°, wide enough to capture the full upper body at 1m |
| Built-in IMU | 6-axis (BMI055), not used in this project (Tegra kernel module unavailable on Jetson, see limitations); pad IMUs handle motion sensing instead |
| Interface | USB 3.0, single cable to Jetson |
Problem: early recordings silently dropped depth frames (~15 fps actual versus 30 fps expected) because the laptop couldn't write float32 depth arrays to the external SSD fast enough. Every temporal voxel feature was being computed against corrupted timing.
Fix: a dedicated capture thread that runs independently of disk I/O, paired with an async writer thread feeding off a 300-frame queue buffer. Non-blocking enqueue plus frame-drop diagnostics caught any future regressions.
Cost: all earlier footage had to be re-collected and re-annotated from scratch.
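The fix's structure (capture thread, async writer, bounded queue, drop counter) can be sketched in a few lines. `write_fn` stands in for the real float32 depth writer, and the class shape is illustrative rather than the tool's actual code:

```python
import queue
import threading

def make_writer(frame_queue, write_fn, stop):
    """Writer thread body: drains the queue and writes to disk."""
    def run():
        while not (stop.is_set() and frame_queue.empty()):
            try:
                frame = frame_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            write_fn(frame)
    return run

class CapturePipeline:
    """Non-blocking enqueue with a drop counter, per the fix above.

    The 300-frame queue depth matches the text.
    """
    def __init__(self, write_fn, depth=300):
        self.q = queue.Queue(maxsize=depth)
        self.dropped = 0
        self.stop = threading.Event()
        self.writer = threading.Thread(
            target=make_writer(self.q, write_fn, self.stop))
        self.writer.start()

    def enqueue(self, frame):
        try:
            self.q.put_nowait(frame)  # capture thread never blocks on disk
        except queue.Full:
            self.dropped += 1         # diagnostic: count, don't stall

    def close(self):
        self.stop.set()
        self.writer.join()
```

The key property is conservation: every captured frame is either written or counted as dropped, so silent frame loss can no longer hide.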
The custom recording tool below wraps all of this (the dual-thread capture/writer pipeline, the diagnostics, and a few common camera presets) behind a single one-click interface so a recording session can start without any command-line setup.
Annotation
A custom timeline-based annotation tool (annotate_punches_gui.py) with drag-to-resize segments and auto-save every 30 seconds. Punch segments span from the first preparatory cue (shoulder drop, hip rotation) to full extension; retraction is excluded because it looks similar across punch types.
Almost all of the dataset was annotated manually by me, thousands of segments across the project. Every time something changed (camera angle, recording setup, labelling scheme), days of footage had to be re-recorded and re-annotated from scratch. The annotation GUI (keyboard-driven labelling, drag-to-resize, segment shortcuts, auto-save) was what made this re-annotation loop survivable at the iteration pace the project demanded.
Problem: block annotations originally covered only the onset (arms rising into guard). This caused two live failures, the rising motion looked like an uppercut, and once the user held the guard the model had never seen that pose, so predictions drifted to idle.
Fix: annotate the full guard sequence (rise, hold, and return) so the model learns every phase of a block.
Result: validation accuracy stayed effectively the same (v10: 97.3%, v11: 96.8%), but live block detection improved dramatically. The gain came entirely from the annotation change, not the model. Block remains the lowest class at 66.7% validation accuracy because it is the smallest class, but its real-world reliability is significantly higher.
The screenshot below shows the annotation GUI in action: each coloured bar on the timeline is one labelled punch segment, with start and end frames draggable to refine the boundaries.
Dataset
The dataset uses a leave-one-person-out validation strategy: the model is trained on data from two people and validated on a completely unseen third person. This directly tests whether the model generalises to new users (the real deployment requirement) rather than memorising individual movement patterns.
| Split | Segments | Clips | Notes |
|---|---|---|---|
| Training | 808 | 7 | 2 people with consistent technique |
| Validation | 221 | 1 | Unseen person (leave-one-person-out) |
| Excluded | 221 | 3 | 1 person removed, inconsistent form (see below) |
8 action classes across all splits. Class imbalance (block is smallest) mitigated with class-balanced sampling (each epoch sees roughly equal numbers of every class) and focal loss (γ=1.5, a loss function that down-weights easy examples so the model focuses on rare/hard ones).
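The focal loss with label smoothing mentioned above can be written out explicitly. This is one plausible formulation under the stated γ=1.5 and smoothing 0.03; the project's exact loss may differ in details such as per-class alpha weighting:

```python
import numpy as np

def focal_loss(probs, labels, gamma=1.5, smoothing=0.03, n_classes=8):
    """Focal loss with label smoothing over the 8 action classes.

    probs: (N, 8) softmax outputs; labels: (N,) class indices.
    The (1 - p)^gamma factor down-weights well-classified examples so
    rare/hard classes such as block dominate the gradient.
    """
    eps = 1e-8
    one_hot = np.eye(n_classes)[labels]
    target = one_hot * (1 - smoothing) + smoothing / n_classes
    focal = (1 - probs) ** gamma
    return -(target * focal * np.log(probs + eps)).sum(axis=1).mean()
```

An easy example (high probability on the true class) contributes far less loss than a hard one, which is the intended behaviour.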
Small dataset, big constraint. Although more people contributed footage, only three of us had enough boxing experience to produce consistent technique, so the usable dataset came down to three subjects. With so few, each person's individual quirks became an outsized signal, the leave-one-person-out split left just a single unseen subject for validation, and rare classes like block ended up with only a handful of segments.
That the model still reaches 97.3% validation accuracy under these conditions is what makes the approach interesting; it also points to clear headroom. A larger, more diverse dataset from experienced boxers with cleaner technique would unlock better generalisation, more reliable rare-class detection, and room for deeper architectures the current data cannot support.
Bottleneck: 6 different architectures were tested and all plateaued at around 92.8% validation accuracy. Hyperparameter tuning, augmentation strength, deeper Transformers: none of it moved the needle.
Insight: the bottleneck wasn't the model. One contributor's inconsistent technique was poisoning training and validation alike. Removing their clips and re-validating on a different unseen person was the only thing that hadn't been tried.
Result: accuracy jumped from 92.8% to 97.3% (+4.5%), larger than any architecture change across all 6 models, and the single biggest accuracy gain in the entire project.
The chart below puts the +4.5% data curation jump next to the deltas from every architecture experiment for direct comparison; the data fix is visibly larger than the entire architecture-search effort combined.
Training
| Parameter | Value | What it does |
|---|---|---|
| Epochs | 300 (early stop patience 60) | How many full passes over the training data the model is allowed; training stops automatically if validation accuracy hasn't improved in 60 epochs to avoid wasted compute and overfitting. |
| Batch size | 16 | Number of training samples processed together before updating the model weights, small enough to fit in GPU memory, large enough to give a stable gradient signal. |
| Optimiser | AdamW (LR 0.0005, weight decay 0.01) | The algorithm that adjusts the model weights each step. AdamW is a strong default for Transformers; weight decay gently shrinks unused weights to prevent overfitting. |
| LR schedule | Warmup 20 epochs → cosine decay to 1e-6 | How the learning rate changes over time. Starts small (warmup) so early steps don't destabilise the model, then smoothly decays so later steps make finer and finer adjustments. |
| Loss | Focal loss (γ=1.5) + label smoothing (0.03) | The function the model is trying to minimise. Focal loss down-weights easy examples so training focuses on hard/rare classes; label smoothing makes the model less overconfident, which improves generalisation. |
| Best epoch | 73 / 133 trained | The epoch at which validation accuracy peaked. The model trained for 133 epochs total but the checkpoint from epoch 73 was the one saved and deployed. |
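The warmup-then-cosine schedule in the table can be expressed as a small function of the epoch number. A sketch under the table's stated values (LR 0.0005, 20-epoch warmup, decay to 1e-6); the linear warmup shape is an assumption:

```python
import math

def lr_at(epoch, base_lr=5e-4, warmup=20, total=300, floor=1e-6):
    """Learning rate for a given epoch: linear warmup, then cosine decay.

    Ramps linearly to base_lr over `warmup` epochs, then follows a
    cosine curve from base_lr down to `floor` by the final epoch.
    """
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / max(total - warmup, 1)
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * progress))
```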
Key augmentation: horizontal flip (50%). It exploits boxing's left/right symmetry by flipping the voxel X-axis, swapping left/right pose features, and swapping labels (jab↔cross, left_hook↔right_hook, left_uppercut↔right_uppercut), effectively doubling the dataset. Other augmentations: speed (0.8–1.2×), voxel shift/cutout/stretch, multi-window (9, 12 frames).
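The flip's label bookkeeping is the subtle part: mirroring the space turns a jab into a cross. A minimal sketch; which grid axis is X (here axis 1) is an assumption, and the matching pose-feature column swap is omitted for brevity:

```python
import numpy as np

# Left/right label pairs from the augmentation description above;
# symmetric classes (block, idle) map to themselves.
SWAP = {
    "jab": "cross", "cross": "jab",
    "left_hook": "right_hook", "right_hook": "left_hook",
    "left_uppercut": "right_uppercut", "right_uppercut": "left_uppercut",
    "block": "block", "idle": "idle",
}

def hflip(voxel_clip, label, x_axis=1):
    """Mirror one sample: flip the voxel X-axis and swap the label."""
    return np.flip(voxel_clip, axis=x_axis).copy(), SWAP[label]
```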
The end-to-end pipeline below shows how raw recordings move through annotation, feature extraction (voxel + pose), training on the workstation, and finally TensorRT deployment to the Jetson.
Architecture Search
Before finalising the model configuration, 8 variants were tested systematically to find the optimal balance of architecture, augmentation, and attention mode:
| Variant | Architecture | Key Change | Accuracy |
|---|---|---|---|
| v1 | Causal, d=192 | Heavy augmentation (all maxed) | 86.0% |
| v2 | Causal, d=192 | Moderate augmentation | 91.0% |
| v3 | Causal, d=192 | No horizontal flip | 90.1% |
| v4 | Causal, d=128, 2 layers | Smaller model + mixup | 88.7% |
| v5 | Bidirectional, d=192 | Non-causal attention | 92.8% |
| v6 | Causal, d=192 | Temporal stats features | 88.2% |
| v7 | Causal, d=192 | Explicit pose velocity | 87.3% |
| v8 | Conv1D temporal | Replace attention with convolution | 90.9% |
Table: Architecture search, all variants tested before data curation
Key findings: heavy augmentation causes underfitting (v1 vs v2), smaller models underfit (v4), and Transformer attention outperforms Conv1D temporal encoding (v8). These results preceded the +4.5% data curation improvement, which pushed the final model to 97.3%.
5.3.1.4 Ablation Study
| Mode | Validation Accuracy | What It Captures |
|---|---|---|
| Pose only | 44.1% | Joint positions but no spatial/3D motion context |
| Voxel only | 80.6% | 3D motion trajectories but cannot distinguish left from right |
| Both fused | 97.3% | Complete picture, spatial motion + joint-level discrimination |
The branches provide genuinely complementary information. Without pose, 34% of jabs were misclassified as cross because both produce identical "straight forward" voxel patterns, only wrist positions from the pose stream reveal which hand is punching. The voxel branch in turn captures 3D arc trajectories that distinguish hooks from uppercuts, which 2D pose alone cannot represent. Fusion eliminated left/right confusion completely, yielding a +16.7% gain over voxel-only.
5.3.1.5 Secondary CV Functions
The same YOLO pose output and depth bounding box that feed the action model are reused by two lighter-weight functions; neither runs an independent perception stack, so they add negligible compute on top of what the action model already costs.
- Human Tracking: bounding box centre, width, depth, and lateral/depth displacement are published each frame as a UserTracking message. Downstream nodes use this to drive the yaw motor so the robot continuously faces the user as they move laterally, provide height data so the user can adjust the robot's height via the phone or GUI, and classify slips and dodges for defence detection. The yaw and height motor command pipelines were validated end-to-end using the Teensy simulator (5.3.4.6), where the robot status row displays the live motor commands being sent, confirming correct tracking behaviour without requiring physical hardware.
- Reaction Time Detection: depth-based proximity replaces the velocity thresholds that proved unreliable at the CDE Fair. When a stimulus fires, the system timestamps the moment the user's hand depth crosses a calibrated threshold toward the robot, giving millisecond-accurate reaction times.
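The reaction-time logic reduces to finding the first post-stimulus frame whose hand depth crosses the calibrated threshold. This frame-indexed sketch is illustrative (the real system timestamps frames directly, giving finer granularity than the 30 fps frame period):

```python
def reaction_time_ms(hand_depths, stimulus_idx, threshold, fps=30):
    """Depth-threshold reaction timing.

    hand_depths: per-frame hand distance to the camera (metres);
    stimulus_idx: frame at which the stimulus fired; threshold:
    calibrated depth the hand must cross toward the robot.
    Returns elapsed milliseconds, or None if the hand never crossed.
    """
    for i in range(stimulus_idx, len(hand_depths)):
        if hand_depths[i] < threshold:
            return (i - stimulus_idx) * 1000.0 / fps
    return None
```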
Both functions read directly off the bounding box and 7 upper-body keypoints shown below: the same per-frame YOLO output the action model already consumes.