
5.3.1 CV & Action Prediction

RI-1, RI-3, RI-5, RI-8

The CV subsystem delivers three functions: the action prediction model (core), reaction time detection, and human tracking. The latter two consume processed frames from the action prediction pipeline rather than running independent perception stacks.

Performance Analytics: unified punch power, accuracy, and timing metrics with sensor-CV synchronisation. CV function: reaction time detection.
Adaptive Fight Intelligence: counterpunching that adapts dynamically to user attack patterns and session performance. CV function: human tracking.
Intelligent Sparring System: active arm actuation and haptic feedback simulating real incoming strikes, triggered by CV. CV function: counter-punch prediction.
Figure: Product needs and the CV functions that feed them. The CV pipeline provides the data; downstream subsections (5.3.2, 5.3.3) fulfil the needs.

The Problem: Predicting Punches Before They Land

All three functions above depend on one thing: the CV model must know what the user is doing before the action finishes. Most action recognition research classifies actions after they are fully performed. This project requires action prediction: identifying the punch type while the motion is still in progress, early enough for the robot to counter-punch. A typical jab completes in under 300 ms.

Action recognition (standard)
Watch the entire clip, then label what happened. Sees the full motion before deciding.
Action prediction (this project)
Classify frame-by-frame using only the frames seen so far (causal). Must commit before the punch finishes.
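The causal constraint can be made concrete with an attention mask: at frame t, the Transformer may attend only to frames up to t, so every prediction uses just the motion seen so far. A minimal NumPy sketch (the function name is illustrative, not taken from the project code):

```python
import numpy as np

def causal_attention_mask(num_frames: int) -> np.ndarray:
    """True = attention blocked. Frame t may attend only to frames <= t,
    so the model must commit using only the motion seen so far."""
    return np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)

mask = causal_attention_mask(12)   # one 12-frame (0.4 s) window
```

A standard recognition model would use no mask at all, letting every frame see the full clip.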

This means the model must handle punches at very different speeds: a fast jab and a slow hook look nothing alike at any given frame. Several mechanisms, described in the sections below, address this.

Physical Constraints

The camera setup introduces additional challenges not found in typical action recognition datasets:

Single forward-facing camera: one Intel RealSense D435i (RGB + depth) mounted on the robot, facing the user. With no multi-camera triangulation, the system must work from a single viewpoint.
Punches toward camera: unlike side-view datasets, punches travel directly toward the camera, causing heavy depth foreshortening and making some punch types appear identical from this angle.
Glove occlusion: boxing gloves block the wrist joints at the exact moment of punch extension, precisely when detection matters most for both pose estimation and depth sensing.
Partial body visible: the camera sees only the upper body (shoulders to head), not the full body or footwork.
Edge deployment: all inference must run on an NVIDIA Jetson Orin NX (16 GB) at 30fps, with no cloud offload and no external GPU.

The Pose Estimation Paradox

The natural approach to recognising body movements is pose estimation. But off-the-shelf models like YOLO Pose are trained on COCO (a dataset where hands are bare and clearly visible). Boxing gloves remove everything the model expects to see:

What YOLO expects
Bare skin, fingers, fine joint contours. Wrists are always visible.
What boxing gives it
Thick padded gloves burying the wrists. Nothing recognisable to lock onto at punch extension.

The skeleton breaks or jumps at exactly the wrong moment: during punch extension, when the discriminative motion is happening. Pose estimation alone peaked at ~67% accuracy in early iterations.

Fix 1: Confidence gating
YOLO reports a confidence for each joint. When confidence drops (as it does during extension), the model automatically scales down the pose signal and leans on the voxel branch instead. Without this, broken skeleton data would actively mislead the network.
Fix 2: the key insight is that pose fails during extension, not preparation
In the first 100–200 ms before the arm extends, the joints are still visible and they reliably reveal which hand is moving, information voxels alone cannot provide. The model gets the critical left/right call early from pose, then smoothly transitions to voxel-only reasoning as the punch extends. This is the core idea behind the dual-branch fusion architecture.

The model classifies 8 actions: jab, cross, left hook, right hook, left uppercut, right uppercut, block, and idle (the neutral baseline that prevents false positives when the user is not punching).

5.3.1.1 Model Architecture

RI-1, RI-3

FusionVoxelPoseTransformerModel (~1.75M parameters)

The model combines two complementary ways of understanding a punch, then uses a Transformer to reason about how the motion unfolds over time.

Component Input → Output Params Role
Voxel Branch (Conv3D stem) 12³×2 motion channels → 192-dim 169,200 (9.7%) Captures where the body moved in 3D space
Pose Branch (encoder) 42 joint features → 64-dim 6,016 (0.3%) Captures which body part moved and how
Fusion projection 256-dim (concatenated) → 192-dim 49,728 (2.8%) Merges the two branches into a unified representation
Transformer encoder 12 frames (0.4 s) → 384-dim 1,483,776 (84.9%) Reasons over how the motion unfolds across time (4 layers, 8 heads, d=192)
Classifier head 384-dim summary → 8 action classes 38,504 (2.2%) Final prediction: jab, cross, hooks, uppercuts, block, or idle
Total 1,747,224 Small enough to run at ~8 ms per frame on the Jetson

Where the parameters live: ~85% sit in the Transformer encoder (the temporal reasoning core), with the rest split across the Voxel Conv3D stem, the small Pose encoder, and the classifier head. The Pose branch in particular is tiny (~6K params) because it only has to project 42 hand-crafted joint features into the fusion space; most of the heavy lifting is done in the Transformer.
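At a shape level, the data flow through fusion can be sketched as follows (random tensors and an untrained projection matrix stand in for the real learned layers):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 12                                        # frames per window (0.4 s at 30 fps)

voxel_feat = rng.standard_normal((T, 192))    # per-frame Conv3D stem output
pose_feat = rng.standard_normal((T, 64))      # per-frame pose encoder output

# Fusion projection: concatenate the branches (256-dim) and project to 192-dim.
W_fuse = rng.standard_normal((256, 192)) * 0.05
fused = np.concatenate([voxel_feat, pose_feat], axis=-1) @ W_fuse

# The Transformer encoder (4 layers, 8 heads, d=192) then reasons over the
# 12 fused frames; its pooled 384-dim summary feeds the 8-class classifier head.
```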

Voxel Branch, 3D Motion Sensing

What is a voxel?

A voxel is the 3D equivalent of a pixel. Just as a pixel is a tiny square that holds a colour value at one (x, y) location on a flat image, a voxel is a tiny cube that holds a value at one (x, y, z) location in 3D space.

Stack enough of them together and you get a 3D grid, the same way a 2D image is a grid of pixels. In this project, the depth camera turns the space in front of it into a 12×12×12 grid of voxels (each roughly 13 cm on a side), and each voxel records how much the user's body has moved through that little region of space between frames. That's how the model "sees" a punch in 3D.

With that out of the way: this branch divides the space around the user into the 12×12×12 voxel grid described above, creating a 3D snapshot of where the body is at each moment. To detect motion, the model compares these snapshots across time using two delta channels that record how occupancy changes between frames.

Together, these two channels tell the model both when a punch started and what shape the motion followed: a hook sweeps horizontally while an uppercut rises vertically. A 3D convolutional neural network (Conv3D) compresses this spatial information into a compact 192-dimensional vector per frame.

Design rationale

Grid resolution: 12³ voxels (3,456 input dims across the two channels), the lowest resolution that still captured punch motion with enough spatial detail for accurate classification while fitting within the Jetson's real-time budget.

Channel reduction: earlier iterations used 10 voxel channels (directional gradients dx/dy/dz plus velocity magnitude), but once pose was added these became redundant: pose already provides explicit directional information, so the voxel input was cut to just 2 delta channels with no accuracy loss.

Spatial conditioning: the grid is person-centred (anchored to the user's median position with exponential smoothing), depth-weighted so closer voxels carry a stronger signal, and background depth is removed using a 90th-percentile model built from the first 30 frames, so only the user's body motion remains.
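A sketch of the voxelisation step under these design choices. The grid and cell sizes follow the text; splitting the two delta channels into occupancy gained vs. lost is one plausible realisation, not confirmed by the source:

```python
import numpy as np

GRID = 12       # 12^3 voxels
CELL = 0.13     # metres per voxel side (~1.56 m cube around the user)

def voxelise(points: np.ndarray, centre: np.ndarray) -> np.ndarray:
    """Count body points per cell of a person-centred 12^3 occupancy grid."""
    idx = np.floor((points - centre) / CELL).astype(int) + GRID // 2
    idx = idx[np.all((idx >= 0) & (idx < GRID), axis=1)]   # drop out-of-grid points
    grid = np.zeros((GRID, GRID, GRID), dtype=np.float32)
    np.add.at(grid, tuple(idx.T), 1.0)                     # accumulate point counts
    return grid

def motion_channels(prev_grid: np.ndarray, curr_grid: np.ndarray) -> np.ndarray:
    """Two delta channels per voxel: occupancy gained vs. lost between frames."""
    delta = curr_grid - prev_grid
    return np.stack([np.maximum(delta, 0.0), np.maximum(-delta, 0.0)])
```

In the real pipeline the point cloud would come from the depth camera after background removal, and the centre from the smoothed median body position.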

Pose Branch, Body Joint Tracking

A YOLO26 Nano Pose model (yolo26n-pose, Jocher et al., 2025) detects 7 upper-body joints (nose, shoulders, elbows, wrists) from the RGB camera image at ~16 ms per frame on the Jetson (see landing page for why YOLO26 was chosen). From these joint positions, 42 features are computed: normalised coordinates, confidence scores, arm extension ratios, shoulder rotation, elbow bend angles, and joint velocities. The same YOLO model is used during both training and live inference to prevent any mismatch between the two.

As described in the Pose Estimation Paradox above, the pose signal is scaled by YOLO's joint confidence (confidence gating) so the model automatically falls back to voxel-only when pose becomes unreliable. This is reinforced during training by dropping the pose input entirely for 15% of clips and zeroing individual joints 10% of the time.
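The training-time pose dropout might look like this (the helper name and joint-major feature layout are assumptions; the 15% clip and 10% joint rates come from the text):

```python
import numpy as np

def pose_dropout(pose_feats: np.ndarray, rng: np.random.Generator,
                 clip_p: float = 0.15, joint_p: float = 0.10,
                 n_joints: int = 7) -> np.ndarray:
    """Randomly hide pose during training so the model learns to survive
    occlusion: drop the whole clip's pose with prob clip_p, otherwise zero
    each individual joint's features with prob joint_p."""
    if rng.random() < clip_p:
        return np.zeros_like(pose_feats)         # whole clip loses pose
    feats = pose_feats.copy()
    per_joint = feats.shape[-1] // n_joints      # assumes joint-major layout
    for j in range(n_joints):
        if rng.random() < joint_p:
            feats[..., j * per_joint:(j + 1) * per_joint] = 0.0
    return feats
```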

The diagram below shows how the two branches feed the fusion projection, then how the Transformer encoder reasons over a 12-frame window before the classifier head emits the final 8-class prediction.

5.3.1.2 Feature Exploration

RI-3

Development Journey

The final architecture emerged through 9 major iterations, each addressing specific failure modes of the previous approach.

# Approach Accuracy Key Outcome
1 2D Pose + LSTM Validated pipeline. Training data angles did not match robot-mounted camera.
2 3D Pose (RealSense depth lifting) ~67% Boxing gloves occlude wrist keypoints at the moment detection matters most.
3 Raw RGBD + 3D CNNs (Swin3D, 88M params) 87.2% Strong recognition accuracy, but model was far too large for edge hardware and required Colab to train.
4 RGBD + Segmentation + Causal Transformer Still pixel-based, still required Colab for training.
5 Depth Feature Engineering (Z-Mask grid) 2D depth grid lost 3D spatial info. Hooks vs uppercuts indistinguishable.
6 Voxel + Colour Tracking ~96% Strong offline accuracy, but inflated by a high camera mount that captured the whole body, an angle the deployed robot cannot replicate without raising the camera well above the chassis (enlarging the footprint and needing extra mechanical bracing for stability). Also required red/green gloves, and HSV tracking broke whenever similar colours appeared in the frame.
7 Voxel-Only Transformer (12³) 80.6% Left/right confusion: 19/56 jabs misclassified as cross (34% error).
8 Voxel + Pose Fusion 97.3% Breakthrough. Pose eliminated left/right confusion completely, and unlike iter 6, the model now works with any boxing gloves (or even bare hands), since YOLO Pose locates the wrist joint visually rather than tracking a specific colour. Full circle, pose returned.
9 Production (v11, Jetson) 96.8% Block annotation fix, TensorRT optimisation, parallel processing. Deployed.

Table: Development iterations, each built on specific failure modes of the previous

How Three Phases Converged

The 9 iterations followed three phases of exploration. Each phase failed on its own, but taught something the final architecture needed:

Pose (iter 1–2)
1. 2D Pose + LSTM ✗ wrong camera angle
2. 3D Pose (depth lifting, ~67%) ✗ glove occlusion blocks wrists
Lesson: skeleton joints carry hand identity, but pose alone is not reliable under occlusion. Pose returns later.

Raw Pixels (iter 3–4)
3. Raw RGBD + 3D CNN (87.2%) ✗ 88M params, too large for Jetson
4. RGBD + Segmentation ✗ still pixel-based, still required Colab
Lesson: raw pixels are too heavy for edge deployment; compact, engineered features are needed. A dead end that redirected the search.

Voxels (iter 5–7)
5. Z-Mask (2D depth grid) ✗ lost 3D spatial information
6. Voxel + Colour (~96%) ✗ impractical coloured gloves. Insight: voxels need hand info.
7. Voxel-Only (80.6%) ✗ left/right confusion (34% jab error). Insight: pose is needed for the left/right call.
Lesson: 3D motion is the right primary signal, but voxels alone cannot distinguish left from right hand.

Pose + Voxel (iter 8–9)
8. Voxel + Pose Fusion (97.3%): pose solves left/right confusion, voxels capture 3D motion. Full circle. Works with any boxing gloves, no coloured gloves required, since YOLO finds the wrist visually.
9. Production v11 (96.8%), deployed: TensorRT, block annotation fix, parallel threads.

Feature Comparison

Over 9 iterations, different feature representations were explored to find the best way to capture boxing motion from a depth camera. The videos below show the same footage processed through each method, from raw pixels (1) through skeleton-based approaches (2–3), cropped and segmented inputs (4–5), depth-only grids (6–7), to colour-based glove tracking (8). Each approach revealed specific limitations that guided the next iteration, ultimately leading to the voxel + pose fusion architecture. Full technical details for each iteration are documented in Appendix 7.

1. RGBD Baseline
2. 2D Skeleton
3. 3D Skeleton + RGBD
4. Voxel Delta
5. RGBD Bbox Crop
6. Segmentation + Depth
7. Z-Mask Grid
8. Colour Tracking

Inference Results Across Iterations

The videos below show two key earlier iterations side by side. They are doing different tasks (see the recognition vs. prediction distinction at the top of this page), so their accuracies are not directly comparable.

Iter. 3: RGBD Model
Offline recognition (labels after the punch finishes)
87.2% test accuracy
Too large for edge, required Google Colab (a cloud-hosted notebook with a free GPU) to train
Iter. 6: Voxel + Colour Tracking
Real-time prediction (classifies while punch is in progress)
~96% accuracy
Required coloured gloves, impractical to deploy
GradCAM Heatmap (RGBD model)
Brighter red = where the model's attention is concentrated. Confirms it learned arms and shoulders, not background.
Takeaway: deep networks could learn punch patterns (GradCAM proves it), but the RGBD model was too heavy for edge and the colour-tracking approach needed impractical gloves. The final architecture had to combine both strengths: deep-network accuracy + real-time, glove-free operation = voxel + pose fusion.

Final Model Inference (Deployed v11)

Person 1, newer boxer, less consistent form
Person 2, trained boxer, cleaner technique

Both participants were unseen during training. Person 1's looser form lowers confidence on ambiguous punches, while Person 2's cleaner technique lets the model commit to the prediction before the punch lands. Both clips also show the model wobbling during punch retraction (not in the training annotations), a limitation that 5.3.2's CV + IMU fusion cleans up via pad-constraint filtering.

5.3.1.3 Data Collection and Training

RI-3

Recording Infrastructure

A custom recording tool captures from an Intel RealSense D435i mounted at ~1.5m height, 1–2m from the user. Footage is recorded at 60fps (960×540 RGB + 848×480 depth) and then downsampled to 30fps for training. Capturing at 60fps keeps each frame sharp by reducing motion blur during fast punches, while sampling down to 30fps gives the model enough temporal information to learn the motion without doubling the data and compute cost: the best of both.

Depth technology: active stereo IR with built-in projector (works in low light, struggles in direct sunlight)
Depth resolution / FPS: up to 1280×720 @ 90fps (project records 848×480 @ 60fps, downsampled to 30fps for training)
RGB resolution / FPS: up to 1920×1080 @ 30fps (project records 960×540 @ 60fps to match depth alignment, downsampled to 30fps)
Depth range / accuracy: ~0.3–3m usable; ~±2% error at 2m, comfortably within the 1–2m boxing distance
Field of view (depth): 87° × 58°, wide enough to capture the full upper body at 1m
Built-in IMU: 6-axis (BMI055), not used in this project (Tegra kernel module unavailable on Jetson, see limitations); pad IMUs handle motion sensing instead
Interface: USB 3.0, single cable to Jetson
Critical pipeline fix: silent frame drops

Problem: early recordings silently dropped depth frames, ~15 fps actual versus 30 fps expected, because the laptop couldn't write float32 depth arrays to the external SSD fast enough. Every temporal voxel feature was being computed against corrupted timing.

Fix: a dedicated capture thread that runs independently of disk I/O, paired with an async writer thread feeding off a 300-frame queue buffer. Non-blocking enqueue plus frame-drop diagnostics caught any future regressions.

Cost: all earlier footage had to be re-collected and re-annotated from scratch.
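The fix can be sketched with stdlib threading primitives (the function names and fake frame source are illustrative; the real tool wraps a RealSense stream):

```python
import queue
import threading

frame_q = queue.Queue(maxsize=300)   # ~10 s of headroom at 30 fps
dropped = 0                          # frame-drop diagnostic, never silent

def capture_loop(get_frame, running: threading.Event):
    """Grab frames at camera rate; never block on disk I/O."""
    global dropped
    while running.is_set():
        frame = get_frame()
        try:
            frame_q.put_nowait(frame)    # non-blocking enqueue
        except queue.Full:
            dropped += 1                 # counted, so regressions are visible

def writer_loop(write_frame, running: threading.Event):
    """Drain the queue to disk independently of capture timing."""
    while running.is_set() or not frame_q.empty():
        try:
            write_frame(frame_q.get(timeout=0.1))
        except queue.Empty:
            pass
```

Because the capture thread only ever does a non-blocking enqueue, a slow SSD now shows up as a counted drop rather than silently corrupting frame timing.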

The custom recording tool below wraps all of this (the dual-thread capture/writer pipeline, the diagnostics, and a few common camera presets) behind a single one-click interface, so a recording session can start without any command-line setup.

Annotation

A custom timeline-based annotation tool (annotate_punches_gui.py) with drag-to-resize segments and auto-save every 30 seconds. Punch segments span from the first preparatory cue (shoulder drop, hip rotation) to full extension, retraction is excluded because it looks similar across punch types.

Almost all of the dataset was annotated manually by me, thousands of segments across the project. Every time something changed (camera angle, recording setup, labelling scheme), days of footage had to be re-recorded and re-annotated from scratch. The annotation GUI (keyboard-driven labelling, drag-to-resize, segment shortcuts, auto-save) was what made this re-annotation loop survivable at the iteration pace the project demanded.

Block annotation fix

Problem: block annotations originally covered only the onset (arms rising into guard). This caused two live failures: the rising motion looked like an uppercut, and once the user held the guard the model had never seen that pose, so predictions drifted to idle.

Fix: annotate the full guard sequence (rise, hold, and return) so the model learns every phase of a block.

Result: validation accuracy stayed effectively the same (v10: 97.3%, v11: 96.8%), but live block detection improved dramatically. The gain came entirely from the annotation change, not the model. Block remains the lowest class at 66.7% validation accuracy because it is the smallest class, but its real-world reliability is significantly higher.

The screenshot below shows the annotation GUI in action: each coloured bar on the timeline is one labelled punch segment, with start and end frames draggable to refine the boundaries.

Dataset

The dataset uses a leave-one-person-out validation strategy, the model is trained on data from two people and validated on a completely unseen third person. This directly tests whether the model generalises to new users (the real deployment requirement) rather than memorising individual movement patterns.

Split Segments Clips Notes
Training 808 7 2 people with consistent technique
Validation 221 1 Unseen person (leave-one-person-out)
Excluded 221 3 1 person removed, inconsistent form (see below)

8 action classes across all splits. Class imbalance (block is smallest) mitigated with class-balanced sampling (each epoch sees roughly equal numbers of every class) and focal loss (γ=1.5, a loss function that down-weights easy examples so the model focuses on rare/hard ones).
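Focal loss in isolation, to show the down-weighting at work (a NumPy sketch over softmax probabilities; the project applies it with γ=1.5 alongside label smoothing):

```python
import numpy as np

def focal_loss(probs: np.ndarray, targets: np.ndarray,
               gamma: float = 1.5, eps: float = 1e-8) -> float:
    """Mean of (1 - p_t)^gamma * -log(p_t): confident easy examples
    contribute almost nothing, hard or rare ones dominate the gradient."""
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean((1.0 - p_t) ** gamma * -np.log(p_t + eps)))

easy = np.array([[0.98, 0.01, 0.01]])   # confident, correct: near-zero loss
hard = np.array([[0.40, 0.35, 0.25]])   # uncertain: large loss
y = np.array([0])
```

With γ=0 this reduces to plain cross-entropy; raising γ pushes training effort toward the rare classes such as block.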

Small dataset, big constraint. Although more people contributed footage, only three of us had enough boxing experience to produce consistent technique, so the usable dataset came down to three subjects. With so few, each person's individual quirks became an outsized signal; the leave-one-person-out split left just a single unseen subject for validation; and rare classes like block ended up with only a handful of segments.

That the model still reaches 97.3% validation accuracy under these conditions is what makes the approach interesting; it also points to clear headroom. A larger, more diverse dataset from experienced boxers with cleaner technique would unlock better generalisation, more reliable rare-class detection, and room for deeper architectures the current data cannot support.

Key finding, data quality over architecture

Bottleneck: 6 different architectures were tested, and none exceeded 92.8% validation accuracy. Hyperparameter tuning, augmentation strength, deeper transformers: none of it moved the needle.

Insight: the bottleneck wasn't the model. One contributor's inconsistent technique was poisoning training and validation alike. Removing their clips and re-validating on a different unseen person was the only thing that hadn't been tried.

Result: accuracy jumped from 92.8% to 97.3% (+4.5%), larger than any architecture change across all 6 models, and the single biggest accuracy gain in the entire project.

The chart below puts the +4.5% data curation jump next to the deltas from every architecture experiment for direct comparison; the data fix is visibly larger than the entire architecture-search effort combined.

Training

Parameter Value What it does
Epochs 300 (early stop patience 60) How many full passes over the training data the model is allowed; training stops automatically if validation accuracy hasn't improved in 60 epochs to avoid wasted compute and overfitting.
Batch size 16 Number of training samples processed together before updating the model weights, small enough to fit in GPU memory, large enough to give a stable gradient signal.
Optimiser AdamW (LR 0.0005, weight decay 0.01) The algorithm that adjusts the model weights each step. AdamW is a strong default for Transformers; weight decay gently shrinks unused weights to prevent overfitting.
LR schedule Warmup 20 epochs → cosine decay to 1e-6 How the learning rate changes over time. Starts small (warmup) so early steps don't destabilise the model, then smoothly decays so later steps make finer and finer adjustments.
Loss Focal loss (γ=1.5) + label smoothing (0.03) The function the model is trying to minimise. Focal loss down-weights easy examples so training focuses on hard/rare classes; label smoothing makes the model less overconfident, which improves generalisation.
Best epoch 73 / 133 trained The epoch at which validation accuracy peaked. The model trained for 133 epochs total but the checkpoint from epoch 73 was the one saved and deployed.
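The warmup-then-cosine schedule from the table can be written as a single function (hypothetical name; the values match the table):

```python
import math

def lr_at(epoch: int, base_lr: float = 5e-4, warmup: int = 20,
          total: int = 300, floor: float = 1e-6) -> float:
    """Linear warmup for `warmup` epochs, then cosine decay to `floor`."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup        # ramp up gently
    t = (epoch - warmup) / max(1, total - warmup)    # 0 -> 1 over decay phase
    return floor + 0.5 * (base_lr - floor) * (1.0 + math.cos(math.pi * t))
```

Early steps use a small learning rate so the randomly initialised Transformer isn't destabilised; late steps shrink smoothly toward 1e-6 for fine adjustments.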

Key augmentation: horizontal flip (50%) exploits boxing's left/right symmetry by flipping the voxel X-axis, swapping left/right pose features, and swapping labels (jab↔cross, left_hook↔right_hook, left_uppercut↔right_uppercut), effectively doubling the dataset. Other augmentations: speed (0.8–1.2×), voxel shift/cutout/stretch, multi-window (9 and 12 frames).
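The flip augmentation can be sketched like this (the voxel axis order is an assumption; the real pipeline also swaps the left/right pose feature columns, omitted here):

```python
import numpy as np

# Labels swapped under a horizontal flip; symmetric classes map to themselves.
FLIP_LABEL = {"jab": "cross", "cross": "jab",
              "left_hook": "right_hook", "right_hook": "left_hook",
              "left_uppercut": "right_uppercut",
              "right_uppercut": "left_uppercut",
              "block": "block", "idle": "idle"}

def hflip_sample(voxels: np.ndarray, label: str):
    """Mirror a training sample: reverse the voxel grid along X (assumed to
    be the third-from-last axis) and swap the handedness of the label."""
    return voxels[..., ::-1, :, :], FLIP_LABEL[label]
```

Applying the flip twice returns the original sample, which is what makes it a safe, label-consistent way to double the dataset.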

The end-to-end pipeline below shows how raw recordings move through annotation, feature extraction (voxel + pose), training on the workstation, and finally TensorRT deployment to the Jetson.

Architecture Search

Before finalising the model configuration, 8 variants were tested systematically to find the optimal balance of architecture, augmentation, and attention mode:

Variant Architecture Key Change Accuracy
v1 Causal, d=192 Heavy augmentation (all maxed) 86.0%
v2 Causal, d=192 Moderate augmentation 91.0%
v3 Causal, d=192 No horizontal flip 90.1%
v4 Causal, d=128, 2 layers Smaller model + mixup 88.7%
v5 Bidirectional, d=192 Non-causal attention 92.8%
v6 Causal, d=192 Temporal stats features 88.2%
v7 Causal, d=192 Explicit pose velocity 87.3%
v8 Conv1D temporal Replace attention with convolution 90.9%

Table: Architecture search, all variants tested before data curation

Key findings: heavy augmentation causes underfitting (v1 vs v2), smaller models underfit (v4), and Transformer attention outperforms Conv1D temporal encoding (v8). These results preceded the +4.5% data curation improvement, which pushed the final model to 97.3%.

5.3.1.4 Ablation Study

RI-3
Mode Validation Accuracy What It Captures
Pose only 44.1% Joint positions but no spatial/3D motion context
Voxel only 80.6% 3D motion trajectories but cannot distinguish left from right
Both fused 97.3% Complete picture, spatial motion + joint-level discrimination

The branches provide genuinely complementary information. Without pose, 34% of jabs were misclassified as cross because both produce identical "straight forward" voxel patterns, only wrist positions from the pose stream reveal which hand is punching. The voxel branch in turn captures 3D arc trajectories that distinguish hooks from uppercuts, which 2D pose alone cannot represent. Fusion eliminated left/right confusion completely, yielding a +16.7% gain over voxel-only.

5.3.1.5 Secondary CV Functions

RI-1, RI-2, RI-8

The same YOLO pose output and depth bounding box that feed the action model are reused by two lighter-weight functions; neither runs an independent perception stack, so they add negligible compute on top of what the action model already costs.

Both functions read directly off the bounding box and 7 upper-body keypoints shown below: the same per-frame YOLO output the action model already consumes.