The CV subsystem delivers three functions: the action prediction model (core), reaction time detection, and human tracking. The latter two consume processed frames from the action prediction pipeline rather than running independent perception stacks.
The Problem: Predicting Punches Before They Land
All three functions above depend on one thing: the CV model must know what the user is doing before the action finishes. Most action recognition research classifies actions after they are fully performed. This project requires action prediction: identifying the punch type while the motion is still in progress, early enough for the robot to counter-punch. A typical jab completes in under 300 ms.
This means the model must handle punches at very different speeds: a fast jab and a slow hook look nothing alike at any given frame. Three mechanisms address this:
- Depth-based motion detection — the voxel branch uses the depth camera to sense the hand moving forward in 3D, which is the earliest physical signal that a punch is starting, regardless of speed.
- Dual-speed temporal channels — the voxel grid compares snapshots at two timescales: a fast channel (2-frame / ~67 ms) that catches the sharp onset and a slow channel (8-frame / ~267 ms) that captures the full arc. Together they cover both quick jabs and slow hooks.
- Causal Transformer + multi-window training — a causal Transformer (Vaswani et al., 2017) lets the attention mechanism dynamically weigh recent vs. earlier frames instead of relying on a fixed window. During training, the model sees windows of both 9 and 12 frames, forcing it to generalise across different punch durations.
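The dual-speed temporal channels above reduce to simple frame differences over a rolling buffer of voxel occupancy grids. This is a minimal sketch, assuming per-frame 12³ occupancy arrays; the function name and the clamping of lags at the clip start are illustrative, not the project's actual code:

```python
import numpy as np

def motion_channels(occupancy, t, fast_lag=2, slow_lag=8):
    """Build the two temporal-difference channels for frame t.

    occupancy: (T, 12, 12, 12) array of per-frame voxel occupancy.
    Returns a (2, 12, 12, 12) array: [fast delta, slow delta].
    Lags are clamped at the clip start so early frames still yield output.
    """
    fast = np.abs(occupancy[t] - occupancy[max(t - fast_lag, 0)])
    slow = np.abs(occupancy[t] - occupancy[max(t - slow_lag, 0)])
    return np.stack([fast, slow])
```

At 30 fps, a 2-frame lag spans ~67 ms and an 8-frame lag ~267 ms, matching the fast and slow channels described above.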
Physical Constraints
The camera setup introduces additional challenges not found in typical action recognition datasets:
| Constraint | Description |
|---|---|
| Single forward-facing camera | One Intel RealSense D435i (RGB + depth) mounted on the robot, facing the user. No multi-camera triangulation; the system must work from a single viewpoint. |
| Punches toward camera | Unlike side-view datasets, punches travel directly toward the camera. This causes heavy depth foreshortening and makes some punches appear identical from this angle. |
| Glove occlusion | Boxing gloves block the wrist joints at the exact moment of punch extension, precisely when detection matters most for both pose estimation and depth sensing. |
| Partial body visible | The camera sees only the upper body (shoulders to head), not the full body or footwork. |
| Edge deployment | All inference must run on an NVIDIA Jetson Orin NX (16 GB) at 30 fps; no cloud, no external GPU. |
The Pose Estimation Paradox
The natural approach to recognising body movements is pose estimation. But off-the-shelf models like YOLO Pose are trained on COCO (a dataset where hands are bare and clearly visible). Boxing gloves remove everything the model expects to see:
The skeleton breaks or jumps at exactly the wrong moment: during punch extension, when the discriminative motion is happening. Pose estimation alone peaked at ~67% accuracy in early iterations.
- Fix 1: Confidence gating
- YOLO reports a confidence for each joint. When confidence drops (as it does during extension), the model automatically scales down the pose signal and leans on the voxel branch instead. Without this, broken skeleton data would actively mislead the network.
- Fix 2: Pose fails during extension, not preparation (the key insight)
- In the first 100–200 ms before the arm extends, the joints are still visible and they reliably reveal which hand is moving, information voxels alone cannot provide. The model gets the critical left/right call early from pose, then smoothly transitions to voxel-only reasoning as the punch extends. This is the core idea behind the dual-branch fusion architecture.
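Confidence gating (Fix 1) can be sketched as a per-joint scaling of the pose features. The `floor` cutoff and the hard-zeroing below it are illustrative assumptions, not the model's exact gating rule:

```python
import numpy as np

def gate_pose_features(pose_feats, joint_conf, floor=0.2):
    """Scale pose features by per-joint detection confidence.

    pose_feats: (J, F) per-joint feature rows; joint_conf: (J,) in [0, 1].
    When a joint's confidence collapses (e.g. a gloved wrist at full
    extension), its features shrink toward zero, so downstream fusion
    leans on the voxel branch instead of a broken skeleton.
    """
    gate = np.clip(joint_conf, 0.0, 1.0)
    gate = np.where(gate < floor, 0.0, gate)  # hard-zero hopeless joints
    return pose_feats * gate[:, None]
```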
The model classifies 8 actions: jab, cross, left hook, right hook, left uppercut, right uppercut, block, and idle (the neutral baseline that prevents false positives when the user is not punching).
5.3.1.1 Model Architecture
FusionVoxelPoseTransformerModel (~1.75M parameters)
The model combines two complementary ways of understanding a punch, then uses a Transformer to reason about how the motion unfolds over time.
| Component | Input → Output | Params | Role |
|---|---|---|---|
| Voxel Branch (Conv3D stem) | 12³×2 motion channels → 192-dim | 169,200 (9.7%) | Captures where the body moved in 3D space |
| Pose Branch (encoder) | 42 joint features → 64-dim | 6,016 (0.3%) | Captures which body part moved and how |
| Fusion projection | 256-dim (concatenated) → 192-dim | 49,728 (2.8%) | Merges the two branches into a unified representation |
| Transformer encoder | 12 frames (0.4 s) → 384-dim | 1,483,776 (84.9%) | Reasons over how the motion unfolds across time (4 layers, 8 heads, d=192) |
| Classifier head | 384-dim summary → 8 action classes | 38,504 (2.2%) | Final prediction: jab, cross, hooks, uppercuts, block, or idle |
| Total | – | 1,747,224 | Small enough to run at ~8 ms per frame on the Jetson |
Where the parameters live: ~85% sit in the Transformer encoder (the temporal reasoning core), with the rest split across the Voxel Conv3D stem, the small Pose encoder, and the classifier head. The Pose branch in particular is tiny (~6K params) because it only has to project 42 hand-crafted joint features into the fusion space; most of the heavy lifting is done in the Transformer.
Voxel Branch: 3D Motion Sensing
A voxel is the 3D equivalent of a pixel. Just as a pixel is a tiny square that holds a colour value at one (x, y) location on a flat image, a voxel is a tiny cube that holds a value at one (x, y, z) location in 3D space.
Stack enough of them together and you get a 3D grid, the same way a 2D image is a grid of pixels. In this project, the depth camera turns the space in front of it into a 12×12×12 grid of voxels (each roughly 13 cm on a side), and each voxel records how much the user's body has moved through that little region of space between frames. That's how the model "sees" a punch in 3D.
With that out of the way: this branch divides the space around the user into the 12×12×12 voxel grid described above, creating a 3D snapshot of where the body is at each moment. To detect motion, the model compares these snapshots across time using two channels:
- Fast channel (2-frame difference, ~67 ms): detects the sudden onset of a punch
- Slow channel (8-frame difference, ~267 ms): captures the full arc of the punch trajectory
Together, these two channels tell the model both when a punch started and what shape the motion followed: a hook sweeps horizontally while an uppercut rises vertically. A 3D convolutional neural network (Conv3D) compresses this spatial information into a compact 192-dimensional vector per frame.
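Turning depth points into a voxel snapshot is a straightforward quantisation. A minimal sketch, assuming the user's body points are already extracted from the depth frame; the grid origin handling and function name are illustrative:

```python
import numpy as np

GRID = 12    # 12 x 12 x 12 voxels
CELL = 0.13  # ~13 cm per voxel side, per the grid description above

def voxelise(points, origin):
    """Quantise 3D points (metres) into a 12^3 occupancy grid.

    points: (N, 3) user body points from the depth camera.
    origin: (3,) corner of the grid volume in camera coordinates.
    Points falling outside the grid are discarded.
    """
    grid = np.zeros((GRID, GRID, GRID), dtype=np.float32)
    idx = np.floor((points - origin) / CELL).astype(int)
    keep = np.all((idx >= 0) & (idx < GRID), axis=1)
    ix, iy, iz = idx[keep].T
    grid[ix, iy, iz] = 1.0
    return grid
```

Differencing consecutive grids then yields the fast and slow motion channels described above.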
Grid resolution: 12³ (3,456 dims), the lowest resolution that still captured punch motion with enough spatial detail for accurate classification while fitting within the Jetson's real-time budget.
Channel reduction: earlier iterations used 10 voxel channels (directional gradients dx/dy/dz + velocity magnitude), but once pose was added these became redundant; pose already provides explicit directional information, so the voxel representation was cut to just 2 delta channels with no accuracy loss.
Spatial conditioning: the grid is person-centred (anchored to the user's median position with exponential smoothing), depth-weighted so closer voxels carry a stronger signal, and background depth is removed using a 90th-percentile model built from the first 30 frames so only the user's body motion remains.
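The person-centred anchoring and background removal just described can be sketched as follows. The smoothing factor `alpha` and depth `margin` are illustrative values, not the project's calibrated constants:

```python
import numpy as np

class GridAnchor:
    """Person-centred grid anchor with exponential smoothing.

    Tracks the user's median 3D position so the voxel grid follows
    the body without jittering frame to frame.
    """
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.centre = None

    def update(self, body_points):
        med = np.median(body_points, axis=0)
        if self.centre is None:
            self.centre = med
        else:
            self.centre = (1 - self.alpha) * self.centre + self.alpha * med
        return self.centre

def background_mask(calib_frames, live_depth, margin=0.05):
    """Keep only pixels clearly in front of the static background.

    calib_frames: (30, H, W) depth frames from the first second; their
    90th-percentile depth per pixel models the background, as above.
    """
    bg = np.percentile(calib_frames, 90, axis=0)
    return live_depth < (bg - margin)
```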
Pose Branch: Body Joint Tracking
A YOLO26 Nano Pose model (yolo26n-pose, Jocher et al., 2025) detects 7 upper-body joints (nose, shoulders, elbows, wrists) from the RGB camera image at ~16 ms per frame on the Jetson (see the landing page for why YOLO26 was chosen). From these joint positions, 42 features are computed: normalised coordinates, confidence scores, arm extension ratios, shoulder rotation, elbow bend angles, and joint velocities. The same YOLO model is used during both training and live inference to prevent any mismatch between the two.
As described in the Pose Estimation Paradox above, the pose signal is scaled by YOLO's joint confidence (confidence gating) so the model automatically falls back to voxel-only when pose becomes unreliable. This is reinforced during training by dropping the pose input entirely for 15% of clips and zeroing individual joints 10% of the time.
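The training-time reinforcement (dropping pose for 15% of clips, zeroing individual joints 10% of the time) can be sketched like this. The 7-joints-by-6-features layout of the 42-dim vector is an assumption about feature ordering, used here only for illustration:

```python
import numpy as np

def augment_pose(pose_clip, rng, p_clip=0.15, p_joint=0.10, n_joints=7):
    """Pose-dropout augmentation that reinforces confidence gating.

    pose_clip: (T, 42) pose features for one clip. With probability
    p_clip the whole pose stream is zeroed; otherwise each joint's
    feature block is independently zeroed with probability p_joint,
    forcing the model to fall back on the voxel branch.
    """
    pose_clip = pose_clip.copy()
    if rng.random() < p_clip:
        return np.zeros_like(pose_clip)
    feats_per_joint = pose_clip.shape[1] // n_joints
    for j in range(n_joints):
        if rng.random() < p_joint:
            pose_clip[:, j * feats_per_joint:(j + 1) * feats_per_joint] = 0.0
    return pose_clip
```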
The diagram below shows how the two branches feed the fusion projection, then how the Transformer encoder reasons over a 12-frame window before the classifier head emits the final 8-class prediction.
5.3.1.2 Feature Exploration
Development Journey
The final architecture emerged through 9 major iterations, each addressing specific failure modes of the previous approach.
| # | Approach | Accuracy | Key Outcome |
|---|---|---|---|
| 1 | 2D Pose + LSTM | – | Validated pipeline. Training data angles did not match robot-mounted camera. |
| 2 | 3D Pose (RealSense depth lifting) | ~67% | Boxing gloves occlude wrist keypoints at the moment detection matters most. |
| 3 | Raw RGBD + 3D CNNs (Swin3D, 88M params) | 87.2% | Strong recognition accuracy, but model was far too large for edge hardware and required Colab to train. |
| 4 | RGBD + Segmentation + Causal Transformer | – | Still pixel-based, still required Colab for training. |
| 5 | Depth Feature Engineering (Z-Mask grid) | – | 2D depth grid lost 3D spatial info. Hooks vs uppercuts indistinguishable. |
| 6 | Voxel + Colour Tracking | ~96% | Strong offline accuracy, but inflated by a high camera mount that captured the whole body, an angle the deployed robot cannot replicate without raising the camera well above the chassis (enlarging the footprint and needing extra mechanical bracing for stability). Also required red/green gloves, and HSV tracking broke whenever similar colours appeared in the frame. |
| 7 | Voxel-Only Transformer (12³) | 80.6% | Left/right confusion: 19/56 jabs misclassified as cross (34% error). |
| 8 | Voxel + Pose Fusion | 97.3% | Breakthrough. Pose eliminated left/right confusion completely, and unlike iter 6, the model now works with any boxing gloves (or even bare hands), since YOLO Pose locates the wrist joint visually rather than tracking a specific colour. Full circle, pose returned. |
| 9 | Production (v11, Jetson) | 96.8% | Block annotation fix, TensorRT optimisation, parallel processing. Deployed. |
Table: Development iterations, each built on specific failure modes of the previous
How Three Phases Converged
The 9 iterations followed three phases of exploration. Each phase failed on its own, but taught something the final architecture needed:
- Pose-based (iterations 1–2): ✗ wrong camera angle; ✗ glove occlusion blocks wrists.
- Pixel-based (iterations 3–4): ✗ 88M params, too large for Jetson; ✗ still pixel-based, still required Colab.
- Depth/voxel (iterations 5–7): ✗ lost 3D spatial information; ✗ impractical coloured gloves; ✗ left/right confusion (34% jab error). ★ Insights: voxels need hand info, so pose had to return for the left/right call.
- Fusion (iterations 8–9): pose solves left/right confusion, voxels capture 3D motion, full circle. Works with any boxing gloves (no coloured gloves required), since YOLO finds the wrist visually. Production hardening: TensorRT, block annotation fix, parallel threads.
Feature Comparison
Over 9 iterations, different feature representations were explored to find the best way to capture boxing motion from a depth camera. The videos below show the same footage processed through each method, from raw pixels (1) through skeleton-based approaches (2–3), cropped and segmented inputs (4–5), depth-only grids (6–7), to colour-based glove tracking (8). Each approach revealed specific limitations that guided the next iteration, ultimately leading to the voxel + pose fusion architecture. Full technical details for each iteration are documented in Appendix 7.
Inference Results Across Iterations
The videos below show two key earlier iterations side by side. They are doing different tasks (see the recognition vs. prediction distinction at the top of this page), so their accuracies are not directly comparable.
Final Model Inference (Deployed v11)
Both participants were unseen during training. Person 1's looser form lowers confidence on ambiguous punches, while Person 2's cleaner technique lets the model commit to the prediction before the punch lands. Both clips also show the model wobbling during punch retraction (not in the training annotations), a limitation that 5.3.2's CV + IMU fusion cleans up via pad-constraint filtering.
5.3.1.3 Data Collection and Training
Recording Infrastructure
A custom recording tool captures from an Intel RealSense D435i mounted at ~1.5 m height, 1–2 m from the user. Footage is recorded at 60 fps (960×540 RGB + 848×480 depth) and then downsampled to 30 fps for training. Capturing at 60 fps keeps each frame sharp by reducing motion blur during fast punches, while downsampling to 30 fps gives the model enough temporal information to learn the motion without doubling the data and compute cost: the best of both.
| Camera spec | Detail |
|---|---|
| Depth technology | Active stereo IR with built-in projector (works in low light, struggles in direct sunlight) |
| Depth resolution / FPS | Up to 1280×720 @ 90fps (project records 848×480 @ 60fps, downsampled to 30fps for training) |
| RGB resolution / FPS | Up to 1920×1080 @ 30fps (project records 960×540 @ 60fps to match depth alignment, downsampled to 30fps) |
| Depth range / accuracy | ~0.3–3m usable; ~±2% error at 2m, comfortably within the 1–2m boxing distance |
| Field of view (depth) | 87° × 58°, wide enough to capture the full upper body at 1m |
| Built-in IMU | 6-axis (BMI055), not used in this project (Tegra kernel module unavailable on Jetson, see limitations); pad IMUs handle motion sensing instead |
| Interface | USB 3.0, single cable to Jetson |
Problem: early recordings silently dropped depth frames (~15 fps actual versus 30 fps expected) because the laptop couldn't write float32 depth arrays to the external SSD fast enough. Every temporal voxel feature was being computed against corrupted timing.
Fix: a dedicated capture thread that runs independently of disk I/O, paired with an async writer thread feeding off a 300-frame queue buffer. Non-blocking enqueue plus frame-drop diagnostics caught any future regressions.
Cost: all earlier footage had to be re-collected and re-annotated from scratch.
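The fix's structure (capture thread, async writer, bounded queue, drop counter) can be sketched in a few lines. `write_fn` stands in for the real float32 depth writer, and the class shape is illustrative rather than the tool's actual code:

```python
import queue
import threading

def make_writer(frame_queue, write_fn, stop):
    """Writer thread body: drains the queue and writes to disk."""
    def run():
        while not (stop.is_set() and frame_queue.empty()):
            try:
                frame = frame_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            write_fn(frame)
    return run

class CapturePipeline:
    """Non-blocking enqueue with a drop counter, per the fix above.

    The 300-frame queue depth matches the text.
    """
    def __init__(self, write_fn, depth=300):
        self.q = queue.Queue(maxsize=depth)
        self.dropped = 0
        self.stop = threading.Event()
        self.writer = threading.Thread(
            target=make_writer(self.q, write_fn, self.stop))
        self.writer.start()

    def enqueue(self, frame):
        try:
            self.q.put_nowait(frame)  # capture thread never blocks on disk
        except queue.Full:
            self.dropped += 1         # diagnostic: count, don't stall

    def close(self):
        self.stop.set()
        self.writer.join()
```

The key property is conservation: every captured frame is either written or counted as dropped, so silent frame loss can no longer hide.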
The custom recording tool below wraps all of this (the dual-thread capture/writer pipeline, the diagnostics, and a few common camera presets) behind a single one-click interface so a recording session can start without any command-line setup.
Annotation
A custom timeline-based annotation tool (annotate_punches_gui.py) with drag-to-resize segments and auto-save every 30 seconds. Punch segments span from the first preparatory cue (shoulder drop, hip rotation) to full extension; retraction is excluded because it looks similar across punch types.
Almost all of the dataset was annotated manually by me, thousands of segments across the project. Every time something changed (camera angle, recording setup, labelling scheme), days of footage had to be re-recorded and re-annotated from scratch. The annotation GUI (keyboard-driven labelling, drag-to-resize, segment shortcuts, auto-save) was what made this re-annotation loop survivable at the iteration pace the project demanded.
Problem: block annotations originally covered only the onset (arms rising into guard). This caused two live failures, the rising motion looked like an uppercut, and once the user held the guard the model had never seen that pose, so predictions drifted to idle.
Fix: annotate the full guard sequence (rise, hold, and return) so the model learns every phase of a block.
Result: validation accuracy stayed effectively the same (v10: 97.3%, v11: 96.8%), but live block detection improved dramatically. The gain came entirely from the annotation change, not the model. Block remains the lowest class at 66.7% validation accuracy because it is the smallest class, but its real-world reliability is significantly higher.
The screenshot below shows the annotation GUI in action: each coloured bar on the timeline is one labelled punch segment, with start and end frames draggable to refine the boundaries.
Dataset
The dataset uses a leave-one-person-out validation strategy: the model is trained on data from two people and validated on a completely unseen third person. This directly tests whether the model generalises to new users (the real deployment requirement) rather than memorising individual movement patterns.
| Split | Segments | Clips | Notes |
|---|---|---|---|
| Training | 808 | 7 | 2 people with consistent technique |
| Validation | 221 | 1 | Unseen person (leave-one-person-out) |
| Excluded | 221 | 3 | 1 person removed, inconsistent form (see below) |
8 action classes across all splits. Class imbalance (block is smallest) mitigated with class-balanced sampling (each epoch sees roughly equal numbers of every class) and focal loss (γ=1.5, a loss function that down-weights easy examples so the model focuses on rare/hard ones).
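The focal loss with label smoothing mentioned above can be written out explicitly. This is one plausible formulation under the stated γ=1.5 and smoothing 0.03; the project's exact loss may differ in details such as per-class alpha weighting:

```python
import numpy as np

def focal_loss(probs, labels, gamma=1.5, smoothing=0.03, n_classes=8):
    """Focal loss with label smoothing over the 8 action classes.

    probs: (N, 8) softmax outputs; labels: (N,) class indices.
    The (1 - p)^gamma factor down-weights well-classified examples so
    rare/hard classes such as block dominate the gradient.
    """
    eps = 1e-8
    one_hot = np.eye(n_classes)[labels]
    target = one_hot * (1 - smoothing) + smoothing / n_classes
    focal = (1 - probs) ** gamma
    return -(target * focal * np.log(probs + eps)).sum(axis=1).mean()
```

An easy example (high probability on the true class) contributes far less loss than a hard one, which is the intended behaviour.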
Small dataset, big constraint. Although more people contributed footage, only three of us had enough boxing experience to produce consistent technique, so the usable dataset came down to three subjects. With so few, each person's individual quirks became an outsized signal, the leave-one-person-out split left just a single unseen subject for validation, and rare classes like block ended up with only a handful of segments.
That the model still reaches 97.3% validation accuracy under these conditions is what makes the approach interesting; it also points to clear headroom. A larger, more diverse dataset from experienced boxers with cleaner technique would unlock better generalisation, more reliable rare-class detection, and room for deeper architectures the current data cannot support.
Bottleneck: 6 different architectures were tested and all plateaued at around 92.8% validation accuracy. Hyperparameter tuning, augmentation strength, deeper Transformers: none of it moved the needle.
Insight: the bottleneck wasn't the model. One contributor's inconsistent technique was poisoning training and validation alike. Removing their clips and re-validating on a different unseen person was the only thing that hadn't been tried.
Result: accuracy jumped from 92.8% to 97.3% (+4.5%), larger than any architecture change across all 6 models, and the single biggest accuracy gain in the entire project.
The chart below puts the +4.5% data curation jump next to the deltas from every architecture experiment for direct comparison; the data fix is visibly larger than the entire architecture-search effort combined.
Training
| Parameter | Value | What it does |
|---|---|---|
| Epochs | 300 (early stop patience 60) | How many full passes over the training data the model is allowed; training stops automatically if validation accuracy hasn't improved in 60 epochs to avoid wasted compute and overfitting. |
| Batch size | 16 | Number of training samples processed together before updating the model weights, small enough to fit in GPU memory, large enough to give a stable gradient signal. |
| Optimiser | AdamW (LR 0.0005, weight decay 0.01) | The algorithm that adjusts the model weights each step. AdamW is a strong default for Transformers; weight decay gently shrinks unused weights to prevent overfitting. |
| LR schedule | Warmup 20 epochs → cosine decay to 1e-6 | How the learning rate changes over time. Starts small (warmup) so early steps don't destabilise the model, then smoothly decays so later steps make finer and finer adjustments. |
| Loss | Focal loss (γ=1.5) + label smoothing (0.03) | The function the model is trying to minimise. Focal loss down-weights easy examples so training focuses on hard/rare classes; label smoothing makes the model less overconfident, which improves generalisation. |
| Best epoch | 73 / 133 trained | The epoch at which validation accuracy peaked. The model trained for 133 epochs total but the checkpoint from epoch 73 was the one saved and deployed. |
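The warmup-then-cosine schedule in the table can be expressed as a small function of the epoch number. A sketch under the table's stated values (LR 0.0005, 20-epoch warmup, decay to 1e-6); the linear warmup shape is an assumption:

```python
import math

def lr_at(epoch, base_lr=5e-4, warmup=20, total=300, floor=1e-6):
    """Learning rate for a given epoch: linear warmup, then cosine decay.

    Ramps linearly to base_lr over `warmup` epochs, then follows a
    cosine curve from base_lr down to `floor` by the final epoch.
    """
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / max(total - warmup, 1)
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * progress))
```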
Key augmentation: horizontal flip (50%). It exploits boxing's left/right symmetry by flipping the voxel X-axis, swapping left/right pose features, and swapping labels (jab↔cross, left_hook↔right_hook, left_uppercut↔right_uppercut), effectively doubling the dataset. Other augmentations: speed (0.8–1.2×), voxel shift/cutout/stretch, multi-window (9, 12 frames).
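The flip's label bookkeeping is the subtle part: mirroring the space turns a jab into a cross. A minimal sketch; which grid axis is X (here axis 1) is an assumption, and the matching pose-feature column swap is omitted for brevity:

```python
import numpy as np

# Left/right label pairs from the augmentation description above;
# symmetric classes (block, idle) map to themselves.
SWAP = {
    "jab": "cross", "cross": "jab",
    "left_hook": "right_hook", "right_hook": "left_hook",
    "left_uppercut": "right_uppercut", "right_uppercut": "left_uppercut",
    "block": "block", "idle": "idle",
}

def hflip(voxel_clip, label, x_axis=1):
    """Mirror one sample: flip the voxel X-axis and swap the label."""
    return np.flip(voxel_clip, axis=x_axis).copy(), SWAP[label]
```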
The end-to-end pipeline below shows how raw recordings move through annotation, feature extraction (voxel + pose), training on the workstation, and finally TensorRT deployment to the Jetson.
Architecture Search
Before finalising the model configuration, 8 variants were tested systematically to find the optimal balance of architecture, augmentation, and attention mode:
| Variant | Architecture | Key Change | Accuracy |
|---|---|---|---|
| v1 | Causal, d=192 | Heavy augmentation (all maxed) | 86.0% |
| v2 | Causal, d=192 | Moderate augmentation | 91.0% |
| v3 | Causal, d=192 | No horizontal flip | 90.1% |
| v4 | Causal, d=128, 2 layers | Smaller model + mixup | 88.7% |
| v5 | Bidirectional, d=192 | Non-causal attention | 92.8% |
| v6 | Causal, d=192 | Temporal stats features | 88.2% |
| v7 | Causal, d=192 | Explicit pose velocity | 87.3% |
| v8 | Conv1D temporal | Replace attention with convolution | 90.9% |
Table: Architecture search, all variants tested before data curation
Key findings: heavy augmentation causes underfitting (v1 vs v2), smaller models underfit (v4), and Transformer attention outperforms Conv1D temporal encoding (v8). These results preceded the +4.5% data curation improvement, which pushed the final model to 97.3%.
5.3.1.4 Ablation Study
| Mode | Validation Accuracy | What It Captures |
|---|---|---|
| Pose only | 44.1% | Joint positions but no spatial/3D motion context |
| Voxel only | 80.6% | 3D motion trajectories but cannot distinguish left from right |
| Both fused | 97.3% | Complete picture, spatial motion + joint-level discrimination |
The branches provide genuinely complementary information. Without pose, 34% of jabs were misclassified as cross because both produce identical "straight forward" voxel patterns, only wrist positions from the pose stream reveal which hand is punching. The voxel branch in turn captures 3D arc trajectories that distinguish hooks from uppercuts, which 2D pose alone cannot represent. Fusion eliminated left/right confusion completely, yielding a +16.7% gain over voxel-only.
5.3.1.5 Secondary CV Functions
The same YOLO pose output and depth bounding box that feed the action model are reused by two lighter-weight functions; neither runs an independent perception stack, so they add negligible compute on top of what the action model already costs.
- Human Tracking: bounding box centre, width, depth, and lateral/depth displacement are published each frame as a UserTracking message. Downstream nodes use this to drive the yaw motor so the robot continuously faces the user as they move laterally, provide height data so the user can adjust the robot's height via the phone or GUI, and classify slips and dodges for defence detection. The yaw and height motor command pipelines were validated end-to-end using the Teensy simulator (5.3.4.6), where the robot status row displays the live motor commands being sent, confirming correct tracking behaviour without requiring physical hardware.
- Reaction Time Detection: depth-based proximity replaces the velocity thresholds that proved unreliable at the CDE Fair. When a stimulus fires, the system timestamps the moment the user's hand depth crosses a calibrated threshold toward the robot, giving millisecond-accurate reaction times.
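The reaction-time logic reduces to finding the first post-stimulus frame whose hand depth crosses the calibrated threshold. This frame-indexed sketch is illustrative (the real system timestamps frames directly, giving finer granularity than the 30 fps frame period):

```python
def reaction_time_ms(hand_depths, stimulus_idx, threshold, fps=30):
    """Depth-threshold reaction timing.

    hand_depths: per-frame hand distance to the camera (metres);
    stimulus_idx: frame at which the stimulus fired; threshold:
    calibrated depth the hand must cross toward the robot.
    Returns elapsed milliseconds, or None if the hand never crossed.
    """
    for i in range(stimulus_idx, len(hand_depths)):
        if hand_depths[i] < threshold:
            return (i - stimulus_idx) * 1000.0 / fps
    return None
```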
Both functions read directly off the bounding box and 7 upper-body keypoints shown below: the same per-frame YOLO output the action model already consumes.