
5.3 Robot Intelligence (Yogeeswaran)

Robot Intelligence is the AI brain that turns BoxBunny from a static striking target into an adaptive sparring partner. It sees what the boxer is doing in real time, decides how the robot should respond, drives the LLM coach that gives feedback, and surfaces every session as analytics on a phone dashboard.

At its core is a computer vision pipeline that classifies 8 boxing actions (6 punch types, block, and idle) using depth-based 3D voxel motion fused with 2D pose estimation. A ROS 2 workspace on an NVIDIA Jetson Orin NX orchestrates that perception output alongside IMU sensing, an adaptive sparring engine, and a local Gemma 4 E2B LLM coach, with everything running on-device, no cloud, no internet required.

Section structure
  • This page: requirements, hardware, ROS 2 system architecture
  • 5.3.1: Computer Vision & action prediction model
  • 5.3.2: Sensor fusion, sparring AI, LLM coach
  • 5.3.3: Dashboard backend & analytics
  • 5.3.4: Testing & verification of all 8 requirements

Where this subsystem sits in BoxBunny

BoxBunny has three sibling subsystems wired through ROS 2. Robot Intelligence sits in the middle, deciding what the robot does next.

  • 5.1 GUI (user input & live feedback): touchscreen on the robot; user inputs in, live session state out.
  • 5.3 Robot Intelligence (this section; the decision-making brain): CV, sensor fusion, sparring AI, LLM coach; sensor data to commands.
  • 5.2 Robot Mechanism (physical actuation & sensing): punch commands in, pad impact events out.

Together, the three fulfil the product needs from the user research: Intelligent Sparring System (5.3.1 + 5.3.2), Adaptive Fight Intelligence (5.3.2), Skill Progression Studio (5.3.2), and Performance Analytics (5.3.3).


Key Achievements

  • Action Prediction Model: 96.8% validation accuracy on an unseen person. Classifies 8 boxing actions at 30 fps on an NVIDIA Jetson. Built over 9 iterations as a dual-branch (voxel + pose) Transformer; the full story is in 5.3.1.
  • ROS 2 System: 10 ROS 2 nodes · 21 messages · 6 services. Orchestrates CV inference, IMU sensor fusion, 5 adaptive AI sparring styles with weakness tracking, and a local Gemma 4 E2B LLM coach running entirely on-device.
  • Dashboard & Analytics: Vue 3 + FastAPI. Phone dashboard for remote training control, performance trends, AI coach chat, and population benchmarking. A public-URL tunnel lets any phone connect from any network; no shared WiFi required.
  • Testing & Validation: 146 unit + 28 integration tests. Real-world deployment with public users at the CDE AI Fair, full model interpretability analysis, and all 8 requirements verified in 5.3.4.
GitHub Repositories: Action Recognition · Robot Workspace

Requirements and Design Considerations

This subsystem addresses DO-1 (Performance Analytics), DO-2 (Intelligent Sparring System), DO-3 (Skill Progression Studio), and DO-4 (Adaptive Fight Intelligence). See Section 5 for the full Design Objectives reference.

These requirements were derived from the Function Analysis, user journey, and the physical constraints of deploying real-time computer vision on an embedded platform during live boxing training.

ID | Requirement | Rationale
RI-1 | Classify 8 boxing actions in real time at >=30 FPS | Match human defensive reaction speed; voxel temporal features depend on consistent 30 fps frame intervals
RI-2 | End-to-end latency (camera to motor command) <=150 ms | Matches the average intermediate boxer's defensive reaction time
RI-3 | Classification accuracy >=90% on an unseen-person validation set | Reduce false triggers and incorrect counter-punches during live sparring
RI-4 | CV + IMU sensor fusion for confirmed punch events | Single-source detection is unreliable; fusion eliminates false positives from camera noise and accidental pad contacts
RI-5 | All inference on-device (NVIDIA Jetson Orin NX 16 GB), no cloud | Gym WiFi is unreliable; the latency budget prohibits a cloud round-trip; zero internet requirement
RI-6 | Multiple adaptive AI sparring styles with weakness tracking | Prevent user habituation; support varied training objectives
RI-7 | Real-time coaching feedback via local LLM | Personalised technique guidance without cloud dependency
RI-8 | Real-time person detection and user tracking for yaw and height motor control | Robot must face the user during training; the user adjusts height via phone or GUI before each session using CV-provided height data

Table: Robot Intelligence Requirements and Design Considerations

System Design Narrative

Following the V-Model from Section 3.2, RI-1 to RI-8 were fixed before any model architecture, ROS node, or LLM prompt was written. The diagram below shows how each requirement decomposed into a design decision (left arm), was built and integrated (base), and then verified (right arm). The table underneath maps each V-Model stage to the specific BoxBunny RI work.

Figure: Robot Intelligence Systems Engineering V-Model showing requirement decomposition (left arm: RI problem framing, system architecture, component design), the integration build (base: all 10 nodes wired, dashboard live, model deployed), and verification closure (right arm: component verification, integration testing, system validation) for RI-1 to RI-8.
Concept & Requirements
  Focus: fix measurable targets before any design work begins.
  Applied: RI-1 to RI-8 derived from product needs and on-device deployment constraints; each requirement drove a specific design choice downstream (see the following stages).
  Evidence: requirements table above, with a rationale per requirement.

High-Level Design
  Focus: partition the AI stack into independent ROS 2 nodes coordinated by a central state machine.
  Applied: 10-node architecture with 21 custom messages and 6 services (cv_node, imu_node, punch_processor, session_manager, drill_manager, sparring_engine, llm_node, analytics_node, free_training_engine, robot_node). The 5-state session machine is the single source of truth: all training modes flow through the same lifecycle.
  Evidence: node graph (below) + session state machine in 5.3.2.

Detailed Design & Build
  Focus: design, train, and deploy each AI component, each driven by a specific RI requirement.
  Applied:
    • RI-1/2: ~24 ms per-frame budget shaped the dual-branch Transformer + TensorRT FP16 deployment
    • RI-3: 9-iteration CV model with leave-one-person-out validation
    • RI-4: punch_processor CV+IMU fusion with pad-constraint filtering
    • RI-5: all inference on-device within the 16 GB shared-memory budget; adaptive frame rate (6 Hz idle / 30 Hz active) shares the GPU between CV and LLM
    • RI-6: sparring_engine with Markov-chain styles + a persistent weakness profile
    • RI-7: local Gemma 4 E2B LLM coach (multimodal, edge-optimised) with a 17-document boxing knowledge base + fallback tips
    • RI-8: cv_node publishes UserTracking each frame; the yaw motor faces the user, and the user adjusts height via phone or GUI
    • Vue 3 + FastAPI phone dashboard connected to a per-user SQLite database
  Evidence: per-component design narratives in 5.3.1, 5.3.2, 5.3.3.

Verification & Validation
  Focus: three-tier testing pyramid + real-world deployment to confirm RI-1 to RI-8.
  Applied:
    • Unit: 146 pytest tests (sensor fusion, gamification, database)
    • Integration: 28 cross-module tests (ROS messages, fusion end-to-end, motor protocol) + CDE AI Fair early prototype deployment
    • System: full robot test with real camera, IMU, and motors + model interpretability analysis
  Evidence: 5.3.4.7 verification summary, all 8 RI requirements pass.

Table: Robot Intelligence application of the V-Model from decomposition to verification
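The Markov-chain sparring styles with a weakness profile (RI-6) can be sketched as follows. This is a hypothetical illustration only: the style names, action set, transition probabilities, and the 0.5 blending weight are all invented for the example, not taken from the real sparring_engine.

```python
import random

# Hypothetical action set and transition rows; the real sparring_engine
# defines its own 5 styles and probabilities.
ACTIONS = ["jab", "cross", "hook", "feint"]

STYLES = {
    # key = previous action, value = probability of each next action (same
    # row reused per action here purely to keep the sketch short)
    "aggressive": {a: [0.4, 0.3, 0.2, 0.1] for a in ACTIONS},
    "counter":    {a: [0.2, 0.2, 0.2, 0.4] for a in ACTIONS},
}

def next_action(style, prev_action, weakness_profile, rng=random):
    """Sample the next robot action from the style's Markov chain,
    biased toward actions the user defends poorly (higher miss rate)."""
    probs = list(STYLES[style][prev_action])
    # Blend in the weakness profile: boost actions with a high miss rate.
    probs = [p + 0.5 * weakness_profile.get(a, 0.0)
             for a, p in zip(ACTIONS, probs)]
    total = sum(probs)
    probs = [p / total for p in probs]
    return rng.choices(ACTIONS, weights=probs, k=1)[0]

weaknesses = {"hook": 0.6}  # hypothetical: user misses 60% of hook defences
action = next_action("aggressive", "jab", weaknesses)
```

Because the chain conditions only on the previous action, each style stays cheap enough to run per-punch on the Jetson, while the weakness blend keeps the pattern from becoming predictable.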

Hardware & Tech Stack

Everything below runs on a single NVIDIA Jetson Orin NX, one board carrying perception, fusion, decision logic, the LLM coach, the dashboard backend, and the touchscreen GUI all at once. No external server, no cloud, no internet required.

Hardware: Jetson Orin NX 16 GB · Ubuntu 22.04 · CUDA 12.6 · RealSense D435i · Teensy 4.0 · 4× MPU6050
Models: Voxel-Pose Transformer (1.75M) · YOLO Pose nano · Gemma 4 E2B (GGUF Q4_K_M) · PyTorch → TensorRT FP16 · llama-cpp-python
Robotics: ROS 2 Humble · 10 nodes · 21 messages · 6 services
Backend: FastAPI + Uvicorn · SQLite (two-tier) · WebSocket · localhost.run tunnel
Frontends: PySide6 / Qt on a 10.1″ touchscreen · Vue 3 + Tailwind phone dashboard
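The per-user SQLite storage in the backend row above can be sketched with the standard library alone. The schema here (a single sessions table) is a hypothetical stand-in for the real two-tier layout described in 5.3.3.

```python
import sqlite3

def open_user_db(path):
    """Open (or create) a per-user session database.
    Schema is a hypothetical stand-in for BoxBunny's real two-tier layout."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sessions (
            id       INTEGER PRIMARY KEY,
            mode     TEXT NOT NULL,      -- e.g. 'drill', 'sparring'
            punches  INTEGER NOT NULL,
            accuracy REAL NOT NULL       -- fraction of clean hits, 0..1
        )""")
    return conn

db = open_user_db(":memory:")  # in-memory for the example; real DB is a file
db.execute("INSERT INTO sessions (mode, punches, accuracy) VALUES (?, ?, ?)",
           ("sparring", 120, 0.83))
row = db.execute("SELECT punches, accuracy FROM sessions").fetchone()
```

One file per user keeps each boxer's history isolated and makes population benchmarking a simple aggregation over the per-user files.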

Open-Source Licensing

If BoxBunny were deployed as a commercial product, the open-source licences in the stack would need to be addressed. Most of the stack is permissive (Apache 2.0, MIT, BSD), but two components carry stricter terms:

Component | Licence | Commercial implication
YOLO26 (Ultralytics) | AGPL-3.0 | Requires open-sourcing the full application if distributed, or purchasing an Ultralytics Enterprise licence (~US$5,000/year per developer, quote-based (Ultralytics, 2024))
PySide6 (Qt) | LGPL | OK if dynamically linked (the default), but users must be able to replace the Qt library
Everything else (Gemma 4, llama-cpp, ROS 2, FastAPI, Vue 3) | Apache 2.0 / MIT / BSD | Permissive, no issues
Why YOLO despite AGPL?
The constraint
The system's 33 ms per-frame budget leaves no room for a slower pose estimator. YOLO26 Nano Pose (yolo26n-pose, Jocher et al., 2025) is the fastest open-source option available. Its key breakthrough is NMS-free end-to-end inference: there is no post-processing step, so final detections come out in a single forward pass. Combined with progressive loss balancing and an edge-optimised architecture, this runs at ~16 ms (TensorRT FP16) on the Jetson. No other open-source pose estimator fits the budget.
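The budget arithmetic above can be made concrete. The 30 fps frame interval and the ~16 ms pose time come from the text; treating the remainder as the allowance for the voxel branch, transformer, and fusion layer is an assumption for illustration.

```python
FPS = 30
frame_interval_ms = 1000 / FPS   # ~33.3 ms between frames: the per-frame budget
pose_ms = 16                     # YOLO26 Nano Pose, TensorRT FP16 on the Jetson

# Assumption: whatever the pose estimator does not consume is what remains
# for the voxel branch, transformer, and fusion layer.
remaining_ms = frame_interval_ms - pose_ms
```

A pose estimator at even 25 ms would leave under 9 ms for the rest of the pipeline, which is why the NMS-free model is the deciding factor.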
The commercial path
For deployment, two options: purchase an Ultralytics Enterprise licence, or swap YOLO for an Apache/MIT-licensed pose estimator (e.g. MediaPipe Pose, MMPose) with pipeline re-tuning.
What's unaffected
The rest of the action prediction model (voxel branch, transformer, fusion layer) is entirely custom code with no licence constraints. Only the pose estimator component carries the AGPL dependency.

ROS 2 System Architecture

The entire system runs on an NVIDIA Jetson Orin NX using ROS 2 Humble (Robot Operating System 2), an industry-standard framework for building robotics software as a network of independent, communicating processes called nodes. ROS 2 was chosen because it provides reliable real-time messaging between components, allows each subsystem (CV, fusion, sparring, LLM) to be developed and tested independently, and is the standard for robotics projects requiring sensor integration and motor control.

The system consists of 10 ROS 2 nodes communicating via 21 custom message types and 6 services. Every confirmed punch flows through four stages from camera to robot behaviour:

Data Flow: Camera to Robot Behaviour (4 stages)
  1. CAPTURE (camera_node): RealSense D435i, RGB + depth, 848×480 @ 60 fps
  2. PERCEPTION (cv_node): YOLO pose + 3D voxel motion → 8-class action, sampled @ 30 fps
  3. FUSION (punch_processor): CV ↔ IMU match within a ±500 ms window, plus pad constraints
  4. DOWNSTREAM (5 nodes): session · drills · sparring AI · analytics · LLM · robot motors

IMU sensors (Teensy 4.0): 4× MPU6050, one per pad (centre, left, right, head); imu_node handles pad impact detection and force classification, bridging serial → ROS.

Hardware & connection summary:
  • Jetson Orin NX (16 GB): camera over USB 3.0, display over HDMI, WiFi AP, all ROS nodes, GPU inference
  • Teensy 4.0: 4× MPU6050 pads over I2C, serial link to the Jetson; pad impact detection, force classification
  • Phone dashboard: Jetson WiFi AP → FastAPI (HTTP/WS) → GUI via /tmp/ IPC files
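The fusion stage can be sketched as a window-matching check. The function below is a hypothetical illustration of the ±500 ms matching window and the pad-consistency constraint described above, not the actual punch_processor code; the punch-to-pad mapping is invented for the example.

```python
FUSION_WINDOW_S = 0.5  # CV and IMU events must land within ±500 ms

# Hypothetical mapping from CV punch class to the pads that punch can
# plausibly hit; the real pad-constraint table lives in punch_processor.
PLAUSIBLE_PADS = {
    "jab":        {"centre", "head"},
    "cross":      {"centre", "head"},
    "left_hook":  {"right"},   # a left hook lands on the robot's right pad
    "right_hook": {"left"},
}

def confirm_punch(cv_event, imu_event):
    """Return True only when camera and pad sensor agree: close in time
    AND the struck pad is plausible for the detected punch type."""
    dt = abs(cv_event["t"] - imu_event["t"])
    if dt > FUSION_WINDOW_S:
        return False                      # outside the ±500 ms window
    pads = PLAUSIBLE_PADS.get(cv_event["action"], set())
    return imu_event["pad"] in pads       # pad constraint filters accidental contacts

cv_ev  = {"t": 10.02, "action": "jab"}
imu_ev = {"t": 10.30, "pad": "centre"}
confirmed = confirm_punch(cv_ev, imu_ev)
```

Requiring both sources to agree is what lets fusion reject camera noise (no matching impact) and accidental pad contacts (no matching CV punch) at the same time.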

Zooming out, the full ROS 2 node graph below shows every node, every custom topic, and every service in the system, including the side branches (analytics, drill manager, gesture node, robot control) that the 4-stage flow above abstracts away.

Coordinating all 10 nodes during a live session is the job of the session_manager node and its 5-state machine, covered in detail in 5.3.2.5 Session Management.
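A 5-state session machine like the one session_manager implements can be sketched as a transition table. The state and event names below are hypothetical placeholders; the actual lifecycle is documented in 5.3.2.5.

```python
# Hypothetical state and event names; the real lifecycle is in 5.3.2.5.
TRANSITIONS = {
    "IDLE":        {"start": "CALIBRATING"},
    "CALIBRATING": {"ready": "ACTIVE", "abort": "IDLE"},
    "ACTIVE":      {"pause": "PAUSED", "finish": "COMPLETE"},
    "PAUSED":      {"resume": "ACTIVE", "finish": "COMPLETE"},
    "COMPLETE":    {"reset": "IDLE"},
}

class SessionStateMachine:
    """Single source of truth for the session lifecycle: every training
    mode drives the same five states via named events."""
    def __init__(self):
        self.state = "IDLE"

    def fire(self, event):
        try:
            self.state = TRANSITIONS[self.state][event]
        except KeyError:
            raise ValueError(f"event {event!r} not allowed in state {self.state}")
        return self.state

sm = SessionStateMachine()
sm.fire("start")   # IDLE -> CALIBRATING
sm.fire("ready")   # CALIBRATING -> ACTIVE
```

Rejecting out-of-order events at a single choke point is what keeps all training modes consistent: a node cannot, say, start a drill while the session is still calibrating.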

Appendix 7: Robot Intelligence