5.3 Robot Intelligence (Yogeeswaran)
Robot Intelligence is the AI brain that turns BoxBunny from a static striking target into an adaptive sparring partner. It sees what the boxer is doing in real time, decides how the robot should respond, drives the LLM coach that gives feedback, and surfaces every session as analytics on a phone dashboard.
At its core is a computer vision pipeline that classifies 8 boxing actions (6 punch types, block, and idle) using depth-based 3D voxel motion fused with 2D pose estimation. A ROS 2 workspace on an NVIDIA Jetson Orin NX orchestrates that perception output alongside IMU sensing, an adaptive sparring engine, and a local Gemma 4 E2B LLM coach. Everything runs on-device: no cloud, no internet required.
- This page: requirements, hardware, ROS 2 system architecture
- 5.3.1: Computer Vision & action prediction model
- 5.3.2: Sensor fusion, sparring AI, LLM coach
- 5.3.3: Dashboard backend & analytics
- 5.3.4: Testing & verification of all 8 requirements
Where this subsystem sits in BoxBunny
BoxBunny has three sibling subsystems wired through ROS 2. Robot Intelligence sits in the middle, deciding what the robot does next.
Together, the three fulfil the product needs from the user research: Intelligent Sparring System (5.3.1 + 5.3.2), Adaptive Fight Intelligence (5.3.2), Skill Progression Studio (5.3.2), and Performance Analytics (5.3.3).
Key Achievements
Requirements and Design Considerations
This subsystem addresses DO-1 (Performance Analytics), DO-2 (Intelligent Sparring System), DO-3 (Skill Progression Studio), and DO-4 (Adaptive Fight Intelligence). See Section 5 for the full Design Objectives reference.
These requirements were derived from the Function Analysis, user journey, and the physical constraints of deploying real-time computer vision on an embedded platform during live boxing training.
| ID | Requirement | Rationale |
|---|---|---|
| RI-1 | Classify 8 boxing actions in real-time at >=30 FPS | Match human defensive reaction speed; voxel temporal features depend on consistent 30 FPS frame intervals |
| RI-2 | End-to-end latency (camera to motor command) <=150ms | Matches average intermediate boxer defensive reaction time |
| RI-3 | Classification accuracy >=90% on unseen person validation set | Reduce false triggers and incorrect counter-punches during live sparring |
| RI-4 | CV + IMU sensor fusion for confirmed punch events | Single-source detection unreliable; fusion eliminates false positives from camera noise and accidental pad contacts |
| RI-5 | All inference on-device (NVIDIA Jetson Orin NX 16 GB), no cloud | Gym WiFi unreliable; latency budget prohibits cloud round-trip; zero internet requirement |
| RI-6 | Multiple adaptive AI sparring styles with weakness tracking | Prevent user habituation; support varied training objectives |
| RI-7 | Real-time coaching feedback via local LLM | Personalized technique guidance without cloud dependency |
| RI-8 | Real-time person detection and user tracking for yaw and height motor control | Robot must face the user during training; user adjusts height via phone or GUI before each session using CV-provided height data |
Table: Robot Intelligence Requirements and Design Considerations
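The timing requirements above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 30 FPS target and the 150 ms end-to-end bound come from RI-1 and RI-2, but the per-stage latencies are assumed placeholder numbers, not measured values from the actual pipeline.

```python
# Sanity-check the RI-1/RI-2 timing budget.
# Only TARGET_FPS and the 150 ms bound come from the requirements table;
# the per-stage numbers below are hypothetical placeholders.

TARGET_FPS = 30
frame_interval_ms = 1000 / TARGET_FPS  # ~33.3 ms between frames (RI-1)

# Hypothetical stage latencies along one camera-to-motor path (RI-2):
stage_budget_ms = {
    "capture + depth readout": 20,
    "pose + voxel inference": 33,
    "fusion + decision": 10,
    "motor command dispatch": 20,
}

total_ms = sum(stage_budget_ms.values())
assert total_ms <= 150, "stage budgets must fit inside the RI-2 bound"
print(f"frame interval: {frame_interval_ms:.1f} ms, pipeline total: {total_ms} ms")
```

The key consequence: at 30 FPS the perception stage alone can consume a full frame interval, so every other stage must fit in the remainder of the 150 ms budget.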
System Design Narrative
Following the V-Model from Section 3.2, RI-1 to RI-8 were fixed before any model architecture, ROS node, or LLM prompt was written. The diagram below shows how each requirement decomposed into a design decision (left arm), was built and integrated (base), and then verified (right arm). The table underneath maps each V-Model stage to the specific BoxBunny RI work.
| V-Model Stage | Robot Intelligence Focus | Applied to BoxBunny RI | Verification Evidence |
|---|---|---|---|
| Concept & Requirements | Fix measurable targets before any design work begins | RI-1 to RI-8 derived from product needs and on-device deployment constraints. Each requirement drove a specific design choice downstream (see next rows) | Requirements table above with rationale per requirement |
| High-Level Design | Partition the AI stack into independent ROS 2 nodes coordinated by a central state machine | 10-node architecture, 21 custom messages, 6 services: cv_node, imu_node, punch_processor, session_manager, drill_manager, sparring_engine, llm_node, analytics_node, free_training_engine, robot_node. The 5-state session machine is the single source of truth: all training modes flow through the same lifecycle | Node graph (below) + session state machine in 5.3.2 |
| Detailed Design & Build | Design, train, and deploy each AI component, each driven by a specific RI requirement | RI-1/2: ~24 ms per-frame budget shaped the dual-branch Transformer + TensorRT FP16 deployment. RI-3: 9-iteration CV model with leave-one-person-out validation. RI-4: punch_processor CV+IMU fusion with pad-constraint filtering. RI-5: all inference on-device within the 16 GB shared-memory budget; adaptive frame rate (6 Hz idle / 30 Hz active) shares the GPU between CV and LLM. RI-6: sparring_engine with Markov-chain styles + persistent weakness profile. RI-7: local Gemma 4 E2B LLM coach (multimodal, edge-optimised) with 17-document boxing knowledge base + fallback tips. RI-8: cv_node publishes UserTracking each frame; the yaw motor keeps the robot facing the user, and height is adjusted by the user via phone or GUI. Vue 3 + FastAPI phone dashboard backed by a per-user SQLite database | Per-component design narratives in 5.3.1, 5.3.2, 5.3.3 |
| Verification & Validation | Three-tier testing pyramid + real-world deployment to confirm RI-1 to RI-8 | Unit: 146 pytest tests (sensor fusion, gamification, database). Integration: 28 cross-module tests (ROS messages, fusion end-to-end, motor protocol) + CDE AI Fair early-prototype deployment. System: full robot test with real camera, IMU, and motors + model interpretability analysis | 5.3.4.7 verification summary: all 8 RI requirements pass |
Table: Robot Intelligence application of the V-Model from decomposition to verification
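The RI-6 row above mentions Markov-chain sparring styles: the robot's next move depends probabilistically on its current move, so each style is just a transition matrix. The sketch below illustrates the idea; the action names and transition weights are invented for illustration and do not come from the actual sparring_engine.

```python
import random

# Illustrative Markov-chain sparring style. Action names and the
# transition weights are hypothetical, not the real sparring_engine values.
ACTIONS = ["jab", "cross", "hook", "block", "feint"]

# One style = one transition matrix: P(next action | current action).
AGGRESSIVE_STYLE = {
    "jab":   {"jab": 0.2, "cross": 0.4, "hook": 0.3, "block": 0.05, "feint": 0.05},
    "cross": {"jab": 0.3, "cross": 0.1, "hook": 0.4, "block": 0.1,  "feint": 0.1},
    "hook":  {"jab": 0.4, "cross": 0.2, "hook": 0.1, "block": 0.2,  "feint": 0.1},
    "block": {"jab": 0.5, "cross": 0.2, "hook": 0.1, "block": 0.1,  "feint": 0.1},
    "feint": {"jab": 0.3, "cross": 0.3, "hook": 0.2, "block": 0.1,  "feint": 0.1},
}

def next_action(current: str, style: dict, rng: random.Random) -> str:
    """Sample the robot's next move from the current action's transition row."""
    row = style[current]
    return rng.choices(list(row), weights=list(row.values()))[0]

# Generate a short combination starting from a jab:
rng = random.Random(42)
combo = ["jab"]
for _ in range(4):
    combo.append(next_action(combo[-1], AGGRESSIVE_STYLE, rng))
print(combo)
```

Swapping styles is just swapping the matrix; a persistent weakness profile (RI-6) could then bias these weights toward the punches the user defends worst.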
Hardware & Tech Stack
Everything below runs on a single NVIDIA Jetson Orin NX: one board simultaneously carrying perception, fusion, decision logic, the LLM coach, the dashboard backend, and the touchscreen GUI. No external server, no cloud, no internet required.
Open-Source Licensing
If BoxBunny were deployed as a commercial product, the open-source licences in the stack would need to be addressed. Most of the stack is permissive (Apache 2.0, MIT, BSD), but two components carry stricter terms:
| Component | Licence | Commercial implication |
|---|---|---|
| YOLO26 (Ultralytics) | AGPL-3.0 | Requires open-sourcing the full application if distributed, or purchasing an Ultralytics Enterprise licence (~US$5,000/year per developer, quote-based (Ultralytics, 2024)) |
| PySide6 (Qt) | LGPL | OK if dynamically linked (the default), but must allow users to replace the Qt library |
| Everything else (Gemma 4, llama-cpp, ROS 2, FastAPI, Vue 3) | Apache 2.0 / MIT / BSD | Permissive, no issues |
- The constraint: The system's 33 ms per-frame budget leaves no room for a slower pose estimator. YOLO26 Nano Pose (`yolo26n-pose`, Jocher et al., 2025) is the fastest open-source option available: its key breakthrough is NMS-free end-to-end inference (no post-processing step, final detections in a single forward pass), which, combined with progressive loss balancing and an edge-optimised architecture, runs at ~16 ms (TensorRT FP16) on the Jetson. No other open-source pose estimator fits the budget.
- The commercial path: For deployment, two options exist: purchase an Ultralytics Enterprise licence, or swap YOLO for an Apache/MIT-licensed pose estimator (e.g. MediaPipe Pose, MMPose) with pipeline re-tuning.
- What's unaffected: The rest of the action prediction model (voxel branch, transformer, fusion layer) is entirely custom code with no licence constraints. Only the pose estimator component carries the AGPL dependency.
ROS 2 System Architecture
The entire system runs on an NVIDIA Jetson Orin NX using ROS 2 Humble (Robot Operating System 2), an industry-standard framework for building robotics software as a network of independent, communicating processes called nodes. ROS 2 was chosen because it provides reliable real-time messaging between components, allows each subsystem (CV, fusion, sparring, LLM) to be developed and tested independently, and is the standard for robotics projects requiring sensor integration and motor control.
The system consists of 10 ROS 2 nodes communicating via 21 custom message types and 6 services. Every confirmed punch flows through four stages from camera to robot behaviour:
Zooming out, the full ROS 2 node graph below shows every node, every custom topic, and every service in the system, including the side branches (analytics, drill manager, gesture node, robot control) that the 4-stage flow above abstracts away.
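The fusion step at the heart of this flow (RI-4) can be sketched as a simple pairing rule: a punch is confirmed only when a CV detection and an IMU impact spike agree in time. The 100 ms window, thresholds, and field names below are assumptions for illustration, not the punch_processor's actual parameters.

```python
from dataclasses import dataclass

# Hypothetical pairing window and thresholds -- illustrative values only.
FUSION_WINDOW_S = 0.10

@dataclass
class CvEvent:
    t: float           # timestamp (s)
    punch_type: str    # one of the 6 punch classes
    confidence: float  # classifier confidence

@dataclass
class ImuEvent:
    t: float           # timestamp (s)
    peak_accel: float  # impact magnitude (g)

def confirm_punch(cv: CvEvent, imu: ImuEvent,
                  min_conf: float = 0.8, min_accel: float = 2.0) -> bool:
    """Fuse both sources: require temporal agreement plus per-source thresholds."""
    return (abs(cv.t - imu.t) <= FUSION_WINDOW_S
            and cv.confidence >= min_conf
            and imu.peak_accel >= min_accel)

# A CV detection paired with an IMU spike 40 ms later is confirmed:
print(confirm_punch(CvEvent(1.00, "jab", 0.95), ImuEvent(1.04, 3.1)))  # True
# Camera noise with no nearby IMU spike is rejected:
print(confirm_punch(CvEvent(1.00, "jab", 0.95), ImuEvent(1.40, 3.1)))  # False
```

This is why single-source detection is unreliable: either sensor alone passes its own threshold in these examples, but only the temporally paired pair is confirmed.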
Coordinating all 10 nodes during a live session is the job of the session_manager node and its 5-state machine, covered in detail in 5.3.2.5 Session Management.
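The shape of such a state machine can be sketched as a transition table that rejects illegal moves. The state names and transitions below are purely illustrative: this page does not name the 5 states, so IDLE/CALIBRATING/ACTIVE/PAUSED/SUMMARY are invented placeholders, not the session_manager's actual lifecycle.

```python
from enum import Enum, auto

# Hypothetical 5-state lifecycle -- names and edges are illustrative,
# not the actual session_manager states from 5.3.2.5.
class SessionState(Enum):
    IDLE = auto()
    CALIBRATING = auto()
    ACTIVE = auto()
    PAUSED = auto()
    SUMMARY = auto()

# Allowed transitions: every training mode moves through the same lifecycle.
TRANSITIONS = {
    SessionState.IDLE:        {SessionState.CALIBRATING},
    SessionState.CALIBRATING: {SessionState.ACTIVE, SessionState.IDLE},
    SessionState.ACTIVE:      {SessionState.PAUSED, SessionState.SUMMARY},
    SessionState.PAUSED:      {SessionState.ACTIVE, SessionState.SUMMARY},
    SessionState.SUMMARY:     {SessionState.IDLE},
}

def transition(current: SessionState, target: SessionState) -> SessionState:
    """Reject any transition the lifecycle table does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

state = SessionState.IDLE
state = transition(state, SessionState.CALIBRATING)
state = transition(state, SessionState.ACTIVE)
print(state.name)  # ACTIVE
```

Centralising the table like this is what makes the state machine a single source of truth: drill, sparring, and free-training modes all call the same `transition` gate rather than tracking their own lifecycles.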