5.3 Robot Intelligence (Yogeeswaran)
Robot Intelligence is the AI brain that turns BoxBunny from a static striking target into an adaptive sparring partner. It sees what the boxer is doing in real time, decides how the robot should respond, drives the LLM coach that gives feedback, and surfaces every session as analytics on a phone dashboard.
At its core is a computer vision pipeline that classifies 8 boxing actions (6 punch types, block, and idle) using depth-based 3D voxel motion fused with 2D pose estimation. A ROS 2 workspace on an NVIDIA Jetson Orin NX orchestrates that perception output alongside IMU sensing, an adaptive sparring engine, and a local Gemma 4 E2B LLM coach. Everything runs on-device: no cloud, no internet required.
- This page: requirements, hardware, ROS 2 system architecture
- 5.3.1: Computer Vision & action prediction model
- 5.3.2: Sensor fusion, sparring AI, LLM coach
- 5.3.3: Dashboard backend & analytics
- 5.3.4: Testing & verification of all 8 requirements
Where this subsystem sits in BoxBunny
BoxBunny has three sibling subsystems wired through ROS 2. Robot Intelligence sits in the middle, deciding what the robot does next.
Together, the three fulfil the product needs from the user research: Intelligent Sparring System (5.3.1 + 5.3.2), Adaptive Fight Intelligence (5.3.2), Skill Progression Studio (5.3.2), and Performance Analytics (5.3.3).
Key Achievements
Requirements and Design Considerations
This subsystem addresses DO-1 (Performance Analytics), DO-2 (Intelligent Sparring System), DO-3 (Skill Progression Studio), and DO-4 (Adaptive Fight Intelligence). See Section 5 for the full Design Objectives reference.
These requirements were derived from the Function Analysis, user journey, and the physical constraints of deploying real-time computer vision on an embedded platform during live boxing training.
| ID | Requirement | Rationale |
|---|---|---|
| RI-1 | Classify 8 boxing actions in real-time at >=30 FPS | Match human defensive reaction speed; voxel temporal features depend on consistent 30 FPS frame intervals |
| RI-2 | End-to-end latency (camera to motor command) <=150ms | Matches average intermediate boxer defensive reaction time |
| RI-3 | Classification accuracy >=90% on unseen person validation set | Reduce false triggers and incorrect counter-punches during live sparring |
| RI-4 | CV + IMU sensor fusion for confirmed punch events | Single-source detection unreliable; fusion eliminates false positives from camera noise and accidental pad contacts |
| RI-5 | All inference on-device (NVIDIA Jetson Orin NX 16 GB), no cloud | Gym WiFi unreliable; latency budget prohibits cloud round-trip; zero internet requirement |
| RI-6 | Multiple adaptive AI sparring styles with weakness tracking | Prevent user habituation; support varied training objectives |
| RI-7 | Real-time coaching feedback via local LLM | Personalized technique guidance without cloud dependency |
| RI-8 | Real-time person detection and user tracking for yaw and height motor control | Robot must face the user during training; user adjusts height via phone or GUI before each session using CV-provided height data |
Table: Robot Intelligence Requirements and Design Considerations
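The timing requirements above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 30 FPS target and the 150 ms end-to-end bound come from RI-1 and RI-2, but the per-stage latencies are assumed placeholder numbers, not measured values from the actual pipeline.

```python
# Sanity-check the RI-1/RI-2 timing budget.
# Only TARGET_FPS and the 150 ms bound come from the requirements table;
# the per-stage numbers below are hypothetical placeholders.

TARGET_FPS = 30
frame_interval_ms = 1000 / TARGET_FPS  # ~33.3 ms between frames (RI-1)

# Hypothetical stage latencies along one camera-to-motor path (RI-2):
stage_budget_ms = {
    "capture + depth readout": 20,
    "pose + voxel inference": 33,
    "fusion + decision": 10,
    "motor command dispatch": 20,
}

total_ms = sum(stage_budget_ms.values())
assert total_ms <= 150, "stage budgets must fit inside the RI-2 bound"
print(f"frame interval: {frame_interval_ms:.1f} ms, pipeline total: {total_ms} ms")
```

The key consequence: at 30 FPS the perception stage alone can consume a full frame interval, so every other stage must fit in the remainder of the 150 ms budget.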
System Design Narrative
Following the V-Model from Section 3.2, RI-1 to RI-8 were fixed before any model architecture, ROS node, or LLM prompt was written. The diagram below shows how each requirement decomposed into a design decision (left arm), was built and integrated (base), and then verified (right arm). The table underneath maps each V-Model stage to the specific BoxBunny RI work.
| V-Model Stage | Robot Intelligence Focus | Applied to BoxBunny RI | Verification Evidence |
|---|---|---|---|
| Concept & Requirements | Fix measurable targets before any design work begins | RI-1 to RI-8 derived from product needs and on-device deployment constraints. Each requirement drove a specific design choice downstream (see next rows) | Requirements table above with rationale per requirement |
| High-Level Design | Partition the AI stack into independent ROS 2 nodes coordinated by a central state machine | 10-node architecture, 21 custom messages, 6 services: cv_node, imu_node, punch_processor, session_manager, drill_manager, sparring_engine, llm_node, analytics_node, free_training_engine, robot_node. The 5-state session machine is the single source of truth: all training modes flow through the same lifecycle | Node graph (below) + session state machine in 5.3.2 |
| Detailed Design & Build | Design, train, and deploy each AI component, each driven by a specific RI requirement | RI-1/2: ~24 ms per-frame budget shaped the dual-branch Transformer + TensorRT FP16 deployment. RI-3: 9-iteration CV model with leave-one-person-out validation. RI-4: punch_processor CV+IMU fusion with pad-constraint filtering. RI-5: all inference on-device within the 16 GB shared-memory budget; adaptive frame rate (6 Hz idle / 30 Hz active) shares the GPU between CV and LLM. RI-6: sparring_engine with Markov-chain styles + persistent weakness profile. RI-7: local Gemma 4 E2B LLM coach (multimodal, edge-optimised) with 17-document boxing knowledge base + fallback tips. RI-8: cv_node publishes UserTracking each frame; the yaw motor keeps the robot facing the user, and height is adjusted by the user via phone or GUI. Vue 3 + FastAPI phone dashboard backed by a per-user SQLite database | Per-component design narratives in 5.3.1, 5.3.2, 5.3.3 |
| Verification & Validation | Three-tier testing pyramid + real-world deployment to confirm RI-1 to RI-8 | Unit: 146 pytest tests (sensor fusion, gamification, database). Integration: 28 cross-module tests (ROS messages, fusion end-to-end, motor protocol) + CDE AI Fair early-prototype deployment. System: full robot test with real camera, IMU, and motors + model interpretability analysis | 5.3.4.7 verification summary: all 8 RI requirements pass |
Table: Robot Intelligence application of the V-Model from decomposition to verification
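The RI-6 row above mentions Markov-chain sparring styles: the robot's next move depends probabilistically on its current move, so each style is just a transition matrix. The sketch below illustrates the idea; the action names and transition weights are invented for illustration and do not come from the actual sparring_engine.

```python
import random

# Illustrative Markov-chain sparring style. Action names and the
# transition weights are hypothetical, not the real sparring_engine values.
ACTIONS = ["jab", "cross", "hook", "block", "feint"]

# One style = one transition matrix: P(next action | current action).
AGGRESSIVE_STYLE = {
    "jab":   {"jab": 0.2, "cross": 0.4, "hook": 0.3, "block": 0.05, "feint": 0.05},
    "cross": {"jab": 0.3, "cross": 0.1, "hook": 0.4, "block": 0.1,  "feint": 0.1},
    "hook":  {"jab": 0.4, "cross": 0.2, "hook": 0.1, "block": 0.2,  "feint": 0.1},
    "block": {"jab": 0.5, "cross": 0.2, "hook": 0.1, "block": 0.1,  "feint": 0.1},
    "feint": {"jab": 0.3, "cross": 0.3, "hook": 0.2, "block": 0.1,  "feint": 0.1},
}

def next_action(current: str, style: dict, rng: random.Random) -> str:
    """Sample the robot's next move from the current action's transition row."""
    row = style[current]
    return rng.choices(list(row), weights=list(row.values()))[0]

# Generate a short combination starting from a jab:
rng = random.Random(42)
combo = ["jab"]
for _ in range(4):
    combo.append(next_action(combo[-1], AGGRESSIVE_STYLE, rng))
print(combo)
```

Swapping styles is just swapping the matrix; a persistent weakness profile (RI-6) could then bias these weights toward the punches the user defends worst.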
Hardware & Tech Stack
Everything below runs on a single NVIDIA Jetson Orin NX: one board simultaneously carrying perception, fusion, decision logic, the LLM coach, the dashboard backend, and the touchscreen GUI. No external server, no cloud, no internet required.
Open-Source Licensing
If BoxBunny were deployed as a commercial product, the open-source licences in the stack would need to be addressed. Most of the stack is permissive (Apache 2.0, MIT, BSD), but two components carry stricter terms:
| Component | Licence | Commercial implication |
|---|---|---|
| YOLO26 (Ultralytics) | AGPL-3.0 | Requires open-sourcing the full application if distributed, or purchasing an Ultralytics Enterprise licence (~US$5,000/year per developer, quote-based (Ultralytics, 2024)) |
| PySide6 (Qt) | LGPL | OK if dynamically linked (the default), but must allow users to replace the Qt library |
| Everything else (Gemma 4, llama-cpp, ROS 2, FastAPI, Vue 3) | Apache 2.0 / MIT / BSD | Permissive, no issues |
- The constraint: The system's 33 ms per-frame budget leaves no room for a slower pose estimator. YOLO26 Nano Pose (`yolo26n-pose`, Jocher et al., 2025) is the fastest open-source option available: its key breakthrough is NMS-free end-to-end inference (no post-processing step, final detections in a single forward pass), which, combined with progressive loss balancing and an edge-optimised architecture, runs at ~16 ms (TensorRT FP16) on the Jetson. No other open-source pose estimator fits the budget.
- The commercial path: For deployment, two options exist: purchase an Ultralytics Enterprise licence, or swap YOLO for an Apache/MIT-licensed pose estimator (e.g. MediaPipe Pose, MMPose) with pipeline re-tuning.
- What's unaffected: The rest of the action prediction model (voxel branch, transformer, fusion layer) is entirely custom code with no licence constraints. Only the pose estimator component carries the AGPL dependency.
ROS 2 System Architecture
The entire system runs on an NVIDIA Jetson Orin NX using ROS 2 Humble (Robot Operating System 2), an industry-standard framework for building robotics software as a network of independent, communicating processes called nodes. ROS 2 was chosen because it provides reliable real-time messaging between components, allows each subsystem (CV, fusion, sparring, LLM) to be developed and tested independently, and is the standard for robotics projects requiring sensor integration and motor control.
The system consists of 10 ROS 2 nodes communicating via 21 custom message types and 6 services. Every confirmed punch flows through four stages from camera to robot behaviour:
Zooming out, the full ROS 2 node graph below shows every node, every custom topic, and every service in the system, including the side branches (analytics, drill manager, gesture node, robot control) that the 4-stage flow above abstracts away.
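The fusion step at the heart of this flow (RI-4) can be sketched as a simple pairing rule: a punch is confirmed only when a CV detection and an IMU impact spike agree in time. The 100 ms window, thresholds, and field names below are assumptions for illustration, not the punch_processor's actual parameters.

```python
from dataclasses import dataclass

# Hypothetical pairing window and thresholds -- illustrative values only.
FUSION_WINDOW_S = 0.10

@dataclass
class CvEvent:
    t: float           # timestamp (s)
    punch_type: str    # one of the 6 punch classes
    confidence: float  # classifier confidence

@dataclass
class ImuEvent:
    t: float           # timestamp (s)
    peak_accel: float  # impact magnitude (g)

def confirm_punch(cv: CvEvent, imu: ImuEvent,
                  min_conf: float = 0.8, min_accel: float = 2.0) -> bool:
    """Fuse both sources: require temporal agreement plus per-source thresholds."""
    return (abs(cv.t - imu.t) <= FUSION_WINDOW_S
            and cv.confidence >= min_conf
            and imu.peak_accel >= min_accel)

# A CV detection paired with an IMU spike 40 ms later is confirmed:
print(confirm_punch(CvEvent(1.00, "jab", 0.95), ImuEvent(1.04, 3.1)))  # True
# Camera noise with no nearby IMU spike is rejected:
print(confirm_punch(CvEvent(1.00, "jab", 0.95), ImuEvent(1.40, 3.1)))  # False
```

This is why single-source detection is unreliable: either sensor alone passes its own threshold in these examples, but only the temporally paired pair is confirmed.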
Coordinating all 10 nodes during a live session is the job of the session_manager node and its 5-state machine, covered in detail in 5.3.2.5 Session Management.
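The shape of such a state machine can be sketched as a transition table that rejects illegal moves. The state names and transitions below are purely illustrative: this page does not name the 5 states, so IDLE/CALIBRATING/ACTIVE/PAUSED/SUMMARY are invented placeholders, not the session_manager's actual lifecycle.

```python
from enum import Enum, auto

# Hypothetical 5-state lifecycle -- names and edges are illustrative,
# not the actual session_manager states from 5.3.2.5.
class SessionState(Enum):
    IDLE = auto()
    CALIBRATING = auto()
    ACTIVE = auto()
    PAUSED = auto()
    SUMMARY = auto()

# Allowed transitions: every training mode moves through the same lifecycle.
TRANSITIONS = {
    SessionState.IDLE:        {SessionState.CALIBRATING},
    SessionState.CALIBRATING: {SessionState.ACTIVE, SessionState.IDLE},
    SessionState.ACTIVE:      {SessionState.PAUSED, SessionState.SUMMARY},
    SessionState.PAUSED:      {SessionState.ACTIVE, SessionState.SUMMARY},
    SessionState.SUMMARY:     {SessionState.IDLE},
}

def transition(current: SessionState, target: SessionState) -> SessionState:
    """Reject any transition the lifecycle table does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

state = SessionState.IDLE
state = transition(state, SessionState.CALIBRATING)
state = transition(state, SessionState.ACTIVE)
print(state.name)  # ACTIVE
```

Centralising the table like this is what makes the state machine a single source of truth: drill, sparring, and free-training modes all call the same `transition` gate rather than tracking their own lifecycles.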