Testing and Evaluation
This section presents the functional testing conducted during development iterations and user testing observations from the public showcase event.
V-Model alignment: This page corresponds to the right side of the V, where GUI requirements are verified against measurable outcomes.
The test results are also product feedback. They show whether the GUI felt usable as a training tool, not just whether the implementation was technically correct.
Developer Functional Testing
Testing was conducted incrementally throughout development. At each iteration,
new features were verified by navigating the full affected flow and checking both
the expected case and known failure modes. After each development session,
a standard audit was run: navigate every affected page, confirm all
PageIndex constants matched their addWidget positions,
and verify that surrounding features were unaffected. The cases below document
defects discovered and resolved through this process.
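The index audit described above can be sketched as a small standalone check. The names here (PAGE_INDEX, ADD_WIDGET_ORDER, audit_page_indices) are illustrative stand-ins, not the project's actual identifiers:

```python
# Hypothetical sketch of the post-edit index audit: each PageIndex constant
# must equal its page's position in the addWidget call order.
PAGE_INDEX = {"Homepage": 0, "TrainingPage": 1, "Performance": 2, "Others": 3}
ADD_WIDGET_ORDER = ["Homepage", "TrainingPage", "Performance", "Others"]

def audit_page_indices(page_index, add_widget_order):
    """Return pages whose constant disagrees with its addWidget position
    (an empty list means the audit passes)."""
    return [
        name for name, idx in page_index.items()
        if idx >= len(add_widget_order) or add_widget_order[idx] != name
    ]

mismatches = audit_page_indices(PAGE_INDEX, ADD_WIDGET_ORDER)
print(mismatches)  # → []  (every constant matches its position)
```

Inserting a page mid-block without updating the constants, as in TC-01, makes every subsequent page appear in this list.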
| TC | Category | Observation | Root Cause | Resolution |
|---|---|---|---|---|
| TC-01 | Navigation integrity | Performance page and Others page showed blank screens after sparring pages were added | Sparring pages inserted at index 22 in the addWidget block, shifting all subsequent page indices | Full index audit; block restored to correct order |
| TC-02 | Navigation integrity | Five pages (Performance, Others, PowerInstructions, StaminaInstructions, ReactionInstructions) showed blank screens after an edit session | Page instantiation lines replaced with QWidget() stubs during automated editing; detection required searching MainWindow.__init__ for all self.x = QWidget() patterns | Pages restored to correct class instantiations; audit pattern added to post-edit checklist |
| TC-03 | Navigation integrity | App crashed on launch with RecursionError (992 frames) after circular import fix | PunchCombinationPage.__init__ called on_difficulty_clicked("Beginner") immediately at construction rather than connecting it as a signal callback | All button connections converted to lambda callbacks |
| TC-04 | Navigation integrity | Self-Select button navigated to wrong page; console output showed False button clicked | Qt clicked signal passes a boolean checked argument that overrides lambda default parameters | Pattern changed to lambda checked=False: self.method("arg") throughout all navigation lambdas |
| TC-05 | Feature regression | Self-Select page showed Basic Parameters content after an edit session | Entire SelfSelectSequencePage class silently replaced with a simplified version missing the numpad, defense buttons, and reorder list | Class restored from last confirmed working version; verified by checking for sequence_input, confirm_sequence, and move_sequence_up identifiers |
| TC-06 | Data layer | Combo training sessions failed to save after CV placeholder was wired to return 0-5 scores | SQLite CHECK(score BETWEEN 0.0 AND 1.0) constraint rejected scores above 1.0 | Schema rebuilt with CHECK(score BETWEEN 0.0 AND 5.0); mastery threshold updated to 3.0/5.0 |
| TC-07 | Navigation integrity | Back button on Performance History always returned to Settings regardless of entry path | Hardcoded back destinations on every page with no runtime state tracking | Navigation stack implemented in Iteration 5; entry point recorded before every transition |
| TC-08 | Feature regression | Test account persisted in users.json after pressing Back on the proficiency checklist | self._username was empty string at Back press because load_for_user() call sequence was incorrect in the signup flow | Call sequence corrected; verified by creating and discarding a test account |
| TC-09 | Feature regression | Fixed-size buttons on ProfiencyChecklistPage appeared at full navigation size (360x65 px) despite setFixedSize() calls | setup_navigation() in ButtonNavigationMixin applies min-width/min-height stylesheet overrides to all QPushButton children after construction | SKIP_NAV_SETUP = True class attribute added; inline min-width: 1px; min-height: 1px overrides applied |
| TC-10 | Integration verification | Tooltip text across Homepage and TrainingPage did not match agreed specification after implementation | Content mismatch between implemented strings and specified tooltip text | Targeted correction pass across all _attach_tooltips methods; each tooltip verified against specification |
| TC-11 | Data layer | Combo progress showed inflated percentage after few training sessions | Progress formula averaged only attempted combos, excluding unplayed combos from the denominator | Denominator fixed to always use total combo count at the user's level, so unplayed combos score as 0 |
| TC-12 | Data layer | Session counts in CSV did not match combo attempt counts in database after migration | Legacy CSV contained entries from all users mixed together from before per-user isolation was introduced | CSV history reset on migration; per-user SQLite database became sole source of truth |
| TC-13 | Feature regression | Beginner-level users could access Sparring mode after proficiency assessment was added | Proficiency gate logic not yet wired into SparPage navigation handler | Gate condition added; Beginner users see a lockout message; Intermediate and Advanced proceed normally |
| TC-14 | Integration verification | LLM chat showed placeholder-looking responses with no indication of whether the model was loaded | Fallback to hardcoded responses was silent; no status message distinguishing real LLM output from fallback | Explicit status messages added; chat always shows whether the local model, Anthropic API fallback, or hardcoded fallback is active |
Table 5.1.3-1: Developer Functional Test Cases (incremental, across all six iterations)
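The root cause of TC-04 can be reproduced in plain Python without Qt. The navigate helper below is illustrative; only the lambda patterns mirror the actual fix:

```python
# TC-04 in miniature: Qt's clicked signal invokes the slot with a boolean
# "checked" argument, which silently overrides a lambda's default parameter.
def navigate(page):
    return f"navigating to {page}"

# Buggy pattern: the signal's checked value replaces the intended argument.
buggy = lambda page="SelfSelect": navigate(page)
print(buggy(False))  # → "navigating to False"  (wrong page)

# Fixed pattern used throughout the navigation lambdas: absorb the checked
# argument explicitly so the real argument survives.
fixed = lambda checked=False, page="SelfSelect": navigate(page)
print(fixed(False))  # → "navigating to SelfSelect"
```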
Functional Testing
Functional testing was conducted on the features available at the time of the "Robotics Meets AI Showcase" event in late January 2026. The features tested were the combo curriculum system, the reaction time test, and the navigation stack. Test scenarios verified correct behaviour under normal use conditions.
The focus was on whether the product supported training flow cleanly. Each test asked a practical question: can the boxer start quickly, stay in the session, and get back to progress review without friction?
All functional tests passed. The navigation stack test is worth highlighting: the scenario traced a five-page deep navigation path and verified that pressing back four times returned through each page in the correct reverse order, confirming that the centralised stack handled arbitrary navigation depth correctly.
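The behaviour this test verifies can be sketched as follows. Navigator and its method names are illustrative, not the project's actual classes; the entry-point-before-transition rule matches the TC-07 resolution:

```python
# Minimal sketch of a centralised navigation stack: the entry point is
# recorded before every transition, and Back pops in reverse order.
class Navigator:
    def __init__(self, start):
        self.current = start
        self._stack = []

    def go_to(self, page):
        self._stack.append(self.current)  # record entry point before moving
        self.current = page

    def back(self):
        if self._stack:
            self.current = self._stack.pop()
        return self.current

nav = Navigator("Home")
for page in ("Training", "Combo", "Session", "Results"):
    nav.go_to(page)

# Four Back presses retrace the five-page path in reverse order.
print([nav.back() for _ in range(4)])
# → ['Session', 'Combo', 'Training', 'Home']
```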
Automated Test Suite
The project runs 146 pytest tests across all subsystems. The GUI-relevant tests cover the following areas:
- Navigation stack: arbitrary-depth push and pop sequences verify correct reverse-order page return
- Multi-user data isolation: five parallel test accounts confirm zero cross-contamination across user data
- Combo curriculum mastery algorithm: score accumulation, unlock threshold, and tier advancement logic
- Gamification: XP calculation and rank thresholds across all six rank tiers
- Pattern lock authentication: SHA-256 hash encode and verify round-trip
- GuiBridge offline degradation: all service calls return failure cleanly when the ROS stack is unavailable, confirming the mock interface functions correctly
Tests are executed from the project's master Jupyter notebook and run independently of hardware. All GUI tests pass without a connected Jetson or ROS environment.
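As one example of the round-trips above, the pattern-lock hash check can be sketched with the standard library. The pad-index encoding shown here is an assumption for illustration, not the project's actual scheme:

```python
import hashlib

def encode_pattern(pattern):
    """Hash an ordered sequence of pad indices, e.g. (0, 4, 8, 5)."""
    raw = "-".join(str(p) for p in pattern).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def verify_pattern(pattern, stored_hash):
    """Round-trip check: re-encode the attempt and compare digests."""
    return encode_pattern(pattern) == stored_hash

stored = encode_pattern((0, 4, 8, 5))
print(verify_pattern((0, 4, 8, 5), stored))  # → True
print(verify_pattern((0, 4, 8, 6), stored))  # → False
```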
User Testing: Robotics Meets AI Showcase
Setup
Members of the public physically interacted with BoxBunny at the "Robotics Meets AI Showcase" event in late January 2026, specifically within the Robotics and AI in Work and Play segment. Two displays were available during the demonstration: the original 7-inch touchscreen mounted directly on the robot, and a larger external monitor connected to the same Jetson Orin NX showing the same GUI output.
This setup exposed the product to real user conditions: the event showed how the GUI performed when people approached it as a training tool rather than as a demo screen.
Showcase Observations Summary
The following table summarises observations recorded at the showcase. This was a single convenience-sample event, not a structured usability study: no task completion times or formal rating scales were collected, and findings are treated as directional evidence from direct observation rather than statistically significant results. More than 50 people interacted with the robot during the event.
| Observation | Users Affected | Severity | Action Taken |
|---|---|---|---|
| Difficulty reading combo prompts on the 7-inch display during active training | Majority of participants who attempted the 7-inch display | High | Display upgraded to 10.1-inch IPS capacitive touchscreen (1280x800) |
| Text too small to follow comfortably while focusing on punching technique | Several participants (reported verbally) | High | Confirmed display upgrade decision; proportional layout meant no redesign was required |
| Participants consistently gravitated toward the larger external monitor for reading results and navigating | All participants who used both displays | Medium | Validated the hardware revision direction before implementation |
| No readability or navigation complaints reported when using the larger monitor | All who switched to larger monitor | None | Confirmed adequate font sizing and layout hierarchy on larger display |
| Navigation between main menu and training pages required no verbal instruction | All observed participants | None | Navigation hierarchy confirmed intuitive for first-time users; no changes required |
Table 5.1.3-3: Showcase Event Observations - "Robotics and AI in Work and Play" (single event, directional evidence only)
Observations
During the session, participants and observers consistently gravitated toward the larger monitor when reading training results and navigating between pages. Users who attempted to operate the 7-inch screen during active training reported difficulty reading combo prompts and statistics. Several participants noted that the text was too small to follow comfortably while also focusing on technique. Users who switched to the larger monitor did not report the same issues.
This confirmed that the product needed stronger readability and a clearer information hierarchy during active training. The issue was not only display size, but how well the GUI supported attention split between movement and screen reading.
Display Upgrade Decision
These observations validated a hardware revision. The 7-inch display was replaced with a 10.1-inch IPS capacitive touchscreen (1280x800 resolution, DSI interface). The upgrade improved text legibility and made the analytics pages easier to read during and after training. Because the GUI used proportional layout sizing rather than fixed pixel coordinates, the existing interface scaled to the new resolution without requiring a redesign cycle.
The decision followed user evidence rather than hardware preference: it was a product change driven by how boxers actually used the interface in training.
Performance Criteria Assessment
The table below maps each GUI requirement to its target, verification method, and outcome.
| ID | Requirement | Target | Verification | Met? |
|---|---|---|---|---|
| GUI-1 | Touchscreen and physical button operable: large touch targets, minimal text input, IMU-based button navigation | All interactive elements ≥60px; full navigation via 4 pads | Functional test; showcase observation; IMU nav confirmed in live session | Yes |
| GUI-2 | Multi-user accounts with complete data isolation between users | Zero cross-contamination across accounts | 5 test accounts; pytest isolation tests; no cross-user data observed | Yes |
| GUI-3 | Structured training progression through a 50-combo curriculum with mastery-based advancement | Unlock after 5 sessions with average score ≥3.0/5.0 | Curriculum algorithm pytest; multiple complete training cycles observed | Yes |
| GUI-4 | Real-time session data display (combo prompts, round timers, performance metrics) | Combo prompt, timer, and punch counter update within 33 ms (one frame at the DSI display's 30 fps refresh rate) | Observed during live sessions; no visible lag between punch event and counter update; Qt event loop confirmed single-threaded update on signal receipt | Yes |
| GUI-5 | Hardware-independent development via an abstracted integration layer with mock interfaces | All pages navigable and all features operable with no serial hardware connected; mock-interface mode activates automatically on hardware-absent startup; no import errors or page crashes on laptop | Entire development cycle conducted on laptop; mock-interface tests confirmed expected behavior | Yes |
Table 5.1.3-4: GUI Performance Criteria Assessment
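The GUI-3 mastery rule can be expressed as a short predicate. The function name is illustrative and the per-session score list is an assumed format:

```python
# Sketch of the GUI-3 rule: advancement unlocks once at least five sessions
# have been logged and their average score is >= 3.0 out of 5.0 (the
# threshold set when the schema was rebuilt in TC-06).
MIN_SESSIONS = 5
MASTERY_THRESHOLD = 3.0

def is_mastered(scores):
    """scores: per-session scores for one combo, each in [0.0, 5.0]."""
    if len(scores) < MIN_SESSIONS:
        return False
    return sum(scores) / len(scores) >= MASTERY_THRESHOLD

print(is_mastered([3.5, 3.0, 2.5, 4.0, 3.0]))  # → True  (avg 3.2 over 5 sessions)
print(is_mastered([4.0, 4.5, 5.0]))            # → False (only 3 sessions)
```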
Limitations
The following limitations apply to the testing evidence presented in this section. They are stated to give an accurate picture of what the results can and cannot claim.
User Testing Scope
The showcase event was a single convenience-sample demonstration, not a structured usability study with recruited participants. No task completion times, error counts, or structured rating scales (such as the System Usability Scale) were collected. Observations reflect what was directly visible during approximately two hours of public interaction at one event. Conclusions drawn from this data are directional only and cannot be generalised to the full target user population.
Hardware Access During Development
The GUI was developed predominantly on a Windows laptop using mock interfaces. Integration testing against live robot hardware was limited by hardware availability during the development phase. Some edge cases in serial communication, CV handshake timing, and multi-subsystem state transitions may not have been surfaced under the simulation conditions used.
Reflection
What Worked
Iteration Control and Defect Containment
The iterative development approach was the correct call for this project. Every significant defect discovered during testing was caught and resolved within the iteration that introduced it. The score scale mismatch (TC-06), navigation stack bug (TC-07), and blank page pattern (TC-02) would all have been costly to fix in a later phase. Catching them early kept the codebase stable.
Mock Interfaces and Integration Readiness
The mock interface architecture was the single most valuable structural decision.
By abstracting all hardware communication behind a GuiBridge layer with
a swap-in mock, the full GUI was runnable and testable on a Windows laptop throughout
development. This meant that six complete iterations were validated before any
physical hardware connection was required. The integration work at the end of the
project was straightforward precisely because the interfaces had been designed and
tested in isolation first.
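A minimal sketch of this pattern, with illustrative names rather than the project's actual GuiBridge API:

```python
# Hardware communication sits behind one interface; a mock stands in when
# no hardware is detected, so the full GUI runs on a laptop.
class GuiBridge:
    """Hardware-facing interface; the real build talks to the robot."""
    def start_session(self, combo_id):
        raise NotImplementedError

class MockGuiBridge(GuiBridge):
    """Laptop-mode stand-in: every call succeeds locally, records itself."""
    def __init__(self):
        self.calls = []

    def start_session(self, combo_id):
        self.calls.append(("start_session", combo_id))
        return {"ok": True, "combo_id": combo_id}

def make_bridge(hardware_present):
    # Mock mode activates automatically on hardware-absent startup.
    return GuiBridge() if hardware_present else MockGuiBridge()

bridge = make_bridge(hardware_present=False)
print(bridge.start_session(7))  # → {'ok': True, 'combo_id': 7}
```

The GUI code only ever sees the base interface, which is what made all six iterations testable before any hardware connection existed.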
Data Isolation Confidence
The per-user SQLite database design also performed well under test. Five test accounts were created and exercised across training, performance, and sparring flows with no observed cross-user data contamination.
Future Work
Structured Usability Study
The most significant gap in the current evidence base is the absence of structured user testing. A follow-on study with 8 to 12 recruited boxing gym members, using timed task scenarios (login to session start, combo training completion, results review) and a System Usability Scale questionnaire, would provide quantitative usability data and allow comparison against benchmark scores. This should be conducted on the physical robot in a gym environment, not in a lab.
Expanded Performance History Analytics
The current history pages display tabular results. Trend visualisation across sessions (power trend, stamina fatigue curve, reaction time improvement) would provide more actionable insight for users tracking progress over weeks. The data is already being stored per-user; the work is in the presentation layer only.
Cloud-Synced User Data
The per-user SQLite file approach is well suited to a single-unit deployment but is not designed for scale. A cloud-backed storage layer would allow a user's training history to follow them across multiple BoxBunny units at different gyms, and would enable coaches to review athlete progress remotely without needing physical access to the robot. The per-user folder structure already maps cleanly to a per-document model in a cloud database. This is over-engineered for the current single-unit context but becomes the natural architecture at gym-network scale.
Distributable Desktop Application
The current GUI is a Python library that requires the full development environment
to run on the Jetson. For a larger deployment, packaging the analytics and
configuration interface as a standalone executable would decouple it from the robot
hardware entirely. Coaches and gym managers could run the interface on a Windows or
macOS machine to review athlete data, configure curricula, and export reports without
the robot being present. The GUI codebase is structured around the GuiBridge
abstraction layer, so a desktop build with a remote bridge back to the robot is a
natural extension of the existing architecture. This is over-engineered for a
single-unit setup but practical for a commercial product.
ML-Derived Sparring Sequences from Boxing Video
The current Markov chain state transition matrices were authored manually based on boxing coaching principles. A model trained on labelled boxing footage could extract real punch sequences and transition probabilities from recorded matches, producing a sequence generator grounded in observed sparring behaviour rather than editorial judgement. Public boxing archives provide a large candidate corpus; the engineering work is in frame-level punch labelling and sequence extraction. The resulting transition matrices could replace the current hand-authored definitions, making the sparring engine considerably more stylistically varied at higher difficulty levels. This is over-engineered for the current system but becomes viable when scaling to support a wider range of boxing styles and skill levels.
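The hand-authored matrices this work would replace can be sketched as a simple weighted sampler. The punch set and probabilities below are invented for illustration; an ML pipeline would substitute learned values into the same structure:

```python
import random

# Illustrative hand-authored transition matrix: probability of the next
# punch given the current one. Each row's weights sum to 1.0.
TRANSITIONS = {
    "jab":      {"jab": 0.2, "cross": 0.5, "hook": 0.3},
    "cross":    {"jab": 0.4, "hook": 0.4, "uppercut": 0.2},
    "hook":     {"jab": 0.5, "cross": 0.3, "uppercut": 0.2},
    "uppercut": {"jab": 0.6, "cross": 0.4},
}

def generate_sequence(start, length, rng=random):
    """Walk the Markov chain from a starting punch."""
    sequence = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[sequence[-1]]
        punches, weights = zip(*options.items())
        sequence.append(rng.choices(punches, weights=weights)[0])
    return sequence

print(generate_sequence("jab", 4))  # e.g. ['jab', 'cross', 'hook', 'jab']
```

Replacing TRANSITIONS with matrices extracted from labelled footage leaves the generator itself unchanged, which is why the swap is architecturally cheap.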
ML-Seeded Combo Curriculum
The current 50-combination curriculum was authored manually, drawing on boxing coaching principles and common pad-work patterns. If a video-based sequence extraction pipeline of the kind described above were developed, the extracted sequences could seed an expanded curriculum: combinations that real boxers throw most frequently would be included as core drills, while less common sequences would appear in higher tiers. This would ground curriculum design in empirical frequency data rather than editorial selection, and would allow the curriculum to grow automatically as more footage is processed. Again, this is over-engineered for the current 50-combo scope but appropriate for a larger training product.