
Testing and Evaluation

This section presents the functional testing conducted during development iterations and user testing observations from the public showcase event.

V-Model alignment: This page corresponds to the right side of the V, where GUI requirements are verified against measurable outcomes.

The test results are also product feedback. They show whether the GUI felt usable as a training tool, not just whether the implementation was technically correct.

Validation Loop: Testing closed the loop between requirements, user feedback, and the final training experience: GUI requirements defined in Section 5.1 → functional tests (pass/fail behaviour checks) → user testing (showcase observations) → design revision (larger display, clearer hierarchy, verified flow).

Developer Functional Testing

Testing was conducted incrementally throughout development. At each iteration, new features were verified by navigating the full affected flow and checking both the expected case and known failure modes. After each development session, a standard audit was run: navigate every affected page, confirm all PageIndex constants matched their addWidget positions, and verify that surrounding features were unaffected. The cases below document defects discovered and resolved through this process.

TC | Category | Observation | Root Cause | Resolution
TC-01 | Navigation integrity | Performance page and Others page showed blank screens after sparring pages were added | Sparring pages inserted at index 22 in the addWidget block, shifting all subsequent page indices | Full index audit; block restored to correct order
TC-02 | Navigation integrity | Five pages (Performance, Others, PowerInstructions, StaminaInstructions, ReactionInstructions) showed blank screens after an edit session | Page instantiation lines replaced with QWidget() stubs during automated editing; detection required searching MainWindow.__init__ for all self.x = QWidget() patterns | Pages restored to correct class instantiations; audit pattern added to post-edit checklist
TC-03 | Navigation integrity | App crashed on launch with RecursionError (992 frames) after circular import fix | PunchCombinationPage.__init__ called on_difficulty_clicked("Beginner") immediately at construction rather than connecting it as a signal callback | All button connections converted to lambda callbacks
TC-04 | Navigation integrity | Self-Select button navigated to the wrong page; console output showed "False button clicked" | Qt's clicked signal passes a boolean checked argument that overrides lambda default parameters | Pattern changed to lambda checked=False: self.method("arg") throughout all navigation lambdas
TC-05 | Feature regression | Self-Select page showed Basic Parameters content after an edit session | Entire SelfSelectSequencePage class silently replaced with a simplified version missing the numpad, defense buttons, and reorder list | Class restored from last confirmed working version; verified by checking for sequence_input, confirm_sequence, and move_sequence_up identifiers
TC-06 | Data layer | Combo training sessions failed to save after the CV placeholder was wired to return 0-5 scores | SQLite CHECK(score BETWEEN 0.0 AND 1.0) constraint rejected scores above 1.0 | Schema rebuilt with CHECK(score BETWEEN 0.0 AND 5.0); mastery threshold updated to 3.0/5.0
TC-07 | Navigation integrity | Back button on Performance History always returned to Settings regardless of entry path | Hardcoded back destinations on every page with no runtime state tracking | Navigation stack implemented in Iteration 5; entry point recorded before every transition
TC-08 | Feature regression | Test account persisted in users.json after pressing Back on the proficiency checklist | self._username was an empty string at Back press because the load_for_user() call sequence was incorrect in the signup flow | Call sequence corrected; verified by creating and discarding a test account
TC-09 | Feature regression | Fixed-size buttons on ProfiencyChecklistPage appeared at full navigation size (360x65 px) despite setFixedSize() calls | setup_navigation() in ButtonNavigationMixin applies min-width/min-height stylesheet overrides to all QPushButton children after construction | SKIP_NAV_SETUP = True class attribute added; inline min-width: 1px; min-height: 1px overrides applied
TC-10 | Integration verification | Tooltip text across Homepage and TrainingPage did not match the agreed specification after implementation | Content mismatch between implemented strings and specified tooltip text | Targeted correction pass across all _attach_tooltips methods; each tooltip verified against specification
TC-11 | Data layer | Combo progress showed an inflated percentage after only a few training sessions | Progress formula averaged only attempted combos, excluding unplayed combos from the denominator | Denominator fixed to always use the total combo count at the user's level, so unplayed combos score as 0
TC-12 | Data layer | Session counts in CSV did not match combo attempt counts in the database after migration | Legacy CSV contained entries from all users, mixed together from before per-user isolation was introduced | CSV history reset on migration; per-user SQLite database became the sole source of truth
TC-13 | Feature regression | Beginner-level users could access Sparring mode after the proficiency assessment was added | Proficiency gate logic not yet wired into the SparPage navigation handler | Gate condition added; Beginner users see a lockout message; Intermediate and Advanced proceed normally
TC-14 | Integration verification | LLM chat showed placeholder-looking responses with no indication of whether the model was loaded | Fallback to hardcoded responses was silent; no status message distinguished real LLM output from fallback | Explicit status messages added; chat always shows whether the local model, Anthropic API fallback, or hardcoded fallback is active

Table 5.1.3-1: Developer Functional Test Cases (incremental, across all six iterations)
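TC-03 and TC-04 both come down to how Qt button connections are wired. The pitfall can be simulated in pure Python without Qt itself; the signal class below is a stand-in for QPushButton.clicked, and the page names are illustrative, not the project's actual code:

```python
# Pure-Python simulation of the TC-04 pitfall: Qt's clicked signal emits a
# boolean `checked` argument that overrides a lambda's default parameter.

class FakeClickedSignal:
    """Mimics QPushButton.clicked: emits with a single bool argument."""
    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, checked=False):
        for slot in self._slots:
            slot(checked)

received = []

def navigate(page):
    received.append(page)

# Buggy pattern: the emitted `checked` bool replaces the default "SelfSelect",
# so the handler receives False instead of the intended page name.
buggy = FakeClickedSignal()
buggy.connect(lambda page="SelfSelect": navigate(page))
buggy.emit(False)

# Fixed pattern (TC-04 resolution): absorb `checked` explicitly so the
# real argument is passed untouched.
fixed = FakeClickedSignal()
fixed.connect(lambda checked=False: navigate("SelfSelect"))
fixed.emit(False)

print(received)  # [False, 'SelfSelect']
```

The same discipline resolves TC-03: handlers are always connected as callbacks, never called at construction time, so no navigation logic runs inside __init__.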

Defect Category Distribution (14 test cases across all six iterations, each resolved within the iteration that introduced it):
Navigation integrity: 5 (TC-01, TC-02, TC-03, TC-04, TC-07)
Feature regression: 4 (TC-05, TC-08, TC-09, TC-13)
Data layer: 3 (TC-06, TC-11, TC-12)
Integration verification: 2 (TC-10, TC-14)
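The TC-06 score-scale fix can be sketched with the standard library's sqlite3 module; the table and column names here are illustrative, not the project's actual schema:

```python
import sqlite3

# Sketch of the rebuilt schema after TC-06: scores are stored on the CV
# pipeline's 0-5 scale instead of the original 0-1 scale.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE combo_sessions (
        combo_id INTEGER NOT NULL,
        score    REAL NOT NULL CHECK(score BETWEEN 0.0 AND 5.0)
    )
""")

conn.execute("INSERT INTO combo_sessions VALUES (1, 4.2)")  # accepted

try:
    # Out-of-range score: the CHECK constraint still guards the new range.
    conn.execute("INSERT INTO combo_sessions VALUES (1, 5.5)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The original 0-1 constraint rejected every score the CV placeholder produced above 1.0, which is why the sessions silently failed to save.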

Functional Testing

Functional testing was conducted on the features available at the time of the "Robotics Meets AI Showcase" event in late January 2026. The features tested were the combo curriculum system, the reaction time test, and the navigation stack. Test scenarios verified correct behaviour under normal use conditions.

The focus was on whether the product supported training flow cleanly. Each test asked a practical question: can the boxer start quickly, stay in the session, and get back to progress review without friction?

All functional tests passed. The navigation stack test is worth highlighting: the scenario traced a five-page deep navigation path and verified that pressing back four times returned through each page in the correct reverse order, confirming that the centralised stack handled arbitrary navigation depth correctly.
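The stack behaviour that test verifies can be sketched as follows; the class and method names are illustrative, not the project's actual implementation:

```python
class NavigationStack:
    """Minimal sketch of a centralised back stack (the TC-07 resolution).
    The entry point is recorded before every transition, so Back always
    returns along the actual path taken, at any depth."""

    def __init__(self, start_page):
        self._history = []
        self.current = start_page

    def navigate_to(self, page):
        self._history.append(self.current)  # record entry point first
        self.current = page

    def back(self):
        if self._history:
            self.current = self._history.pop()
        return self.current

# Five-page-deep path, then four Backs in exact reverse order:
nav = NavigationStack("Home")
for page in ["Training", "Combo", "Session", "Results", "History"]:
    nav.navigate_to(page)
print(nav.current)                      # History
print([nav.back() for _ in range(4)])   # ['Results', 'Session', 'Combo', 'Training']
```

Because every page pushes through the same stack, arbitrary entry paths (e.g. reaching Performance History from Training or from Settings) return correctly without hardcoded back destinations.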

Automated Test Suite

The project runs 146 pytest tests across all subsystems. The GUI-relevant tests cover the following areas:

Automated Test Suite Coverage (146 pytest tests across all subsystems; GUI-relevant modules shown below, all passing):
Navigation stack | Arbitrary-depth push/pop; correct reverse-order return | PASS
Multi-user isolation | 5 parallel test accounts; zero cross-contamination | PASS
Combo curriculum | Score accumulation, unlock threshold, tier advancement | PASS
Gamification | XP calculation and rank thresholds across 6 tiers | PASS
Pattern lock auth | SHA-256 hash encode and verify round-trip | PASS
GuiBridge degradation | All service calls return failure cleanly without ROS | PASS

Tests are executed from the project's master Jupyter notebook and run independently of hardware. All GUI tests pass without a connected Jetson or ROS environment.

User Testing: Robotics Meets AI Showcase

Setup

Members of the public physically interacted with BoxBunny at the "Robotics Meets AI Showcase" event in late January 2026, specifically within the Robotics and AI in Work and Play segment. Two displays were available during the demonstration: the original 7-inch touchscreen mounted directly on the robot, and a larger external monitor connected to the same Jetson Orin NX showing the same GUI output.

This setup was useful because it exposed the product under real user conditions. The event showed how the GUI performed when people approached it as a training tool rather than as a demo screen.

Showcase Observations Summary

The following table summarises observations recorded at the showcase. This was a single convenience-sample event, not a structured usability study; no task completion times or formal rating scales were collected. Findings are treated as directional evidence from direct observation rather than statistically validated results. More than 50 people interacted with the robot during the event.

Observation | Users Affected | Severity | Action Taken
Difficulty reading combo prompts on the 7-inch display during active training | Majority of participants who attempted the 7-inch display | High | Display upgraded to 10.1-inch IPS capacitive touchscreen (1280x800)
Text too small to follow comfortably while focusing on punching technique | Several participants (reported verbally) | High | Confirmed the display upgrade decision; proportional layout meant no redesign was required
Participants consistently gravitated toward the larger external monitor for reading results and navigating | All participants who used both displays | Medium | Validated the hardware revision direction before implementation
No readability or navigation complaints reported when using the larger monitor | All who switched to the larger monitor | None | Confirmed adequate font sizing and layout hierarchy on the larger display
Navigation between main menu and training pages required no verbal instruction | All observed participants | None | Navigation hierarchy confirmed intuitive for first-time users; no changes required

Table 5.1.3-3: Showcase Event Observations - "Robotics and AI in Work and Play" (single event, directional evidence only)

Observations

During the session, participants and observers consistently gravitated toward the larger monitor when reading training results and navigating between pages. Users who attempted to operate the 7-inch screen during active training reported difficulty reading combo prompts and statistics. Several participants noted that the text was too small to follow comfortably while also focusing on technique. Users who switched to the larger monitor did not report the same issues.

This confirmed that the product needed stronger readability and a clearer information hierarchy during active training. The issue was not only display size, but how well the GUI supported attention split between movement and screen reading.

Display Upgrade Decision

These observations validated a hardware revision. The 7-inch display was replaced with a 10.1-inch IPS capacitive touchscreen (1280x800 resolution, DSI interface). The upgrade improved text legibility and made the analytics pages easier to read during and after training. Because the GUI used proportional layout sizing rather than fixed pixel coordinates, the existing interface scaled to the new resolution without requiring a redesign cycle.

The decision followed user evidence rather than hardware preference: it was a product change driven by how boxers actually used the interface in training.

Display Hardware Revision: showcase observations drove the upgrade from the 7-inch to the 10.1-inch display.
Before: 7-inch, 1024 × 600. Difficulty reading combo prompts; text too small during active training (showcase observation).
After: 10.1-inch, 1280 × 800, IPS capacitive. No readability complaints reported; navigation clear without verbal instruction.

Performance Criteria Assessment

The table below maps each GUI requirement to its target, verification method, and outcome.

ID | Requirement | Target | Verification | Met?
GUI-1 | Touchscreen and physical button operable: large touch targets, minimal text input, IMU-based button navigation | All interactive elements ≥60 px; full navigation via 4 pads | Functional test; showcase observation; IMU navigation confirmed in live session | Yes
GUI-2 | Multi-user accounts with complete data isolation between users | Zero cross-contamination across accounts | 5 test accounts; pytest isolation tests; no cross-user data observed | Yes
GUI-3 | Structured training progression through a 50-combo curriculum with mastery-based advancement | Unlock after 5 sessions with average score ≥3.0/5.0 | Curriculum algorithm pytest; multiple complete training cycles observed | Yes
GUI-4 | Real-time session data display (combo prompts, round timers, performance metrics) | Combo prompt, timer, and punch counter update within 33 ms (one frame at 30 fps, the DSI display refresh rate) | Observed during live sessions; no visible lag between punch event and counter update; Qt event loop confirmed single-threaded update on signal receipt | Yes
GUI-5 | Hardware-independent development via an abstracted integration layer with mock interfaces | All pages navigable and all features operable with no serial hardware connected; mock-interface mode activates automatically on hardware-absent startup; no import errors or page crashes on laptop | Entire development cycle conducted on laptop; mock-interface tests confirmed expected behaviour | Yes

Table 5.1.3-4: GUI Performance Criteria Assessment
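The GUI-3 mastery gate and the TC-11 denominator fix can both be sketched in a few lines; the function names and score layout here are assumptions, not the project's actual code:

```python
def combo_unlocked(session_scores, required_sessions=5, threshold=3.0):
    """GUI-3 mastery gate: unlock the next combo after at least
    `required_sessions` sessions with a mean score >= threshold (0-5 scale)."""
    if len(session_scores) < required_sessions:
        return False
    return sum(session_scores) / len(session_scores) >= threshold

def level_progress(per_combo_scores, total_combos):
    """TC-11 fix: average over ALL combos at the user's level, so
    unplayed combos count as 0 instead of inflating the percentage."""
    attempted = sum(per_combo_scores.values())
    return attempted / (total_combos * 5.0) * 100  # 5.0 = max score per combo

print(combo_unlocked([3.5, 3.0, 4.0, 2.5, 3.2]))                      # True
print(round(level_progress({"jab-cross": 4.0}, total_combos=10), 1))  # 8.0
```

Before the TC-11 fix, the single 4.0-scoring combo above would have reported 80% progress (4.0 out of 5.0 averaged over one attempted combo) rather than 8% of the level.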

Requirements Verification

Each GUI requirement is mapped to its verification evidence in Table 5.1.3-4 above.

Limitations

The following limitations apply to the testing evidence presented in this section. They are stated to give an accurate picture of what the results can and cannot claim.

User Testing Scope

The showcase event was a single convenience-sample demonstration, not a structured usability study with recruited participants. No task completion times, error counts, or structured rating scales (such as the System Usability Scale) were collected. Observations reflect what was directly visible during approximately two hours of public interaction at one event. Conclusions drawn from this data are directional only and cannot be generalised to the full target user population.

Hardware Access During Development

The GUI was developed predominantly on a Windows laptop using mock interfaces. Integration testing against live robot hardware was limited by hardware availability during the development phase. Some edge cases in serial communication, CV handshake timing, and multi-subsystem state transitions may not have been surfaced under the simulation conditions used.

Reflection

What Worked

Iteration Control and Defect Containment

The iterative development approach was the correct call for this project. Every significant defect discovered during testing was caught and resolved within the iteration that introduced it. The score scale mismatch (TC-06), navigation stack bug (TC-07), and blank page pattern (TC-02) would all have been costly to fix in a later phase. Catching them early kept the codebase stable.

Mock Interfaces and Integration Readiness

The mock interface architecture was the single most valuable structural decision. By abstracting all hardware communication behind a GuiBridge layer with a swap-in mock, the full GUI was runnable and testable on a Windows laptop throughout development. This meant that six complete iterations were validated before any physical hardware connection was required. The integration work at the end of the project was straightforward precisely because the interfaces had been designed and tested in isolation first.
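A minimal sketch of this pattern, assuming a simple call-based service API; the real GuiBridge interface is not shown here, and the class and method names are illustrative:

```python
class MockRosClient:
    """Stands in for the ROS service layer when no hardware is attached."""
    def call(self, service, payload):
        # Mirror the real client's return shape, but always fail cleanly,
        # so pages render and navigate normally without a robot.
        return {"ok": False, "error": "no ROS environment", "data": None}

class GuiBridge:
    """Hedged sketch of the abstraction layer: the GUI only ever talks to
    the bridge, so swapping the mock in is invisible to every page."""
    def __init__(self, ros_client=None):
        self._client = ros_client or MockRosClient()

    def start_session(self, combo_id):
        return self._client.call("start_session", {"combo_id": combo_id})

bridge = GuiBridge()  # no hardware: mock activates automatically
result = bridge.start_session(combo_id=7)
print(result["ok"], result["error"])
```

On the robot, the same GuiBridge is constructed with the real client; no page code changes between the laptop and the Jetson.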

Data Isolation Confidence

The per-user SQLite database design also performed well under test. Five test accounts were created and exercised across training, performance, and sparring flows with no observed cross-user data contamination.

Future Work

Future Work Overview: near-term improvements to the current system, and concepts viable at larger deployment scale.
Current system (gap in evidence, or presentation layer only): Structured Usability Study (recruited gym members, SUS questionnaire, timed task scenarios on the physical robot); Expanded Performance Analytics (trend charts for power, stamina fatigue, and reaction time improvement across sessions).
At gym-network scale (over-engineered for a single unit, natural at commercial scale): Cloud-Synced User Data (multi-unit sync, remote coach access); Distributable Desktop App (standalone executable, cross-platform); ML Sparring Sequences (video-derived Markov chain transitions); ML-Seeded Curriculum (empirical combo sequences from footage).

Structured Usability Study

The most significant gap in the current evidence base is the absence of structured user testing. A follow-on study with 8 to 12 recruited boxing gym members, using timed task scenarios (login to session start, combo training completion, results review) and a System Usability Scale questionnaire, would provide quantitative usability data and allow comparison against benchmark scores. This should be conducted on the physical robot in a gym environment, not in a lab.
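The SUS scoring such a study would report is standard: ten items rated 1 to 5, where odd (positively worded) items contribute the rating minus 1, even (negatively worded) items contribute 5 minus the rating, and the sum is scaled by 2.5 to a 0-100 score:

```python
def sus_score(responses):
    """System Usability Scale score from 10 item ratings (each 1-5).
    Odd-numbered items are positively worded, even-numbered negatively."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0 (best possible)
print(sus_score([3] * 10))                         # 50.0 (neutral midpoint)
```

Scores around 68 are commonly treated as the benchmark average, which is what would give the comparison point the current showcase observations lack.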

Expanded Performance History Analytics

The current history pages display tabular results. Trend visualisation across sessions (power trend, stamina fatigue curve, reaction time improvement) would provide more actionable insight for users tracking progress over weeks. The data is already being stored per-user; the work is in the presentation layer only.

Cloud-Synced User Data

The per-user SQLite file approach is well suited to a single-unit deployment but is not designed for scale. A cloud-backed storage layer would allow a user's training history to follow them across multiple BoxBunny units at different gyms, and would enable coaches to review athlete progress remotely without needing physical access to the robot. The per-user folder structure already maps cleanly to a per-document model in a cloud database. This is over-engineered for the current single-unit context but becomes the natural architecture at gym-network scale.

Distributable Desktop Application

The current GUI is a Python library that requires the full development environment to run on the Jetson. For a larger deployment, packaging the analytics and configuration interface as a standalone executable would decouple it from the robot hardware entirely. Coaches and gym managers could run the interface on a Windows or macOS machine to review athlete data, configure curricula, and export reports without the robot being present. The GUI codebase is structured around the GuiBridge abstraction layer, so a desktop build with a remote bridge back to the robot is a natural extension of the existing architecture. This is over-engineered for a single-unit setup but practical for a commercial product.

ML-Derived Sparring Sequences from Boxing Video

The current Markov chain state transition matrices were authored manually based on boxing coaching principles. A model trained on labelled boxing footage could extract real punch sequences and transition probabilities from recorded matches, producing a sequence generator grounded in observed sparring behaviour rather than editorial judgement. Public boxing archives provide a large candidate corpus; the engineering work is in frame-level punch labelling and sequence extraction. The resulting transition matrices could replace the current hand-authored definitions, making the sparring engine considerably more stylistically varied at higher difficulty levels. This is over-engineered for the current system but becomes viable when scaling to support a wider range of boxing styles and skill levels.
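For context, the hand-authored approach being replaced can be sketched as a small Markov chain; the punch labels and transition weights here are illustrative, not the project's real definitions:

```python
import random

# Hand-authored transition matrix in the style the sparring engine uses today;
# an ML pipeline would instead learn these weights from labelled footage.
TRANSITIONS = {
    "jab":      {"jab": 0.2, "cross": 0.5, "hook": 0.3},
    "cross":    {"jab": 0.4, "hook": 0.4, "uppercut": 0.2},
    "hook":     {"jab": 0.5, "cross": 0.3, "uppercut": 0.2},
    "uppercut": {"jab": 0.7, "cross": 0.3},
}

def generate_sequence(start, length, seed=0):
    """Walk the chain: each punch is sampled from the distribution
    conditioned on the previous punch."""
    rng = random.Random(seed)  # seeded for reproducible drills
    seq = [start]
    for _ in range(length - 1):
        moves, weights = zip(*TRANSITIONS[seq[-1]].items())
        seq.append(rng.choices(moves, weights=weights)[0])
    return seq

print(generate_sequence("jab", 6))  # deterministic for a given seed
```

The ML proposal replaces only the numbers in TRANSITIONS; the generator and the sparring engine around it stay unchanged.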

ML-Seeded Combo Curriculum

The current 50-combination curriculum was authored manually, drawing on boxing coaching principles and common pad-work patterns. If a video-based sequence extraction pipeline of the kind described above were developed, the extracted sequences could seed an expanded curriculum: combinations that real boxers throw most frequently would be included as core drills, while less common sequences would appear in higher tiers. This would ground curriculum design in empirical frequency data rather than editorial selection, and would allow the curriculum to grow automatically as more footage is processed. Again, this is over-engineered for the current 50-combo scope but appropriate for a larger training product.