Testing and Evaluation
This section presents the functional testing conducted during development iterations and user testing observations from the public showcase event.
V-Model alignment: This page corresponds to the right side of the V, where GUI requirements are verified against measurable outcomes.
The test results are also product feedback. They show whether the GUI felt usable as a training tool, not just whether the implementation was technically correct.
Developer Functional Testing
Testing was conducted incrementally throughout development. At each iteration,
new features were verified by navigating the full affected flow and checking both
the expected case and known failure modes. After each development session,
a standard audit was run: navigate every affected page, confirm all
PageIndex constants matched their addWidget positions,
and verify that surrounding features were unaffected. The cases below document
defects discovered and resolved through this process.
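The index audit described above can be sketched as a small standalone check. The names here (PAGE_INDEX, ADD_WIDGET_ORDER, audit_page_indices) are illustrative stand-ins, not the project's actual identifiers:

```python
# Hypothetical sketch of the post-edit index audit: each PageIndex constant
# must equal its page's position in the addWidget call order.
PAGE_INDEX = {"Homepage": 0, "TrainingPage": 1, "Performance": 2, "Others": 3}
ADD_WIDGET_ORDER = ["Homepage", "TrainingPage", "Performance", "Others"]

def audit_page_indices(page_index, add_widget_order):
    """Return pages whose constant disagrees with its addWidget position
    (an empty list means the audit passes)."""
    return [
        name for name, idx in page_index.items()
        if idx >= len(add_widget_order) or add_widget_order[idx] != name
    ]

mismatches = audit_page_indices(PAGE_INDEX, ADD_WIDGET_ORDER)
print(mismatches)  # → []  (every constant matches its position)
```

Inserting a page mid-block without updating the constants, as in TC-01, makes every subsequent page appear in this list.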
| TC | Category | Observation | Root Cause | Resolution |
|---|---|---|---|---|
| TC-01 | Navigation integrity | Performance page and Others page showed blank screens after sparring pages were added | Sparring pages inserted at index 22 in the addWidget block, shifting all subsequent page indices | Full index audit; block restored to correct order |
| TC-02 | Navigation integrity | Five pages (Performance, Others, PowerInstructions, StaminaInstructions, ReactionInstructions) showed blank screens after an edit session | Page instantiation lines replaced with QWidget() stubs during automated editing; detection required searching MainWindow.__init__ for all self.x = QWidget() patterns | Pages restored to correct class instantiations; audit pattern added to post-edit checklist |
| TC-03 | Navigation integrity | App crashed on launch with RecursionError (992 frames) after circular import fix | PunchCombinationPage.__init__ called on_difficulty_clicked("Beginner") immediately at construction rather than connecting it as a signal callback | All button connections converted to lambda callbacks |
| TC-04 | Navigation integrity | Self-Select button navigated to wrong page; console output showed False button clicked | Qt clicked signal passes a boolean checked argument that overrides lambda default parameters | Pattern changed to lambda checked=False: self.method("arg") throughout all navigation lambdas |
| TC-05 | Feature regression | Self-Select page showed Basic Parameters content after an edit session | Entire SelfSelectSequencePage class silently replaced with a simplified version missing the numpad, defense buttons, and reorder list | Class restored from last confirmed working version; verified by checking for sequence_input, confirm_sequence, and move_sequence_up identifiers |
| TC-06 | Data layer | Combo training sessions failed to save after CV placeholder was wired to return 0-5 scores | SQLite CHECK(score BETWEEN 0.0 AND 1.0) constraint rejected scores above 1.0 | Schema rebuilt with CHECK(score BETWEEN 0.0 AND 5.0); mastery threshold updated to 3.0/5.0 |
| TC-07 | Navigation integrity | Back button on Performance History always returned to Settings regardless of entry path | Hardcoded back destinations on every page with no runtime state tracking | Navigation stack implemented in Iteration 5; entry point recorded before every transition |
| TC-08 | Feature regression | Test account persisted in users.json after pressing Back on the proficiency checklist | self._username was empty string at Back press because load_for_user() call sequence was incorrect in the signup flow | Call sequence corrected; verified by creating and discarding a test account |
| TC-09 | Feature regression | Fixed-size buttons on ProfiencyChecklistPage appeared at full navigation size (360x65 px) despite setFixedSize() calls | setup_navigation() in ButtonNavigationMixin applies min-width/min-height stylesheet overrides to all QPushButton children after construction | SKIP_NAV_SETUP = True class attribute added; inline min-width: 1px; min-height: 1px overrides applied |
| TC-10 | Integration verification | Tooltip text across Homepage and TrainingPage did not match agreed specification after implementation | Content mismatch between implemented strings and specified tooltip text | Targeted correction pass across all _attach_tooltips methods; each tooltip verified against specification |
| TC-11 | Data layer | Combo progress showed inflated percentage after few training sessions | Progress formula averaged only attempted combos, excluding unplayed combos from the denominator | Denominator fixed to always use total combo count at the user's level, so unplayed combos score as 0 |
| TC-12 | Data layer | Session counts in CSV did not match combo attempt counts in database after migration | Legacy CSV contained entries from all users mixed together from before per-user isolation was introduced | CSV history reset on migration; per-user SQLite database became sole source of truth |
| TC-13 | Feature regression | Beginner-level users could access Sparring mode after proficiency assessment was added | Proficiency gate logic not yet wired into SparPage navigation handler | Gate condition added; Beginner users see a lockout message; Intermediate and Advanced proceed normally |
| TC-14 | Integration verification | LLM chat showed placeholder-looking responses with no indication of whether the model was loaded | Fallback to hardcoded responses was silent; no status message distinguishing real LLM output from fallback | Explicit status messages added; chat always shows whether the local model, Anthropic API fallback, or hardcoded fallback is active |
Table 5.1.3-1: Developer Functional Test Cases (incremental, across all six iterations)
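The root cause of TC-04 can be reproduced in plain Python without Qt. The navigate helper below is illustrative; only the lambda patterns mirror the actual fix:

```python
# TC-04 in miniature: Qt's clicked signal invokes the slot with a boolean
# "checked" argument, which silently overrides a lambda's default parameter.
def navigate(page):
    return f"navigating to {page}"

# Buggy pattern: the signal's checked value replaces the intended argument.
buggy = lambda page="SelfSelect": navigate(page)
print(buggy(False))  # → "navigating to False"  (wrong page)

# Fixed pattern used throughout the navigation lambdas: absorb the checked
# argument explicitly so the real argument survives.
fixed = lambda checked=False, page="SelfSelect": navigate(page)
print(fixed(False))  # → "navigating to SelfSelect"
```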
Functional Testing
Functional testing was conducted on the features available at the time of the "Robotics Meets AI Showcase" event in late January 2026. The features tested were the combo curriculum system, the reaction time test, and the navigation stack. Test scenarios verified correct behaviour under normal use conditions.
The focus was on whether the product supported training flow cleanly. Each test asked a practical question: can the boxer start quickly, stay in the session, and get back to progress review without friction?
All functional tests passed. The navigation stack test is worth highlighting: the scenario traced a five-page deep navigation path and verified that pressing back four times returned through each page in the correct reverse order, confirming that the centralised stack handled arbitrary navigation depth correctly.
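The behaviour this test verifies can be sketched as follows. Navigator and its method names are illustrative, not the project's actual classes; the entry-point-before-transition rule matches the TC-07 resolution:

```python
# Minimal sketch of a centralised navigation stack: the entry point is
# recorded before every transition, and Back pops in reverse order.
class Navigator:
    def __init__(self, start):
        self.current = start
        self._stack = []

    def go_to(self, page):
        self._stack.append(self.current)  # record entry point before moving
        self.current = page

    def back(self):
        if self._stack:
            self.current = self._stack.pop()
        return self.current

nav = Navigator("Home")
for page in ("Training", "Combo", "Session", "Results"):
    nav.go_to(page)

# Four Back presses retrace the five-page path in reverse order.
print([nav.back() for _ in range(4)])
# → ['Session', 'Combo', 'Training', 'Home']
```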
Automated Test Suite
The project runs 146 pytest tests across all subsystems. The GUI-relevant tests cover the following areas:
- Navigation stack: arbitrary-depth push and pop sequences verify correct reverse-order page return
- Multi-user data isolation: five parallel test accounts confirm zero cross-contamination across user data
- Combo curriculum mastery algorithm: score accumulation, unlock threshold, and tier advancement logic
- Gamification: XP calculation and rank thresholds across all six rank tiers
- Pattern lock authentication: SHA-256 hash encode and verify round-trip
- GuiBridge offline degradation: all service calls return failure cleanly when the ROS stack is unavailable, confirming the mock interface functions correctly
Tests are executed from the project's master Jupyter notebook and run independently of hardware. All GUI tests pass without a connected Jetson or ROS environment.
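As one example of the round-trips above, the pattern-lock hash check can be sketched with the standard library. The pad-index encoding shown here is an assumption for illustration, not the project's actual scheme:

```python
import hashlib

def encode_pattern(pattern):
    """Hash an ordered sequence of pad indices, e.g. (0, 4, 8, 5)."""
    raw = "-".join(str(p) for p in pattern).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def verify_pattern(pattern, stored_hash):
    """Round-trip check: re-encode the attempt and compare digests."""
    return encode_pattern(pattern) == stored_hash

stored = encode_pattern((0, 4, 8, 5))
print(verify_pattern((0, 4, 8, 5), stored))  # → True
print(verify_pattern((0, 4, 8, 6), stored))  # → False
```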
User Testing: Robotics Meets AI Showcase
Setup
Members of the public physically interacted with BoxBunny at the "Robotics Meets AI Showcase" event in late January 2026, specifically within the Robotics and AI in Work and Play segment. Two displays were available during the demonstration: the original 7-inch touchscreen mounted directly on the robot, and a larger external monitor connected to the same Jetson Orin NX showing the same GUI output.
This setup exposed the product to real user conditions: the event showed how the GUI performed when people approached it as a training tool rather than as a demo screen.
Showcase Observations Summary
The following table summarises observations recorded at the showcase. This was a single convenience-sample event, not a structured usability study: no task completion times or formal rating scales were collected, and findings are treated as directional evidence from direct observation rather than statistically significant results. More than 50 people interacted with the robot during the event.
| Observation | Users Affected | Severity | Action Taken |
|---|---|---|---|
| Difficulty reading combo prompts on the 7-inch display during active training | Majority of participants who attempted the 7-inch display | High | Display upgraded to 10.1-inch IPS capacitive touchscreen (1280x800) |
| Text too small to follow comfortably while focusing on punching technique | Several participants (reported verbally) | High | Confirmed display upgrade decision; proportional layout meant no redesign was required |
| Participants consistently gravitated toward the larger external monitor for reading results and navigating | All participants who used both displays | Medium | Validated the hardware revision direction before implementation |
| No readability or navigation complaints reported when using the larger monitor | All who switched to larger monitor | None | Confirmed adequate font sizing and layout hierarchy on larger display |
| Navigation between main menu and training pages required no verbal instruction | All observed participants | None | Navigation hierarchy confirmed intuitive for first-time users; no changes required |
Table 5.1.3-3: Showcase Event Observations - "Robotics and AI in Work and Play" (single event, directional evidence only)
Observations
During the session, participants and observers consistently gravitated toward the larger monitor when reading training results and navigating between pages. Users who attempted to operate the 7-inch screen during active training reported difficulty reading combo prompts and statistics. Several participants noted that the text was too small to follow comfortably while also focusing on technique. Users who switched to the larger monitor did not report the same issues.
This confirmed that the product needed stronger readability and a clearer information hierarchy during active training. The issue was not only display size, but how well the GUI supported attention split between movement and screen reading.
Display Upgrade Decision
These observations validated a hardware revision. The 7-inch display was replaced with a 10.1-inch IPS capacitive touchscreen (1280x800 resolution, DSI interface). The upgrade improved text legibility and made the analytics pages easier to read during and after training. Because the GUI used proportional layout sizing rather than fixed pixel coordinates, the existing interface scaled to the new resolution without requiring a redesign cycle.
The decision followed user evidence rather than hardware preference: it was a product change driven by how boxers actually used the interface in training.
Performance Criteria Assessment
The table below maps each GUI requirement to its target, verification method, and outcome.
| ID | Requirement | Target | Verification | Met? |
|---|---|---|---|---|
| GUI-1 | Touchscreen and physical button operable: large touch targets, minimal text input, IMU-based button navigation | All interactive elements ≥60px; full navigation via 4 pads | Functional test; showcase observation; IMU nav confirmed in live session | Yes |
| GUI-2 | Multi-user accounts with complete data isolation between users | Zero cross-contamination across accounts | 5 test accounts; pytest isolation tests; no cross-user data observed | Yes |
| GUI-3 | Structured training progression through a 50-combo curriculum with mastery-based advancement | Unlock after 5 sessions with average score ≥3.0/5.0 | Curriculum algorithm pytest; multiple complete training cycles observed | Yes |
| GUI-4 | Real-time session data display (combo prompts, round timers, performance metrics) | Combo prompt, timer, and punch counter update within 33 ms (one frame at the DSI display's 30 fps refresh rate) | Observed during live sessions; no visible lag between punch event and counter update; Qt event loop confirmed single-threaded update on signal receipt | Yes |
| GUI-5 | Hardware-independent development via an abstracted integration layer with mock interfaces | All pages navigable and all features operable with no serial hardware connected; mock-interface mode activates automatically on hardware-absent startup; no import errors or page crashes on laptop | Entire development cycle conducted on laptop; mock-interface tests confirmed expected behavior | Yes |
Table 5.1.3-4: GUI Performance Criteria Assessment
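The GUI-3 mastery rule can be expressed as a short predicate. The function name is illustrative and the per-session score list is an assumed format:

```python
# Sketch of the GUI-3 rule: advancement unlocks once at least five sessions
# have been logged and their average score is >= 3.0 out of 5.0 (the
# threshold set when the schema was rebuilt in TC-06).
MIN_SESSIONS = 5
MASTERY_THRESHOLD = 3.0

def is_mastered(scores):
    """scores: per-session scores for one combo, each in [0.0, 5.0]."""
    if len(scores) < MIN_SESSIONS:
        return False
    return sum(scores) / len(scores) >= MASTERY_THRESHOLD

print(is_mastered([3.5, 3.0, 2.5, 4.0, 3.0]))  # → True  (avg 3.2 over 5 sessions)
print(is_mastered([4.0, 4.5, 5.0]))            # → False (only 3 sessions)
```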
Limitations
The following limitations apply to the testing evidence presented in this section. They are stated to give an accurate picture of what the results can and cannot claim.
User Testing Scope
The showcase event was a single convenience-sample demonstration, not a structured usability study with recruited participants. No task completion times, error counts, or structured rating scales (such as the System Usability Scale) were collected. Observations reflect what was directly visible during approximately two hours of public interaction at one event. Conclusions drawn from this data are directional only and cannot be generalised to the full target user population.
Hardware Access During Development
The GUI was developed predominantly on a Windows laptop using mock interfaces. Integration testing against live robot hardware was limited by hardware availability during the development phase. Some edge cases in serial communication, CV handshake timing, and multi-subsystem state transitions may not have been surfaced under the simulation conditions used.
Reflection
What Worked
Iteration Control and Defect Containment
The iterative development approach was the correct call for this project. Every significant defect discovered during testing was caught and resolved within the iteration that introduced it. The score scale mismatch (TC-06), navigation stack bug (TC-07), and blank page pattern (TC-02) would all have been costly to fix in a later phase. Catching them early kept the codebase stable.
Mock Interfaces and Integration Readiness
The mock interface architecture was the single most valuable structural decision.
By abstracting all hardware communication behind a GuiBridge layer with
a swap-in mock, the full GUI was runnable and testable on a Windows laptop throughout
development. This meant that six complete iterations were validated before any
physical hardware connection was required. The integration work at the end of the
project was straightforward precisely because the interfaces had been designed and
tested in isolation first.
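A minimal sketch of this pattern, with illustrative names rather than the project's actual GuiBridge API:

```python
# Hardware communication sits behind one interface; a mock stands in when
# no hardware is detected, so the full GUI runs on a laptop.
class GuiBridge:
    """Hardware-facing interface; the real build talks to the robot."""
    def start_session(self, combo_id):
        raise NotImplementedError

class MockGuiBridge(GuiBridge):
    """Laptop-mode stand-in: every call succeeds locally, records itself."""
    def __init__(self):
        self.calls = []

    def start_session(self, combo_id):
        self.calls.append(("start_session", combo_id))
        return {"ok": True, "combo_id": combo_id}

def make_bridge(hardware_present):
    # Mock mode activates automatically on hardware-absent startup.
    return GuiBridge() if hardware_present else MockGuiBridge()

bridge = make_bridge(hardware_present=False)
print(bridge.start_session(7))  # → {'ok': True, 'combo_id': 7}
```

The GUI code only ever sees the base interface, which is what made all six iterations testable before any hardware connection existed.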
Data Isolation Confidence
The per-user SQLite database design also performed well under test. Five test accounts were created and exercised across training, performance, and sparring flows with no observed cross-user data contamination.
Future Work
Structured Usability Study
The most significant gap in the current evidence base is the absence of structured user testing. A follow-on study with 8 to 12 recruited boxing gym members, using timed task scenarios (login to session start, combo training completion, results review) and a System Usability Scale questionnaire, would provide quantitative usability data and allow comparison against benchmark scores. This should be conducted on the physical robot in a gym environment, not in a lab.
Expanded Performance History Analytics
The current history pages display tabular results. Trend visualisation across sessions (power trend, stamina fatigue curve, reaction time improvement) would provide more actionable insight for users tracking progress over weeks. The data is already being stored per-user; the work is in the presentation layer only.
Cloud-Synced User Data
The per-user SQLite file approach is well suited to a single-unit deployment but is not designed for scale. A cloud-backed storage layer would allow a user's training history to follow them across multiple BoxBunny units at different gyms, and would enable coaches to review athlete progress remotely without needing physical access to the robot. The per-user folder structure already maps cleanly to a per-document model in a cloud database. This is over-engineered for the current single-unit context but becomes the natural architecture at gym-network scale.
Distributable Desktop Application
The current GUI is a Python library that requires the full development environment
to run on the Jetson. For a larger deployment, packaging the analytics and
configuration interface as a standalone executable would decouple it from the robot
hardware entirely. Coaches and gym managers could run the interface on a Windows or
macOS machine to review athlete data, configure curricula, and export reports without
the robot being present. The GUI codebase is structured around the GuiBridge
abstraction layer, so a desktop build with a remote bridge back to the robot is a
natural extension of the existing architecture. This is over-engineered for a
single-unit setup but practical for a commercial product.
ML-Derived Sparring Sequences from Boxing Video
The current Markov chain state transition matrices were authored manually based on boxing coaching principles. A model trained on labelled boxing footage could extract real punch sequences and transition probabilities from recorded matches, producing a sequence generator grounded in observed sparring behaviour rather than editorial judgement. Public boxing archives provide a large candidate corpus; the engineering work is in frame-level punch labelling and sequence extraction. The resulting transition matrices could replace the current hand-authored definitions, making the sparring engine considerably more stylistically varied at higher difficulty levels. This is over-engineered for the current system but becomes viable when scaling to support a wider range of boxing styles and skill levels.
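The hand-authored matrices this work would replace can be sketched as a simple weighted sampler. The punch set and probabilities below are invented for illustration; an ML pipeline would substitute learned values into the same structure:

```python
import random

# Illustrative hand-authored transition matrix: probability of the next
# punch given the current one. Each row's weights sum to 1.0.
TRANSITIONS = {
    "jab":      {"jab": 0.2, "cross": 0.5, "hook": 0.3},
    "cross":    {"jab": 0.4, "hook": 0.4, "uppercut": 0.2},
    "hook":     {"jab": 0.5, "cross": 0.3, "uppercut": 0.2},
    "uppercut": {"jab": 0.6, "cross": 0.4},
}

def generate_sequence(start, length, rng=random):
    """Walk the Markov chain from a starting punch."""
    sequence = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[sequence[-1]]
        punches, weights = zip(*options.items())
        sequence.append(rng.choices(punches, weights=weights)[0])
    return sequence

print(generate_sequence("jab", 4))  # e.g. ['jab', 'cross', 'hook', 'jab']
```

Replacing TRANSITIONS with matrices extracted from labelled footage leaves the generator itself unchanged, which is why the swap is architecturally cheap.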
ML-Seeded Combo Curriculum
The current 50-combination curriculum was authored manually, drawing on boxing coaching principles and common pad-work patterns. If a video-based sequence extraction pipeline of the kind described above were developed, the extracted sequences could seed an expanded curriculum: combinations that real boxers throw most frequently would be included as core drills, while less common sequences would appear in higher tiers. This would ground curriculum design in empirical frequency data rather than editorial selection, and would allow the curriculum to grow automatically as more footage is processed. Again, this is over-engineered for the current 50-combo scope but appropriate for a larger training product.