Sokoban State Space Analysis:
AI Solver Performance & Complexity
AMOD 5610/5620Y Big Data Analytics — Final Deliverable
Saud Ahmed Abbasi
Introduction

Sokoban is a classic puzzle game where a player pushes boxes on a grid to designated target locations — boxes can only be pushed (never pulled), making it a PSPACE-complete planning problem.

A Sokoban-style game built in Godot 4 with eight mechanics — including block duplication — also runs as a data pipeline. A Python solver reads CSV grid exports and sends each level to beam search or parallel priority-queue search across up to 9 threads.
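A minimal sketch of that hand-off, assuming a hypothetical `levels/` export folder, tile-code CSV cells, and an illustrative routing rule (the project's actual file layout and dispatch logic may differ):

```python
import csv
from pathlib import Path

def load_grid(csv_path):
    """Read one exported level grid; each CSV cell is assumed to hold a tile code."""
    with open(csv_path, newline="") as f:
        return [row for row in csv.reader(f)]

def pick_strategy(level_id):
    """Illustrative routing: B-series levels go to beam search, everything
    else to the 9-thread parallel priority-queue search."""
    return "beam" if level_id.startswith("B") else "parallel_queue"

for path in sorted(Path("levels").glob("*.csv")):   # hypothetical export folder
    grid = load_grid(path)
    print(path.stem, pick_strategy(path.stem), f"{len(grid)}x{len(grid[0])}")
```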

The question: does an AI solver's own behaviour tell us more about puzzle complexity than the puzzle's structure does?

Research Questions
1. How do mechanics & level design generate implicit complexity as a data-producing system?
2. What does structured gameplay data reveal about solver behaviour and difficulty?
3. How does mechanic complexity affect AI solver performance across levels?
Method
Pipeline: Godot 4 / GDScript hosts 35 levels, each exported as a CSV grid. A DataLogger singleton (GDScript) persists across scenes and writes 5 CSV tables. Levels are played two ways: a human player in direct gameplay, or the Python AI solver (beam & priority-queue search, 9 threads) reading the level grid CSVs, with an AIController replaying the AI's solution in Godot. Both paths produce event CSV records, which form the structured analysis dataset (sessions, levels, objects, events, dup_events) used for the research question analysis.

Design a puzzle. You won't know how hard it is until the AI breaks a sweat.

Two AI solvers across 35 levels show that object counts barely predict effort. What drove difficulty was how the AI played — duplication depth, board geometry, and search strategy left far stronger signals in the data than the level design ever did.
Key Insights
Puzzle structure barely predicts AI effort
OLS regression: block & button counts explain under 4% of effort variance (R²=0.031, p=0.31). Layout geometry dominates.
Solver behaviour is a far stronger signal
Action mix, RESET rate & duplication depth outpredict level structure. B011 (1 block, 1 button) took 3,135 actions — the hardest level by effort.
Duplication depth: non-linear preference
Solver strongly favoured originals & 1st-gen copies; branching Δ peaks at ~3.82 for R=1 shots. Deeper blocks land in tighter positions with fewer valid cells.
Puzzle Mechanics — 8 Types
Block Pushing: PSPACE-complete core
Charged Push: directional force modifier
Duplication Shot: up to 4 copies at radius r (sketched after this list)
Charged Dup.: wider-radius variant
Floor-Fill (FB): fills hole tiles with floor
Hole-Blocks (HB): punches holes in floor
Switchables: toggle object interactions
Map Movers: shift entire tile regions
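To make the duplication mechanic concrete, here is a small sketch of a candidate-cell search: enumerate the ring of cells at radius r around a block, keep only free floor cells, and cap the result at 4 copies. The tile codes, the square (Chebyshev) ring, and the cap order are assumptions rather than the game's exact rule; the branching Δ analysed under RQ1 is then just how many copies actually land.

```python
def duplication_candidates(grid, block, radius, max_copies=4):
    """Free floor cells on the ring exactly `radius` steps from `block`,
    capped at `max_copies`. Ring shape and cap order are assumptions."""
    x, y = block
    ring = [(x + dx, y + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if max(abs(dx), abs(dy)) == radius]
    free = [(cx, cy) for cx, cy in ring
            if 0 <= cy < len(grid) and 0 <= cx < len(grid[0])
            and grid[cy][cx] == "FLOOR"]
    return free[:max_copies]

# Toy 5x5 board: floor interior, walls around the edge, one block in the centre.
grid = [["WALL"] * 5,
        ["WALL", "FLOOR", "FLOOR", "FLOOR", "WALL"],
        ["WALL", "FLOOR", "BLOCK", "FLOOR", "WALL"],
        ["WALL", "FLOOR", "FLOOR", "FLOOR", "WALL"],
        ["WALL"] * 5]
print(len(duplication_candidates(grid, (2, 2), radius=1)))  # 4: ring is free, capped at 4
print(len(duplication_candidates(grid, (2, 2), radius=2)))  # 0: the R=2 ring is all wall
```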
AI Solver Design — Two Strategies
Beam Search
B-World — standard blocks & buttons
Keeps a ranked shortlist of the most promising board states and drops the rest. When it runs dry, the shortlist grows and the search restarts.
Shortlist size per retry: 10 → 25 → 50 → 100
Score: button coverage + dup reach + label match − wall-trap penalty
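A compressed sketch of that retry loop, assuming hypothetical `successors`, `is_solved`, and `score` callables and hashable states; the scoring terms listed above (button coverage, duplication reach, label match, wall-trap penalty) would live inside `score`.

```python
import heapq

def beam_search(start, successors, is_solved, score, widths=(10, 25, 50, 100)):
    """Keep only the `width` best-scoring states at each depth; if a pass runs
    dry without solving, restart the search with the next, wider shortlist."""
    for width in widths:
        frontier, seen = [start], {start}
        while frontier:
            done = [s for s in frontier if is_solved(s)]
            if done:
                return done[0]
            candidates = [n for s in frontier for n in successors(s) if n not in seen]
            seen.update(candidates)
            frontier = heapq.nlargest(width, candidates, key=score)  # the shortlist
    return None  # every retry ran dry
```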
Parallel Priority Queue
FB / HB / Mixed — floor-fill & hole-blocks
9 threads, each seeded with a different opening move. They share one global state counter — first to find a solution stops the rest.
Opening-move seeds, one per thread: none, U, D, L, R, U+R, U+L, D+R, D+L
Shared 10M-state budget. Best partial path saved if unsolved.
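A simplified threading sketch of that scheme, with a shared expansion counter, a first-wins stop signal, and a crude best-partial record. The `itertools.count` counter and plain dict are adequate for illustration under CPython's GIL; the real solver's bookkeeping, and whether it uses threads or processes, may differ.

```python
import heapq
import itertools
import threading

SEEDS = ["none", "U", "D", "L", "R", "U+R", "U+L", "D+R", "D+L"]  # one per thread
BUDGET = 10_000_000                 # shared state-expansion budget
expanded = itertools.count()        # single global counter shared by all threads
solved = threading.Event()          # first thread to solve stops the other eight
best = {}                           # best partial path, kept if nothing solves

def worker(seed, start, successors, is_solved, score):
    """Best-first (priority-queue) search; `start` is assumed to already
    reflect this thread's seed move."""
    tie = itertools.count()
    frontier = [(-score(start), next(tie), start, [seed])]
    while frontier and not solved.is_set():
        if next(expanded) >= BUDGET:                     # shared budget exhausted
            break
        _, _, state, path = heapq.heappop(frontier)
        if is_solved(state):
            best["solution"] = path
            solved.set()                                 # signal the other threads
            return
        if score(state) > best.get("partial_score", float("-inf")):
            best["partial_score"], best["partial_path"] = score(state), path
        for move, nxt in successors(state):
            heapq.heappush(frontier, (-score(nxt), next(tie), nxt, path + [move]))

# Launch pattern (sketch): one thread per seed, then join and read `best`.
# threads = [threading.Thread(target=worker, args=(seed, start, succ, done, score))
#            for seed in SEEDS]
```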
Dataset & Logging Schema

35 levels (29 B-series, 3 C, 3 D) across 22 AI sessions. The solver spawned ~4,100 blocks via duplication, far more than the average of 1.58 blocks placed per level by the level designs. Max depth: 20 generations. All data are synthetic and deterministic.

sessions: session_id, solver_type, levels_completed, time_s (22 rows)
levels: level_id, series, grid_w, grid_h, n_blocks, n_buttons (35 rows)
objects: object_id, level_id, parent_id, spawn_depth, type (~4,100 rows)
events: event_id, session_id, action, pos_x, pos_y, direction (~18,000 rows)
dup_events: dup_id, object_id, radius, branching_delta, depth (~4,100 rows)
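A minimal pandas sketch of how these tables feed the behaviour-side features used later (total events and RESET rate per session, plus the structural proxy per level). File names and the "RESET" action label are assumptions based on the schema above.

```python
import pandas as pd

# The five tables written by the DataLogger (file names are assumptions).
sessions   = pd.read_csv("sessions.csv")
levels     = pd.read_csv("levels.csv")
objects    = pd.read_csv("objects.csv")
events     = pd.read_csv("events.csv")
dup_events = pd.read_csv("dup_events.csv")

# Behaviour-side signals: effort (total events) and RESET rate per session.
behaviour = (events.groupby("session_id")
                   .agg(total_events=("event_id", "count"),
                        reset_rate=("action", lambda a: (a == "RESET").mean())))

# Structure-side proxy per level: block count x button count.
levels["complexity_proxy"] = levels["n_blocks"] * levels["n_buttons"]
```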
RQ1 — Duplication Analysis
Branching Δ by Radius
Mean branching Δ by duplication radius: 3.82 at R=1, 3.53 at R=2, 3.47 at R=3 (Kruskal-Wallis p<0.0001).
Wider radius hits more walls — each step cuts how many new objects actually land. Kruskal-Wallis test: all three radius groups produce statistically distinct branching counts.
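The radius comparison is a three-group Kruskal-Wallis H-test; a short scipy sketch on the dup_events columns from the schema (the CSV file name is an assumption):

```python
import pandas as pd
from scipy.stats import kruskal

dup = pd.read_csv("dup_events.csv")            # columns include radius, branching_delta
groups = [g["branching_delta"].to_numpy()
          for _, g in dup.groupby("radius")]   # one sample per radius (R=1, 2, 3)
h_stat, p_value = kruskal(*groups)
print(f"H={h_stat:.2f}, p={p_value:.2g}")      # reported result: p < 0.0001
```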
Dup. Depth Preference
Histogram: duplication count by depth (0 = original block, bins 0 through 5+); tallest bar 241* (chi-square p<0.0001).
Solver strongly favoured originals and 1st-gen copies. Chi-square test confirms this is a genuine preference, not random — deeper blocks sit in tighter board positions with fewer valid landing cells.
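The depth-preference check is a chi-square goodness-of-fit test on the per-depth counts; the sketch below assumes a uniform null over depth bins, since the actual expected-count model isn't stated.

```python
import pandas as pd
from scipy.stats import chisquare

dup = pd.read_csv("dup_events.csv")                    # includes a `depth` column
observed = dup["depth"].value_counts().sort_index()    # count of dup events per depth
chi2, p_value = chisquare(observed)                    # uniform expectation (assumption)
print(f"chi2={chi2:.1f}, p={p_value:.2g}")             # reported result: p < 0.0001
```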
RQ2 & RQ3 — Effort Analysis
Effort by Series
Total events by design series (B, C, D), with B011 flagged as the extreme outlier (Kruskal-Wallis p=0.026).
B-series difficulty lives in variance, not medians — B011 took 3,135 actions on a proxy score of just 4. Kruskal-Wallis test: the three series have statistically different effort distributions.
Proxy vs. Effort
Total events vs. complexity proxy (0–12), with B011 as the extreme outlier (OLS: R²=0.031, p=0.31).
Nearly flat regression line. OLS regression shows block × button count explains under 4% of effort variance (R²=0.031, p=0.31 — not significant). Layout geometry, not object count, drove difficulty.
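The regression itself is a one-predictor OLS fit; a statsmodels sketch assuming the proxy is the blocks times buttons product described above and a hypothetical per-level effort table (`effort_by_level.csv` with level_id and total_events):

```python
import pandas as pd
import statsmodels.api as sm

levels = pd.read_csv("levels.csv")
effort = pd.read_csv("effort_by_level.csv")        # hypothetical: level_id, total_events

df = levels.merge(effort, on="level_id")
df["complexity_proxy"] = df["n_blocks"] * df["n_buttons"]   # assumed proxy definition

X = sm.add_constant(df[["complexity_proxy"]])
fit = sm.OLS(df["total_events"], X).fit()
print(fit.rsquared, fit.pvalues["complexity_proxy"])        # reported: R²=0.031, p=0.31
```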