Sokoban State Space Analysis:
AI Solver Performance & Complexity
AMOD 5610/5620Y Big Data Analytics — Final Deliverable
Saud Ahmed Abbasi
Introduction

Sokoban is a classic puzzle game where a player pushes boxes on a grid to designated target locations — boxes can only be pushed (never pulled), making it a PSPACE-complete planning problem.

A Sokoban-style game built in Godot 4 with eight mechanics — including block duplication — also runs as a data pipeline. A Python solver reads CSV grid exports and sends each level to beam search or parallel priority-queue search across up to 9 threads.
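A minimal sketch of that hand-off, assuming a hypothetical `levels/` export folder, tile-code CSV cells, and an illustrative routing rule (the project's actual file layout and dispatch logic may differ):

```python
import csv
from pathlib import Path

def load_grid(csv_path):
    """Read one exported level grid; each CSV cell is assumed to hold a tile code."""
    with open(csv_path, newline="") as f:
        return [row for row in csv.reader(f)]

def pick_strategy(level_id):
    """Illustrative routing: B-series levels go to beam search, everything
    else to the 9-thread parallel priority-queue search."""
    return "beam" if level_id.startswith("B") else "parallel_queue"

for path in sorted(Path("levels").glob("*.csv")):   # hypothetical export folder
    grid = load_grid(path)
    print(path.stem, pick_strategy(path.stem), f"{len(grid)}x{len(grid[0])}")
```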

The question: does an AI solver's own behaviour tell us more about puzzle complexity than the puzzle's structure does?

Research Questions
1. How do mechanics & level design generate implicit complexity as a data-producing system?
2. What does structured gameplay data reveal about solver behaviour and difficulty?
3. How does mechanic complexity affect AI solver performance across levels?
Method
Pipeline: Godot 4 / GDScript hosts 35 levels, each exported as a CSV grid. A DataLogger singleton (GDScript) persists across scenes and writes 5 CSV tables. Levels are played two ways: a human player in direct gameplay, or the Python AI solver (beam & priority-queue search, 9 threads) reading the level grid CSVs, with an AIController replaying the AI's solution in Godot. Both paths produce event CSV records, which form the structured analysis dataset (sessions, levels, objects, events, dup_events) used for the research question analysis.

Design a puzzle. You won't know how hard it is until the AI breaks a sweat.

Two AI solvers across 35 levels show that object counts barely predict effort. What drove difficulty was how the AI played — duplication depth, board geometry, and search strategy left far stronger signals in the data than the level design ever did.
Key Insights
Puzzle structure barely predicts AI effort
OLS regression: block & button counts explain under 4% of effort variance (R²=0.031, p=0.31). Layout geometry dominates.
Solver behaviour is a far stronger signal
Action mix, RESET rate & duplication depth outpredict level structure. B011 (1 block, 1 button) took 3,135 actions — the hardest level by effort.
Duplication depth: non-linear preference
Solver strongly favoured originals & 1st-gen copies; branching Δ peaks at ~3.82 for R=1 shots. Deeper blocks land in tighter positions with fewer valid cells.
Puzzle Mechanics — 8 Types
Block Pushing: PSPACE-complete core
Charged Push: directional force modifier
Duplication Shot: up to 4 copies at radius r (sketched after this list)
Charged Dup.: wider-radius variant
Floor-Fill (FB): fills hole tiles with floor
Hole-Blocks (HB): punches holes in floor
Switchables: toggle object interactions
Map Movers: shift entire tile regions
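To make the duplication mechanic concrete, here is a small sketch of a candidate-cell search: enumerate the ring of cells at radius r around a block, keep only free floor cells, and cap the result at 4 copies. The tile codes, the square (Chebyshev) ring, and the cap order are assumptions rather than the game's exact rule; the branching Δ analysed under RQ1 is then just how many copies actually land.

```python
def duplication_candidates(grid, block, radius, max_copies=4):
    """Free floor cells on the ring exactly `radius` steps from `block`,
    capped at `max_copies`. Ring shape and cap order are assumptions."""
    x, y = block
    ring = [(x + dx, y + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if max(abs(dx), abs(dy)) == radius]
    free = [(cx, cy) for cx, cy in ring
            if 0 <= cy < len(grid) and 0 <= cx < len(grid[0])
            and grid[cy][cx] == "FLOOR"]
    return free[:max_copies]

# Toy 5x5 board: floor interior, walls around the edge, one block in the centre.
grid = [["WALL"] * 5,
        ["WALL", "FLOOR", "FLOOR", "FLOOR", "WALL"],
        ["WALL", "FLOOR", "BLOCK", "FLOOR", "WALL"],
        ["WALL", "FLOOR", "FLOOR", "FLOOR", "WALL"],
        ["WALL"] * 5]
print(len(duplication_candidates(grid, (2, 2), radius=1)))  # 4: ring is free, capped at 4
print(len(duplication_candidates(grid, (2, 2), radius=2)))  # 0: the R=2 ring is all wall
```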
AI Solver Design — Two Strategies
Beam Search
B-World — standard blocks & buttons
Keeps a ranked shortlist of the most promising board states and drops the rest. When it runs dry, the shortlist grows and the search restarts.
Shortlist size per retry: 10 → 25 → 50 → 100
Score: button coverage + dup reach + label match − wall-trap penalty
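A compressed sketch of that retry loop, assuming hypothetical `successors`, `is_solved`, and `score` callables and hashable states; the scoring terms listed above (button coverage, duplication reach, label match, wall-trap penalty) would live inside `score`.

```python
import heapq

def beam_search(start, successors, is_solved, score, widths=(10, 25, 50, 100)):
    """Keep only the `width` best-scoring states at each depth; if a pass runs
    dry without solving, restart the search with the next, wider shortlist."""
    for width in widths:
        frontier, seen = [start], {start}
        while frontier:
            done = [s for s in frontier if is_solved(s)]
            if done:
                return done[0]
            candidates = [n for s in frontier for n in successors(s) if n not in seen]
            seen.update(candidates)
            frontier = heapq.nlargest(width, candidates, key=score)  # the shortlist
    return None  # every retry ran dry
```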
Parallel Priority Queue
FB / HB / Mixed — floor-fill & hole-blocks
9 threads, each seeded with a different opening move. They share one global state counter — first to find a solution stops the rest.
Opening-move seeds, one per thread: none, U, D, L, R, U+R, U+L, D+R, D+L
Shared 10M-state budget. Best partial path saved if unsolved.
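A simplified threading sketch of that scheme, with a shared expansion counter, a first-wins stop signal, and a crude best-partial record. The `itertools.count` counter and plain dict are adequate for illustration under CPython's GIL; the real solver's bookkeeping, and whether it uses threads or processes, may differ.

```python
import heapq
import itertools
import threading

SEEDS = ["none", "U", "D", "L", "R", "U+R", "U+L", "D+R", "D+L"]  # one per thread
BUDGET = 10_000_000                 # shared state-expansion budget
expanded = itertools.count()        # single global counter shared by all threads
solved = threading.Event()          # first thread to solve stops the other eight
best = {}                           # best partial path, kept if nothing solves

def worker(seed, start, successors, is_solved, score):
    """Best-first (priority-queue) search; `start` is assumed to already
    reflect this thread's seed move."""
    tie = itertools.count()
    frontier = [(-score(start), next(tie), start, [seed])]
    while frontier and not solved.is_set():
        if next(expanded) >= BUDGET:                     # shared budget exhausted
            break
        _, _, state, path = heapq.heappop(frontier)
        if is_solved(state):
            best["solution"] = path
            solved.set()                                 # signal the other threads
            return
        if score(state) > best.get("partial_score", float("-inf")):
            best["partial_score"], best["partial_path"] = score(state), path
        for move, nxt in successors(state):
            heapq.heappush(frontier, (-score(nxt), next(tie), nxt, path + [move]))

# Launch pattern (sketch): one thread per seed, then join and read `best`.
# threads = [threading.Thread(target=worker, args=(seed, start, succ, done, score))
#            for seed in SEEDS]
```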
Dataset & Logging Schema

35 levels (29 B-series, 3 C, 3 D) across 22 AI sessions. The solver spawned ~4,100 blocks via duplication, far more than the average of 1.58 blocks placed per level by the level designs. Max depth: 20 generations. All data are synthetic and deterministic.

sessions: session_id, solver_type, levels_completed, time_s (22 rows)
levels: level_id, series, grid_w, grid_h, n_blocks, n_buttons (35 rows)
objects: object_id, level_id, parent_id, spawn_depth, type (~4,100 rows)
events: event_id, session_id, action, pos_x, pos_y, direction (~18,000 rows)
dup_events: dup_id, object_id, radius, branching_delta, depth (~4,100 rows)
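A minimal pandas sketch of how these tables feed the behaviour-side features used later (total events and RESET rate per session, plus the structural proxy per level). File names and the "RESET" action label are assumptions based on the schema above.

```python
import pandas as pd

# The five tables written by the DataLogger (file names are assumptions).
sessions   = pd.read_csv("sessions.csv")
levels     = pd.read_csv("levels.csv")
objects    = pd.read_csv("objects.csv")
events     = pd.read_csv("events.csv")
dup_events = pd.read_csv("dup_events.csv")

# Behaviour-side signals: effort (total events) and RESET rate per session.
behaviour = (events.groupby("session_id")
                   .agg(total_events=("event_id", "count"),
                        reset_rate=("action", lambda a: (a == "RESET").mean())))

# Structure-side proxy per level: block count x button count.
levels["complexity_proxy"] = levels["n_blocks"] * levels["n_buttons"]
```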
RQ1 — Duplication Analysis
Branching Δ by Radius
Mean branching Δ by duplication radius: 3.82 at R=1, 3.53 at R=2, 3.47 at R=3 (Kruskal-Wallis p<0.0001).
Wider radius hits more walls — each step cuts how many new objects actually land. Kruskal-Wallis test: all three radius groups produce statistically distinct branching counts.
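The radius comparison is a three-group Kruskal-Wallis H-test; a short scipy sketch on the dup_events columns from the schema (the CSV file name is an assumption):

```python
import pandas as pd
from scipy.stats import kruskal

dup = pd.read_csv("dup_events.csv")            # columns include radius, branching_delta
groups = [g["branching_delta"].to_numpy()
          for _, g in dup.groupby("radius")]   # one sample per radius (R=1, 2, 3)
h_stat, p_value = kruskal(*groups)
print(f"H={h_stat:.2f}, p={p_value:.2g}")      # reported result: p < 0.0001
```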
Dup. Depth Preference
Histogram: duplication count by depth (0 = original block, bins 0 through 5+); tallest bar 241* (chi-square p<0.0001).
Solver strongly favoured originals and 1st-gen copies. Chi-square test confirms this is a genuine preference, not random — deeper blocks sit in tighter board positions with fewer valid landing cells.
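The depth-preference check is a chi-square goodness-of-fit test on the per-depth counts; the sketch below assumes a uniform null over depth bins, since the actual expected-count model isn't stated.

```python
import pandas as pd
from scipy.stats import chisquare

dup = pd.read_csv("dup_events.csv")                    # includes a `depth` column
observed = dup["depth"].value_counts().sort_index()    # count of dup events per depth
chi2, p_value = chisquare(observed)                    # uniform expectation (assumption)
print(f"chi2={chi2:.1f}, p={p_value:.2g}")             # reported result: p < 0.0001
```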
RQ2 & RQ3 — Effort Analysis
Effort by Series
Total events by design series (B, C, D), with B011 flagged as the extreme outlier (Kruskal-Wallis p=0.026).
B-series difficulty lives in variance, not medians — B011 took 3,135 actions on a proxy score of just 4. Kruskal-Wallis test: the three series have statistically different effort distributions.
Proxy vs. Effort
Total events vs. complexity proxy (0–12), with B011 as the extreme outlier (OLS: R²=0.031, p=0.31).
Nearly flat regression line. OLS regression shows block × button count explains under 4% of effort variance (R²=0.031, p=0.31 — not significant). Layout geometry, not object count, drove difficulty.
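The regression itself is a one-predictor OLS fit; a statsmodels sketch assuming the proxy is the blocks times buttons product described above and a hypothetical per-level effort table (`effort_by_level.csv` with level_id and total_events):

```python
import pandas as pd
import statsmodels.api as sm

levels = pd.read_csv("levels.csv")
effort = pd.read_csv("effort_by_level.csv")        # hypothetical: level_id, total_events

df = levels.merge(effort, on="level_id")
df["complexity_proxy"] = df["n_blocks"] * df["n_buttons"]   # assumed proxy definition

X = sm.add_constant(df[["complexity_proxy"]])
fit = sm.OLS(df["total_events"], X).fit()
print(fit.rsquared, fit.pvalues["complexity_proxy"])        # reported: R²=0.031, p=0.31
```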