This project implements a vectorized Dueling Deep Q-Network (DQN) to train an AI agent to play Snake. It was developed for a 3rd-year university project (SAE) at UPJV (Université de Picardie Jules Verne).
Authors: Stéphane TALAB (@ProGen18) & Mathis PIAULT (@Dova3kin)
- Context
- Features
- Tech Stack
- Architecture
- Neural Network
- State Vector (26 Features)
- Action Selection
- Memory System
- Dashboard
- Configuration
- Getting Started
- Scripts
- Tests
- Troubleshooting
## Context

This project is a SAE (Situation d'Apprentissage et d'Évaluation) for the 3rd year of Computer Science at UPJV. We built a reinforcement learning pipeline from scratch: vectorized engine, neural network, training loop, and a real-time dashboard.
## Features

- Vectorized Environments: Runs `N` Snake instances in parallel using NumPy, making experience collection much faster.
- Dueling DQN: Separates state value `V(s)` and action advantages `A(s,a)` to improve convergence.
- Double DQN: Reduces Q-value overestimation by using a target network for evaluation.
- N-step Returns: Multi-step returns (default `n=3`) for better credit assignment.
- Epsilon-Greedy: Linear decay of exploration from 1.0 to 0.0.
- Strategy Blending: Combines model output, a greedy heuristic, and safe-random moves based on epsilon.
- Safe-Random Fallback: Avoids immediate wall/body collisions during random exploration.
- Greedy Heuristic: Hand-coded logic to move toward food, used as a reference early on.
- Experience Replay: 200k-capacity FIFO buffer using NumPy arrays.
- Apple Oversampling: A dedicated 20k buffer for food-gathering transitions to prevent the agent from "forgetting" how to eat.
- Soft Updates: Gradual target network updates (`tau=0.005`) for stability.
- LR Scheduler: `ReduceLROnPlateau` halves the learning rate when performance stalls (active below 0.2 epsilon).
- Gradient Clipping: Limited to `max_norm=1.0`.
- Auto-Saving: Models save automatically when a new record is reached.
- Evaluations: Periodic greedy tests (100 episodes) to track true performance.
- Starvation Timeout: Kills the snake if it doesn't eat within a calculated timeframe.
- Flood Fill: BFS-based dead-end detection included in the state vector.
- Body Density: Monitors crowding around the head.
- Lookahead: Includes 2-step danger detection.
- Tail Sensor: Heads-to-tail vector helps the agent sense its own body shape.
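The n-step return mechanism can be sketched as follows (an illustrative NumPy version using the defaults `n=3`, `gamma=0.97`; the function and variable names are ours, not the project's):

```python
import numpy as np

def n_step_return(rewards, gamma=0.97, n=3):
    """Fold the next n rewards into a single discounted return.

    Illustrative sketch: R = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}.
    The bootstrapped term gamma^n * Q(s_{t+n}, a) is added later, during
    the training step.
    """
    rewards = np.asarray(rewards[:n], dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Example: two step penalties followed by an apple reward
# = -0.001 + (-0.001 * 0.97) + (1.0 * 0.97**2)
r = n_step_return([-0.001, -0.001, 1.0])
```

With `n=3`, a reward earned three steps after a decision still reaches it with weight `gamma**2`, which is what "better credit assignment" refers to above.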
- Fully Vectorized: N environments advance simultaneously using NumPy (no Python loops).
- Actions: Straight, Turn Right, Turn Left (relative to current direction).
- Body Arrays: Positions stored in pre-allocated arrays to avoid dynamic allocation.
- Collisions: Calculated via boolean masks and indexing over the full batch.
- Partial Resets: Resets only the terminated environment.
- Apple Spawning: Automatic placement on empty tiles.
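The boolean-mask approach above can be illustrated with a minimal sketch (array and function names are assumptions, not the engine's actual code):

```python
import numpy as np

def wall_collisions(heads, grid_w=32, grid_h=24):
    """Return a boolean mask of environments whose head left the grid.

    Sketch only: `heads` is assumed to be an (N, 2) array of grid
    coordinates, one row per parallel environment. The whole batch is
    checked in one vectorized expression, with no Python loop.
    """
    x, y = heads[:, 0], heads[:, 1]
    return (x < 0) | (x >= grid_w) | (y < 0) | (y >= grid_h)

# Four environments: only the second (x=32) and third (y=-1) collide.
heads = np.array([[0, 0], [32, 5], [10, -1], [31, 23]])
mask = wall_collisions(heads)
```

The same mask can then drive partial resets: `np.where(mask)[0]` gives exactly the environments to reinitialize.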
- 4-Quadrant Flexible Layout: The dashboard is divided into 4 panels that can each display `JEU` (game view), `COURBE` (live score graph), `HISTORIQUE` (score scatter), or `VISION` (state feature bars).
- Panel Cycling: Clicking a panel's title cycles through the 4 available views.
- Left Menu: 3 sections (STATS, CONTROLS, and OPTIONS).
- 6 Live Stat Cards: Parties (games), Record, Moy100 (100-game moving average), TPS (steps/second), Temps (elapsed time), Epsilon.
- 5 Control Buttons: Save, Load, Pause/Resume, Screenshot, Export Excel.
- Auto-Screenshot: Configurable interval (in seconds) to capture the dashboard automatically.
- Pygame-native Score Graph: Rolling line chart (last 2000 scores). Rendered in pure Pygame (no Matplotlib overhead).
- Matplotlib History Scatter: Session-wide scatter plot of all scores, updated every 5 seconds.
- State Feature Visualizer: Bar chart of all 26 input features fed to the network.
- Save/Load Modal: File browser modal to choose save files from the `model/` directory.
- Excel Export: Full training history (timestamp, games, epsilon, record, average, TPS) exported to an `.xlsx` file in `logs/`.
- Console Logging: Timestamped log lines printed to stdout at a configurable interval.
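The Excel export can be sketched with pandas, which writes `.xlsx` through openpyxl (the column names mirror the fields listed above, but the project's actual `JournalDeBord` schema is an assumption):

```python
import pandas as pd

def export_history(rows, path):
    """Write training history rows to an .xlsx file.

    Sketch only: column names follow the README's list (timestamp,
    games, epsilon, record, average, TPS) and are not guaranteed to
    match the project's JournalDeBord exactly.
    """
    df = pd.DataFrame(rows, columns=["timestamp", "games", "epsilon",
                                     "record", "average", "TPS"])
    df.to_excel(path, index=False)  # uses the openpyxl engine for .xlsx
    return df
```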
## Tech Stack

| Library | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Language |
| PyTorch | 2.0+ | Neural network, training |
| Pygame | 2.5+ | Game rendering & dashboard UI |
| NumPy | 1.24+ | Vectorized environment & memory |
| Pandas | 2.0+ | Data handling for logging |
| Matplotlib | 3.7+ | History scatter plot |
| openpyxl | 3.1+ | Excel export |
| IPython | 8.10+ | Interactive console support |
## Architecture

```
Snake_AI/
├── agent.py           # Main training loop, Agent class, replay buffer, rendering
├── config.py          # All hyperparameters (ConfigEntrainement + ConfigAffichage)
├── dashboard.py       # Pygame dashboard — 4-panel layout, menu, controls
├── game.py            # Vectorized Snake engine (N parallel environments)
├── logger.py          # Console logging + Excel export
├── model.py           # Dueling DQN network + Trainer (optimizer, target net)
├── widgets.py         # Reusable UI components (Button, InputBox, StatCard, Graph)
├── requirements.txt   # Python dependencies
├── tests/
│   ├── test_agent.py      # Tests for MemoireEfficace (replay buffer)
│   ├── test_game.py       # Tests for JeuVectorise (game engine)
│   ├── test_model.py      # Tests for ReseauNeurones + Entraineur
│   ├── test_widgets.py    # Tests for all UI widget classes
│   └── test_dashboard.py
├── model/             # Saved model checkpoints (.pth)
├── logs/              # Excel training logs (.xlsx)
├── screenshots/       # Auto-captured screenshots (.png)
└── src/
    └── demoSnake.gif  # Demo animation (used in README)
```
| Module | Class(es) | Responsibility |
|---|---|---|
| `agent.py` | `MemoireEfficace`, `RenduPygame`, `AgentIA` | Training loop, replay buffer, Pygame render |
| `game.py` | `JeuVectorise` | N-environment Snake engine |
| `model.py` | `ReseauNeurones`, `Entraineur` | Network definition, optimizer, target net |
| `dashboard.py` | `Dashboard` | Full Pygame UI orchestration |
| `widgets.py` | `Bouton`, `BoiteSaisie`, `StatCard`, `GraphiquePygame` | Reusable UI components |
| `logger.py` | `JournalDeBord` | Timestamped logs + Excel export |
| `config.py` | `ConfigEntrainement`, `ConfigAffichage` | All constants and hyperparameters |
## Neural Network

The model is a Dueling DQN (`ReseauNeurones` in `model.py`).

```
Input: 26 features
        │
Linear(26 → 256) + LayerNorm + ReLU
        │
Linear(256 → 256) + LayerNorm + ReLU
        │
   ┌────┴────┐
   │         │
 Value    Advantage
 stream    stream
   │         │
Lin(256→128)  Lin(256→128)
 + ReLU        + ReLU
   │         │
Lin(128→1)    Lin(128→3)
   │         │
   └────┬────┘
        │
Q(s,a) = V(s) + (A(s,a) − mean_a(A(s,a)))

Output: 3 Q-values → argmax = chosen action
```
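The diagram translates into a compact PyTorch module along these lines (a sketch using the documented layer sizes; the project's actual class is `ReseauNeurones` and may differ in details):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling DQN sketch: shared trunk, then value and advantage streams."""

    def __init__(self, n_in=26, hidden=256, stream=128, n_actions=3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_in, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        )
        self.value = nn.Sequential(
            nn.Linear(hidden, stream), nn.ReLU(), nn.Linear(stream, 1))
        self.advantage = nn.Sequential(
            nn.Linear(hidden, stream), nn.ReLU(), nn.Linear(stream, n_actions))

    def forward(self, x):
        h = self.shared(x)
        v = self.value(h)       # (B, 1): how good is this state overall
        a = self.advantage(h)   # (B, 3): relative merit of each action
        # Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)); subtracting the mean
        # keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingDQN()(torch.zeros(4, 26))  # batch of 4 states -> (4, 3) Q-values
```

Subtracting the per-state mean advantage is the standard dueling trick: it forces the value stream to carry the state's overall quality while the advantage stream only ranks actions.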
Training algorithm:
- Double DQN: `action* = argmax_a Q_main(s', a)`, evaluated as `Q_target(s', action*)`
- Bellman target: `Q_target = R + γ^n × Q_next × (1 − done)`
- Loss: SmoothL1Loss (Huber)
- Optimizer: Adam
- Soft update: `θ_target ← τ × θ_main + (1 − τ) × θ_target` after every step
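Put together, the Double DQN target and the soft update can be sketched like this (illustrative tensor code with placeholder networks, not the project's `Entraineur`):

```python
import torch

gamma, n, tau = 0.97, 3, 0.005  # README defaults

def double_dqn_target(q_main, q_target, next_states, rewards, dones):
    """Bellman target with Double DQN action selection."""
    with torch.no_grad():
        # 1) the main net picks the action, 2) the target net evaluates it
        best = q_main(next_states).argmax(dim=1, keepdim=True)
        q_next = q_target(next_states).gather(1, best).squeeze(1)
        # gamma^n because rewards were already folded over n steps
        return rewards + (gamma ** n) * q_next * (1.0 - dones)

def soft_update(target_net, main_net, tau=tau):
    # theta_target <- tau * theta_main + (1 - tau) * theta_target
    for t, m in zip(target_net.parameters(), main_net.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * m.data)

# Tiny stand-ins for the real networks, just to exercise the functions:
net_main, net_target = torch.nn.Linear(26, 3), torch.nn.Linear(26, 3)
y = double_dqn_target(net_main, net_target,
                      torch.zeros(5, 26), torch.ones(5), torch.zeros(5))
soft_update(net_target, net_main)
```

Decoupling selection (main net) from evaluation (target net) is what curbs the overestimation bias mentioned in the feature list.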
Checkpoint format (`.pth` files):

```python
{
    "model_state": ...,
    "optimizer_state": ...,
    "nb_parties": int,
    "temps_total": float,
    "epsilon": float,
    "record": int
}
```

## State Vector (26 Features)

The network receives a vector of 26 normalized floats:
| Index | Feature(s) | Description |
|---|---|---|
| 0 | `dist_pomme` | Manhattan distance to food, normalized by grid diagonal |
| 1–2 | `dir_x`, `dir_y` | Signed direction to food (x and y), normalized |
| 3–5 | `danger_straight`, `danger_right`, `danger_left` | 1.0 if moving in that direction causes immediate collision with wall or body, else 0.0 |
| 6 | `faim` | Steps since last food / famine timeout (hunger level, 0.0 to 1.0) |
| 7–8 | `pos_x`, `pos_y` | Head position on the grid, normalized by grid width/height |
| 9–11 | `flood_straight`, `flood_right`, `flood_left` | BFS-accessible cells for each action, normalized by total grid size |
| 12–15 | `dir_right`, `dir_down`, `dir_left`, `dir_up` | One-hot encoding of the snake's current absolute direction |
| 16 | `longueur` | Snake length / max possible length |
| 17–19 | `mur_straight`, `mur_right`, `mur_left` | Normalized distance to wall in each relative direction |
| 20–21 | `queue_x`, `queue_y` | Normalized vector from head to tail |
| 22–24 | `danger2_straight`, `danger2_right`, `danger2_left` | 1.0 if a collision would occur 2 steps ahead |
| 25 | `densite_corps` | Ratio of body cells within a `densite_rayon`-cell radius of the head |
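Features 9–11 rely on a depth-limited BFS; here is a minimal sketch (our own names, assuming body cells are stored as a set of grid coordinates):

```python
from collections import deque

def flood_fill_count(blocked, start, grid_w, grid_h, max_depth=50):
    """Count reachable free cells from `start` with a depth-limited BFS.

    Illustrative version of the flood_* features: `blocked` is a set of
    (x, y) body cells; max_depth mirrors flood_depth_max. The result is
    normalized by grid size before entering the state vector.
    """
    if start in blocked or not (0 <= start[0] < grid_w and 0 <= start[1] < grid_h):
        return 0  # stepping there dies immediately: nothing is reachable
    seen, queue, count = {start}, deque([(start, 0)]), 0
    while queue:
        (x, y), d = queue.popleft()
        count += 1
        if d == max_depth:
            continue  # depth cap keeps the per-step cost bounded
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < grid_w and 0 <= ny < grid_h
                    and (nx, ny) not in blocked and (nx, ny) not in seen):
                seen.add((nx, ny))
                queue.append(((nx, ny), d + 1))
    return count
```

A low count relative to grid size signals a dead end, which is exactly the information the agent needs to avoid trapping itself.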
## Action Selection

At each step, we use a blended strategy:

```
With probability (1 − epsilon):
    → Model-based: argmax Q(state)
With probability epsilon:
    → With decreasing probability: Greedy Heuristic (move toward food, safe fallback)
    → Otherwise: Safe Random (random action among collision-free moves)
```

The greedy heuristic and safe-random strategies both avoid moves that result in immediate death whenever a safe option exists.
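The blending logic can be sketched as follows (an illustrative signature; the project's actual blending code and decay schedule may differ):

```python
import random

def select_action(q_values, heuristic_action, safe_actions,
                  epsilon, heuristic_share):
    """Blended action selection sketch.

    heuristic_share is the (decaying) fraction of exploratory moves
    handed to the greedy heuristic; all names here are illustrative.
    """
    if random.random() >= epsilon:
        # Exploit: pick the action with the highest Q-value
        return max(range(len(q_values)), key=q_values.__getitem__)
    if random.random() < heuristic_share:
        return heuristic_action              # greedy: head toward food
    return random.choice(safe_actions or [0])  # safe random fallback
```

Early in training the heuristic share is high, so exploration is productive; as it decays, exploration becomes safe-random, and as epsilon itself decays, the model takes over entirely.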
## Memory System

**Main replay buffer:**

| Property | Value |
|---|---|
| Capacity | 200,000 transitions |
| Structure | Pre-allocated NumPy ring buffer (FIFO) |
| Stores | (state, action, reward, next_state, done) |
| Sampling | Uniform random |
| Batch size | 256 |
**Apple memory (oversampling buffer):**

| Property | Value |
|---|---|
| Capacity | 20,000 transitions |
| Stores | Only transitions where food was collected |
| Purpose | Oversample rare reward events |
| Threshold | Activated when apple memory has ≥ 128 entries |
| Samples per batch | 128 additional samples mixed with main batch |
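A pre-allocated NumPy ring buffer of this kind can be sketched as follows (one array shown for brevity; `MemoireEfficace` stores all five transition fields):

```python
import numpy as np

class RingBuffer:
    """Minimal FIFO ring buffer sketch with pre-allocated storage."""

    def __init__(self, capacity, state_dim):
        # Allocated once up front: no per-transition Python allocation
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.capacity, self.index, self.size = capacity, 0, 0

    def push(self, state):
        self.states[self.index] = state  # overwrite the oldest slot
        self.index = (self.index + 1) % self.capacity  # circular wrap
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform random sampling over the filled portion only
        idx = np.random.randint(0, self.size, size=batch_size)
        return self.states[idx]

buf = RingBuffer(capacity=3, state_dim=2)
for i in range(4):               # the 4th push wraps around
    buf.push(np.full(2, i))
```

Once the main buffer supplies 256 samples and the apple buffer adds 128 more, each gradient step sees a batch where food-gathering transitions are heavily over-represented relative to their natural frequency.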
## Dashboard

Views: `JEU`, `COURBE`, `HISTORIQUE`, `VISION`. Switch panels by clicking their titles.
| Card | Description |
|---|---|
| Parties | Total games |
| Record | Best score |
| Moy100 | 100-game average |
| TPS | Steps per second |
| Temps | Duration |
| Epsilon | Exploration rate |
| Button | Key | Action |
|---|---|---|
| Sauvegarder | `S` | Save model |
| Charger | `L` | Load model |
| Pause / Resume | `Space` | Toggle training |
| Capture | `C` | Dashboard screenshot |
| Export Excel | `E` | Export training history log |
| Option | Default | Description |
|---|---|---|
| Auto-screenshots toggle | Off | Automatically captures screenshots at a fixed interval |
| Interval input box | 60 seconds | Time between auto-screenshots when enabled |
| Button | Action |
|---|---|
| Quitter | Closes the dashboard and stops training |
| Panel | Content | Update rate |
|---|---|---|
| JEU | Pygame render of environment 0 — snake with color gradient (head brighter), red apple | Every frame |
| COURBE | Pygame line chart — raw scores (blue), moving average (orange), record (gold dashed) | Every frame |
| HISTORIQUE | Matplotlib scatter of all session scores | Every ~5 seconds |
| VISION | Vertical bar chart of all 26 input features | Every frame |
**Bouton**

- Rectangular button with hover color transition.
- Click detected via rect collision.
- Supports dynamic text via setter.
- Callback function triggered on click.
**BoiteSaisie**

- Text input field with focus management.
- Click to activate, click away to deactivate.
- Keyboard input with backspace support.
- Returns text on Enter key press.
**StatCard**

- Fixed height: 52px.
- Displays a label and a large value.
- Optional bar mode: shows a progress bar with gradient from red to green.
- Render is safe: defaults to "N/A" if the value is missing.
**GraphiquePygame**

- Rolling buffer: last 2000 scores (deque with maxlen).
- Draws axes, grid lines, and a legend.
- Auto-scales Y axis to data range.
- Record line drawn as a dashed gold horizontal line.
## Configuration

Defined in `config.py`.
| Parameter | Default | Type | Description |
|---|---|---|---|
| `graine` | 42 | int | Random seed. Change to reproduce different training runs. |
| `nb_environnements` | 20 | int | Number of parallel game instances. More = faster experience collection, higher RAM/CPU usage. |
| `taux_apprentissage` | 0.0003 | float | Adam optimizer learning rate. Try 0.001 for faster (but less stable) learning. |
| `gamma` | 0.97 | float | Future reward discount factor. Lower = more short-sighted agent. |
| `epsilon_depart` | 1.0 | float | Initial exploration rate. Usually keep at 1.0 for a fresh start. |
| `epsilon_fin` | 0.0 | float | Final exploration rate. Can be set to 0.05 to always keep some exploration. |
| `epsilon_frames` | 1000 | int | Number of training steps to decay epsilon from start to end. Increase for longer exploration. |
| `transitions_min_debut` | 5000 | int | Minimum replay buffer size before training starts. Reduce to 1000 for quick tests. |
| `eval_intervalle` | 5000 | int | Steps between greedy evaluation runs. Lower to see test scores more frequently. |
| `nb_episodes_test` | 100 | int | Number of greedy episodes per evaluation. Reduce for speed. |
| `recompense_pomme` | 1.0 | float | Reward for eating food. Increase to make food more important. |
| `recompense_mort` | -1.0 | float | Penalty for dying. Increase magnitude to punish deaths more. |
| `recompense_step` | -0.001 | float | Small penalty per step (encourages efficiency). Set to 0 to remove time pressure. |
| `famine_base` | 100 | int | Base number of steps before starvation (without food). |
| `famine_par_case` | 3 | int | Additional steps per snake body block before starvation. |
| `taille_batch` | 256 | int | Batch size for main training. Larger = smoother gradients, more memory. |
| `memoire_max` | 200000 | int | Replay buffer capacity. Reduce to 50000 for lower memory usage. |
| `log_intervalle_sec` | 1.0 | float | Console log interval in seconds. Set to 5.0 for less noise. |
| `graph_update_intervalle` | 100 | int | Dashboard graph update frequency (every N steps). |
| Parameter | Default | Type | Description |
|---|---|---|---|
| `tau` | 0.005 | float | Target network soft update rate. Too high = unstable, too low = slow to learn. |
| `n_step` | 3 | int | N-step return window. Affects how far back rewards are propagated. |
| `blend_frames` | 150 | int | Steps for blending between heuristic and model-based exploration. |
| `freq_entrainement` | 8 | int | Training frequency: one gradient step every N environment steps. |
| `taille_batch_pqn` | 512 | int | Batch size for PQN-style training step. |
| `memoire_pommes_capacite` | 20000 | int | Apple memory buffer capacity. |
| `memoire_pommes_seuil` | 128 | int | Minimum apple memory size before oversampling is active. |
| `memoire_pommes_echantillons` | 128 | int | Apple samples added per training batch. |
| `flood_depth_facteur` | 0.5 | float | Flood fill BFS depth limit = `flood_depth_facteur` × snake length. |
| `flood_depth_max` | 50 | int | Hard cap on flood fill depth. |
| `densite_rayon` | 3 | int | Radius (in cells) for body density computation. |
| `scores_historique_maxlen` | 500 | int | Max history length for score tracking. |
| `scores_test_maxlen` | 100 | int | Max history length for test-set scores. |
| `lr_scheduler_epsilon_seuil` | 0.2 | float | Epsilon threshold below which the LR scheduler activates. |
| `lr_scheduler_patience` | 100 | int | Patience for the LR scheduler (in log intervals). |
| `lr_scheduler_factor` | 0.5 | float | LR reduction factor on plateau. |
| `lr_min` | 1e-6 | float | Minimum learning rate (floor for the scheduler). |
| Parameter | Default | Type | Description |
|---|---|---|---|
| `input_size` | 26 | int | Network input size: must match the 26-feature state vector exactly. |
| `output_size` | 3 | int | Number of actions (Straight, Right, Left). |
| `largeur` | 640 | int | Game area width in pixels. Changing this changes the grid size and breaks saved models. |
| `hauteur` | 480 | int | Game area height in pixels. Same caveat. |
| `taille_bloc` | 20 | int | Cell size in pixels. Grid = (largeur/taille_bloc) × (hauteur/taille_bloc) = 32×24. |
| `taille_couche_1` | 256 | int | First shared hidden layer size. Changing it breaks saved models. |
| `taille_couche_2` | 256 | int | Second shared hidden layer size. |
| `taille_couche_v` | 128 | int | Value stream layer size. |
| `taille_couche_a` | 128 | int | Advantage stream layer size. |
| Parameter | Default | Description |
|---|---|---|
| `largeur_fenetre` | 1920 | Window width |
| `hauteur_fenetre` | 1080 | Window height |
| `largeur_menu` | 250 | Menu width |
| `hauteur_barre_bas` | 40 | Status bar height |
| `intervalle_screenshot_defaut` | 60 | Auto-screenshot interval (seconds) |
## Getting Started

```bash
git clone https://github.com/ProGen18/Snake_AI.git
cd Snake_AI
```

```bash
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

```bash
python agent.py
```

This launches the full training loop with the Pygame dashboard open. The agent starts exploring randomly and gradually shifts toward learned behavior as epsilon decays.
## Scripts

| Command | Description |
|---|---|
| `python agent.py` | Start training with live dashboard |
| `pytest` | Run the full test suite |
| `pytest -v` | Verbose test output |
| `pytest tests/test_model.py` | Run model tests only |
| `pytest tests/test_game.py` | Run game engine tests only |
| `pytest tests/test_agent.py` | Run replay buffer tests only |
| `pytest tests/test_widgets.py` | Run UI widget tests only |
## Tests

The project uses `pytest`. Tests cover all major components.
| File | What it tests |
|---|---|
| `test_agent.py` | `MemoireEfficace`: storage, ring-buffer overflow, circular wrapping, random sampling |
| `test_game.py` | `JeuVectorise`: initialization, partial reset, state computation, flood fill, danger flags, tail direction, body density |
| `test_model.py` | `ReseauNeurones`: forward shapes, NaN/Inf check, save/load roundtrip. `Entraineur`: learning step, soft update, loss decrease |
| `test_widgets.py` | `Bouton`: click/miss callbacks; `BoiteSaisie`: focus, input, Enter; `StatCard`: value set, bar mode render; `GraphiquePygame`: add points, draw empty/full, record line |
## Troubleshooting

**Dashboard lag.** The Pygame event loop runs inside the training loop. If batch sizes are very large or the CPU is saturated, the UI may lag. Reduce `nb_environnements` or `taille_batch` if this happens.
**No GPU detected.** The agent automatically falls back to CPU if no compatible GPU is found. To use a GPU, install the CUDA-enabled build of PyTorch matching your system's driver version.
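The standard PyTorch device-selection pattern looks like this (the checkpoint path in the comment is hypothetical, not a file the project guarantees):

```python
import torch

# Pick the GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# When loading a checkpoint saved on GPU onto a CPU-only machine,
# remap the tensors explicitly, e.g.:
#   checkpoint = torch.load("model/some_checkpoint.pth", map_location=device)
```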
**Missing dependencies.** Activate your virtual environment first, then run `pip install -r requirements.txt`.
**Checkpoint fails to load.** Ensure the `.pth` file was saved by the same version of `ReseauNeurones`. Legacy saves (without optimizer state) are partially supported via a fallback loader.
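A tolerant loader can look like this (a sketch assuming the key names from the checkpoint format shown earlier, not the project's actual fallback code):

```python
import torch

def load_checkpoint(path, model, optimizer=None):
    """Load a checkpoint, tolerating legacy saves without optimizer state.

    Sketch only: key names ("model_state", "optimizer_state", ...) follow
    the checkpoint format documented above.
    """
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    if optimizer is not None and "optimizer_state" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer_state"])
    # Missing metadata falls back to fresh-run defaults.
    return {
        "nb_parties": ckpt.get("nb_parties", 0),
        "epsilon": ckpt.get("epsilon", 1.0),
        "record": ckpt.get("record", 0),
    }

# Exercise with a tiny stand-in model and a legacy-style save
# (no optimizer state, partial metadata):
m = torch.nn.Linear(2, 2)
torch.save({"model_state": m.state_dict(), "record": 5}, "ckpt_test.pth")
meta = load_checkpoint("ckpt_test.pth", torch.nn.Linear(2, 2))
```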
**Training doesn't improve.**
- Check that `transitions_min_debut` is not set too high: training won't start until the buffer is filled.
- Increase `epsilon_frames` if the agent doesn't explore enough before exploitation begins.
- If scores plateau early, the scheduler may reduce the learning rate too aggressively: check `lr_scheduler_patience`.
