This project explores the intersection of Reinforcement Learning (RL) and Large Language Models (LLMs) in complex, imperfect-information environments (Gin Rummy). It addresses the challenge of training RL agents without reliable opponents or expensive human feedback by establishing an adversarial co-evolutionary loop.
We utilize LLMs (Llama 3, Gemma, GPT) as zero-shot strategic opponents to guide the training of efficient PPO agents. The system employs a 3-phase curriculum learning approach to distill the broad, "common-sense" strategic knowledge of LLMs into a fast, compact RL policy.
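One way to picture the distillation step: the LLM's suggested action distribution can act as a prior that regularizes the PPO policy via a KL penalty added to the loss. The sketch below is a hypothetical, NumPy-only illustration of that idea; the function name, `beta` weight, and interfaces are assumptions, not the project's actual implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_prior_penalty(agent_logits, llm_prior_probs, beta=0.1):
    """Hypothetical policy-prior term: beta * KL(agent || LLM prior).

    Added to the PPO loss, this nudges the agent's action
    distribution toward the LLM's zero-shot suggestions without
    forcing it to copy them outright.
    """
    p = softmax(np.asarray(agent_logits, dtype=float))
    q = np.asarray(llm_prior_probs, dtype=float)
    q = np.clip(q, 1e-8, None)
    q = q / q.sum()
    kl = float(np.sum(p * (np.log(p + 1e-12) - np.log(q))))
    return beta * kl
```

When the agent already matches the prior, the penalty is zero; the further it drifts, the larger the (always non-negative) term becomes.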
- High-Performance RL Pipeline: Engineered a high-throughput, 64-96 core, multi-process PPO training pipeline with a custom action-masked policy using Stable Baselines 3 and PyTorch.
- Curriculum Learning System: Built a robust 3-phase curriculum learning system (Random → Self-Play → Adversarial) with a fully cached RAM model-pool API, achieving a 99.12% win rate vs. baseline agents.
- LLM Knowledge Distillation: Architected a scalable API framework to integrate LLM strategic insights as a policy prior, enabling agents to learn from models like Llama 3 and GPT-OSS via Ollama/HuggingFace.
- Interactive Evaluation Suite: Designed and built a custom evaluation environment (PettingZoo) and Web UI for critical live human-vs-agent testing and qualitative validation of learned strategies.
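The action masking mentioned above boils down to a simple trick: before sampling, invalid actions' logits are set to negative infinity so the softmax assigns them zero probability. A minimal NumPy sketch (not the Stable Baselines 3 internals, just the core idea):

```python
import numpy as np

def masked_softmax(logits, valid_mask):
    """Softmax restricted to valid actions.

    Invalid actions get logit -inf, so exp(-inf) == 0 and they
    receive exactly zero probability mass after normalization.
    """
    logits = np.asarray(logits, dtype=float)
    masked = np.where(np.asarray(valid_mask, dtype=bool), logits, -np.inf)
    z = masked - masked[np.isfinite(masked)].max()  # stable shift over finite entries
    e = np.exp(z)
    return e / e.sum()
```

In Gin Rummy, the valid mask changes every turn (e.g. you can only knock below a deadwood threshold), so masking is essential to keep PPO from wasting probability mass on illegal moves.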
The project consists of three main components:
- The RL Agent (PPO): A custom implementation of Proximal Policy Optimization with valid action masking, trained to handle the partial observability of Gin Rummy.
- The LLM Agent: A wrapper that renders game states into text prompts (with chain-of-thought reasoning) and parses LLM responses back into valid game actions.
- The Orchestrator: Manages the training curriculum, switching opponents between random agents, prior model checkpoints, and live LLM inferences based on training progress.
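The LLM Agent's prompt/parse round trip can be sketched as follows. This is a simplified, hypothetical version: the prompt wording, the `ACTION:` answer convention, and the fallback-to-first-legal-action behavior are illustrative assumptions, not the project's exact `prompts.yaml` format.

```python
import re

def build_prompt(hand, discard_top, legal_actions):
    """Render a Gin Rummy state as a chain-of-thought prompt (hypothetical format)."""
    return (
        "You are playing Gin Rummy.\n"
        f"Your hand: {', '.join(hand)}\n"
        f"Top of discard pile: {discard_top}\n"
        f"Legal actions: {', '.join(legal_actions)}\n"
        "Think step by step, then answer with: ACTION: <one legal action>"
    )

def parse_action(response, legal_actions):
    """Extract the chosen action from free-form LLM text.

    Falls back to the first legal action if the reply is malformed
    or names an illegal move, so the game loop never stalls.
    """
    m = re.search(r"ACTION:\s*([\w-]+)", response)
    if m and m.group(1) in legal_actions:
        return m.group(1)
    return legal_actions[0]
```

The fallback matters in practice: even strong LLMs occasionally produce unparseable or illegal replies, and a deterministic default keeps training and evaluation loops robust.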
.
```
├── agents/ # Agent implementations (PPO, Random, LLM, Human)
├── artifacts/ # Trained models and checkpoints
├── config/ # Configuration files (paths, prompts.yaml)
├── controller/ # Game logic and orchestration
├── game/ # Gin Rummy environment wrappers and assets
├── llm/ # API handlers for Ollama/HuggingFace interaction
├── src/ # Utilities, logging, and UI components
├── templates/ # HTML templates for the Web UI
├── app.py # Flask application for web-based play
├── eval.py # Evaluation scripts
├── main.py # Main entry point
├── ppo_train.py # PPO training pipeline script
├── environment.yml # Conda environment definition
└── requirements.txt # Python dependencies
```
- Python 3.10+
- Conda (recommended)
- Ollama (for local LLM inference)
Clone the repository:
```bash
git clone https://github.com/nikelroid/adversarial-coevolution.git
cd adversarial-coevolution
```
Create the environment:
```bash
conda env create -f environment.yml
conda activate rl-llm-env
```
Alternatively, using pip:
```bash
pip install -r requirements.txt
pip install -e .
```
To start the PPO training pipeline with the default configuration (Curriculum Phase 1 & 2):
```bash
python ppo_train.py
```
Check `config/` to adjust hyperparameters or curriculum stages.
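As an illustration of what such a configuration might look like, here is a hypothetical curriculum snippet. The file name, keys, and values below are assumptions for the sake of example, not the repository's actual `config/` contents:

```yaml
# config/curriculum.yaml (hypothetical example)
phases:
  - name: random          # Phase 1: warm up against a random opponent
    opponent: random
    timesteps: 1_000_000
  - name: self_play       # Phase 2: play against cached prior checkpoints
    opponent: model_pool
    timesteps: 2_000_000
  - name: adversarial     # Phase 3: face live LLM opponents
    opponent: llm
    llm_backend: ollama
    timesteps: 500_000
```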
Ensure your Ollama server is running (default port 11434). To test an LLM agent:
```bash
python llm_test.py
```
Launch the web application to play against the trained models:
```bash
python app.py
```
Open your browser at http://localhost:5000.
To benchmark the current model against a random agent or an LLM:
```bash
python eval.py --model artifacts/models/ppo_gin_rummy/ppo_gin_rummy_final.zip
```

| Agent Type | Opponent | Win Rate | Notes |
|---|---|---|---|
| PPO (Baseline) | Random | 98.9% | High win rate, but prone to local optima (Gin-biased). |
| PPO (Curriculum) | Random | 99.1% | Balanced strategy (Knock vs. Gin). |
| GPT-OSS (20B) | Random | 100% | Zero-shot performance (5-0 match). |
| GPT-OSS (20B) | PPO (Knock) | 60% | Competitive match (3-2 score). |
- Nima Kelidari - Lead Engineer & RL Architecture - kelidari@usc.edu
- Mahdi Salmani - LLM Integration & Evaluation - salmanis@usc.edu
- Mohammadsaeed Haghi - Game Environment & API - haghim@usc.edu
This project is licensed under the MIT License - see the LICENSE file for details.
- PettingZoo for the Multi-Agent RL environments.
- Stable-Baselines3 for reliable PPO implementations.
- RLCard for game logic inspiration.
