Turn messy copied job ads into clean, structured markdown with AI.
Repository layout:

```
jobs/
  raw/          # input job ads
  processed/    # generated structured outputs
src/
  prompt.md     # extraction instructions
  run.py        # local/CI runner
.github/workflows/main.yml
```
The sample job ad is fully synthetic and intentionally noisy (duplicates, UI text, etc.) to simulate real copy-paste from job platforms.
Create and activate a virtual environment, then install the dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install openai
pip freeze > requirements.txt
```

Set your API key and run the processor:
```bash
export OPENAI_API_KEY=your_key_here
python3 src/run.py jobs/raw/job1.md
```

Process multiple specific files:

```bash
python3 src/run.py jobs/raw/job1.md jobs/raw/job2.md
```

Optional environment variables:
- `OPENAI_MODEL` to override the default model (gpt-5.4-mini)
- `OPENAI_BASE_URL` to point to a compatible API base URL
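Reading these overrides can be sketched as follows (the variable names match the list above; the exact defaults and fallback behavior inside `run.py` are assumptions):

```python
import os

# Assumed default; override with the OPENAI_MODEL environment variable.
DEFAULT_MODEL = "gpt-5.4-mini"

model = os.environ.get("OPENAI_MODEL", DEFAULT_MODEL)
# None means the OpenAI client falls back to its built-in default endpoint.
base_url = os.environ.get("OPENAI_BASE_URL")

print(f"model={model} base_url={base_url}")
```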
Each generated file is written to jobs/processed/ and contains:
- source metadata
- processing timestamp
- model name
- deterministic structured JSON
The markdown is generated from the structured JSON, which is the source of truth.
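As a rough illustration of that relationship, rendering markdown from the structured JSON might look like the sketch below. The field names (`title`, `company`, `location`, `requirements`) are assumptions for illustration, not the project's actual schema:

```python
import json

def render_markdown(record: dict) -> str:
    """Render one parsed job record as markdown. Field names are hypothetical."""
    lines = [f"# {record['title']}", ""]
    lines.append(f"**Company:** {record['company']}")
    lines.append(f"**Location:** {record['location']}")
    lines.append("")
    lines.append("## Requirements")
    lines.extend(f"- {req}" for req in record["requirements"])
    return "\n".join(lines)

# The JSON stays the source of truth; markdown is derived from it on demand.
record = json.loads(
    '{"title": "Data Engineer", "company": "Acme", '
    '"location": "Remote", "requirements": ["Python", "SQL"]}'
)
print(render_markdown(record))
```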
Output filenames mirror the input stem, for example jobs/raw/job1.txt becomes jobs/processed/job1.md.
Reprocessing the same input overwrites that file.
If no input files are passed to src/run.py, nothing is processed.
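A minimal sketch of that per-file behavior, assuming the real `src/run.py` may differ in its internals (the extraction call is elided here):

```python
import sys
from pathlib import Path

OUT_DIR = Path("jobs/processed")

def output_path(input_path: str) -> Path:
    # Mirror the input stem: jobs/raw/job1.txt -> jobs/processed/job1.md
    return OUT_DIR / (Path(input_path).stem + ".md")

def main(argv: list[str]) -> None:
    if not argv:  # no input files passed: nothing is processed
        return
    for raw in argv:
        out = output_path(raw)
        # ... call the model, write metadata + structured JSON + markdown
        # to `out`, overwriting any previous output for the same input ...
        print(f"{raw} -> {out}")

if __name__ == "__main__":
    main(sys.argv[1:])
```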
The GitHub Actions workflow:
- triggers on pushes affecting `jobs/raw/**`
- can also be started manually with `workflow_dispatch`
- detects changed raw files
- ensures only newly added or modified job ads are processed, avoiding unnecessary API calls
- runs `src/run.py` once per changed file
- commits generated files in `jobs/processed/` back to the repository
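The changed-file detection step can be sketched as a plain path filter; the actual workflow presumably derives the changed-path list from `git diff`, which is elided here:

```python
import fnmatch

def changed_raw_files(changed_paths: list[str]) -> list[str]:
    """Keep only paths under jobs/raw/ -- the only files worth reprocessing.

    Note: fnmatch's '*' also matches '/' so this covers nested paths too.
    """
    return [p for p in changed_paths if fnmatch.fnmatch(p, "jobs/raw/*")]

# Hypothetical changed-path list from a push; only the raw job ad survives.
changed = ["jobs/raw/job3.md", "src/run.py", "jobs/processed/job1.md"]
print(changed_raw_files(changed))
```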
To enable CI, add `OPENAI_API_KEY` as a GitHub Actions secret.
Design principles:

- JSON is the source of truth
- Markdown is generated for readability
- One input file → one deterministic output file
- No database, no UI, local-first workflow
- CI handles orchestration, not business logic