This repository is the official artifact for the SLE 2026 paper "How much does an LLM know about my programming language?".
The framework systematically evaluates LLM proficiency in a programming language using ANTLR-based syntactic analysis: it extracts keywords and grammar rules from LLM-generated code and measures how much of the language's vocabulary and structure a model actually uses — even in code that fails to compile.
Key features:
- Granular metrics: Measures Keyword/Rule Coverage and Production Validity.
- Partial knowledge capture: Analyzes syntactically invalid code to identify where a model's knowledge fails.
- Extensible: Designed to work with any language that has an ANTLR grammar. → Adding a New Language
- Automatic reporting: Generates a complete language coverage report as a Jupyter Notebook.
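To make the partial-knowledge idea concrete, the sketch below shows the kind of ANTLR-based keyword extraction the framework builds on: an ANTLR-generated lexer tokenizes an LLM-generated snippet and counts which keywords it uses, even when the snippet would not parse. This is an illustrative sketch, not the framework's actual code; the import path and the `OCLLexer` class name are assumptions based on the artifacts under `languages/ocl/parser/`.

```python
from collections import Counter

from antlr4 import CommonTokenStream, InputStream, Token
# Assumed location of the ANTLR-generated OCL lexer in this repository.
from languages.ocl.parser.OCLLexer import OCLLexer

def keyword_usage(code: str, keywords: set[str]) -> Counter:
    """Count occurrences of the given keywords in `code` using only the lexer."""
    lexer = OCLLexer(InputStream(code))
    stream = CommonTokenStream(lexer)
    stream.fill()  # lexing works even when the snippet would fail to parse
    return Counter(
        tok.text
        for tok in stream.tokens
        if tok.type != Token.EOF and tok.text in keywords
    )

# Example: which OCL keywords does this (possibly invalid) completion use?
print(keyword_usage("context Person inv: self.age >= 0",
                    {"context", "inv", "self", "and", "implies"}))
```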
The following six languages are fully supported out of the box:
| Language | Grammar style |
|---|---|
| OCL | Single combined grammar (OCL.g4) |
| PlantUML | Split lexer + parser (PlantUMLLexer.g4 + PlantUML.g4) |
| Python 3 | Split lexer + parser (Python3Lexer.g4 + Python3Parser.g4) |
| R | Single grammar with filter (R.g4 + RFilter.g4) |
| Lua | Split lexer + parser (LuaLexer.g4 + LuaParser.g4) |
| SQL | Split lexer + parser (MariaDBLexer.g4 + MariaDBParser.g4) |
Want to add a new language or new LLM models? See Adding a New Language and Extending the Framework.
```
measuring-llm-knowledge/
├── experiment/
│   ├── main.py — analysis entry point (produces results + notebooks)
│   └── generate.py — LLM code generation entry point (requires API keys)
├── configurations/
│   └── sle_paper.json — experiment configuration
├── run/ — per-language analysis runners
├── core/ — shared analysis library
├── languages/ — ANTLR grammars + per-language analyzers + visitors and evaluation
├── datasets/ — input datasets (including LLM-generated code and pre-cached HuggingFace data)
├── results/ — output CSVs written at runtime
├── analysis/ — output Jupyter notebooks written at runtime
├── generation/ — LLM generation scripts
├── Dockerfile
├── docker-build.sh
├── docker-run.sh — full run (all languages)
└── docker-run-reduced.sh — kick-the-tires run (OCL only)
```
Docker is the recommended path for artifact evaluation. It requires no local Python setup and no internet access during the run.
- Docker Desktop ≥ 24.0 (download)
- ~8 GB free disk space (image + outputs)
- ~4 GB RAM
```bash
bash docker-build.sh
```

This builds the `sle2026-artifact` image locally. All dependencies and datasets are bundled inside; no network access is needed at run time.
```bash
bash docker-run-reduced.sh
```

Runs the full pipeline for OCL only: analysis → notebook generation → notebook execution. This confirms the artifact works end-to-end before committing to the full run. Outputs appear in `analysis/` and `results/` on your host.
```bash
bash docker-run.sh
```

Runs the full pipeline for all six languages, executes every per-language notebook, and then executes `analysis/_all_together.ipynb`, which produces the paper's main tables (RQ1 and RQ3).

Note: `_all_together.ipynb` requires results from all six languages; it is executed automatically only when running with `--language ALL`.
Both scripts mount two host directories so outputs are accessible after the run:
| Host path | Contents |
|---|---|
| `./results/` | Raw feature CSVs per language and dataset |
| `./analysis/` | Executed Jupyter notebooks + paper figures (`.pdf`) + paper tables (`.json`) |
The following files in analysis/ correspond directly to paper figures and
tables, generated when the full run completes:
| File | Content | Paper location |
|---|---|---|
| `analysis/4_usage_ocl.pdf` | Keyword usage heatmap — OCL | Figure 2 |
| `analysis/4_usage_python.pdf` | Keyword usage heatmap — Python 3 | Figure 3 |
| `analysis/rq1_table.json` | Coverage and correctness metrics — all languages, LLM-generated datasets | Table 2 |
| `analysis/rq3_table.json` | Coverage and correctness metrics — baseline (non-generated) datasets | Table 3 |
All other figures (coverage bar charts, frequency heatmaps, variance plots,
rule-usage heatmaps) are rendered inline inside the per-language notebooks
(analysis/ocl.ipynb, analysis/python3.ipynb, etc.).
Use this path to run individual languages, inspect intermediate results, or re-run only the notebooks after a previous analysis.
This path also allows you to extend the framework with new languages or models. → Extending the Framework
```bash
pip install -r requirements-docker.txt
```

`experiment/main.py` (analysis) requires no API keys; it only reads pre-included datasets.

`experiment/generate.py` (LLM code generation) requires API keys for cloud-backed models. Set them as environment variables before running:

```bash
# Required for gpt-4o-mini and gpt-4o
export OPENAI_API_KEY='your-openai-key-here'

# Required for DeepSeek-V3.1 and Llama-3.3-70B-Instruct (Azure AI)
export AZUREAI_ENDPOINT_URL='your-azure-endpoint-here'
export AZUREAI_API_KEY='your-azure-key-here'
```

Local models (`deepseek-coder:6.7b`, `llama3.1:latest`) are served via Ollama and require no API key. To adapt the prompting setup to your use case, edit `core/llm_prompting.py`.
Runs keyword and syntactic analysis for one or all languages, writes results
to results/, generates analysis notebooks in analysis/, and executes them.
```bash
python experiment/main.py [options]
```
| Argument | Description | Default |
|---|---|---|
| `--language LANG` | Language to run: `lua` \| `ocl` \| `plantuml` \| `python3` \| `r` \| `sql` \| `ALL` | (required unless `--all-together-notebook`) |
| `--config PATH` | Path to the JSON experiment configuration | `configurations/sle_paper.json` |
| `--notebook-only` | Skip analysis; only execute the notebook(s) for the specified language(s) | off |
| `--all-together-notebook` | Skip language analysis; only execute `analysis/_all_together.ipynb` | off |
Examples:
```bash
# Run analysis for OCL and execute its notebook
python experiment/main.py --language ocl

# Run all languages, then execute all notebooks including _all_together.ipynb
python experiment/main.py --language ALL

# Only re-execute a previously generated notebook
python experiment/main.py --language python3 --notebook-only

# Only re-execute the combined results notebook (requires all language results to exist)
python experiment/main.py --all-together-notebook
```

Must be run from the project root (e.g. `python experiment/main.py`).
Generates LLM code completions for any of the six supported languages and saves them to `generation/results/`. Results must be manually copied to the appropriate `datasets/` subfolder before running `main.py`.[^1]
To add support for additional LLM models, see Extending the Framework.
```bash
python experiment/generate.py [options]
```
| Argument | Description | Default |
|---|---|---|
| `--language LANG` | Language to generate for: `ocl` \| `plantuml` \| `python3` \| `sql` \| `lua` \| `r` \| `ALL` | (required) |
| `--models MODEL [MODEL ...]` | One or more models to use (space-separated) | All six models |
Supported models:
| Model | Provider |
|---|---|
| `gpt-4o-mini` | OpenAI (requires `OPENAI_API_KEY`) |
| `gpt-4o` | OpenAI (requires `OPENAI_API_KEY`) |
| `DeepSeek-V3.1` | Azure AI (requires `AZUREAI_API_KEY`) |
| `Llama-3.3-70B-Instruct` | Azure AI (requires `AZUREAI_API_KEY`) |
| `deepseek-coder:6.7b` | Ollama (local, no key required) |
| `llama3.1:latest` | Ollama (local, no key required) |
Examples:
```bash
# Generate OCL code with all models
python experiment/generate.py --language ocl

# Generate SQL code with two specific models
python experiment/generate.py --language sql --models gpt-4o-mini gpt-4o

# Generate for all languages
python experiment/generate.py --language ALL
```

Generation output paths and their `datasets/` targets:[^1]

| Language | Generated to (`generation/results/`) | Copy to (`datasets/`) |
|---|---|---|
| OCL | `OCL_ds_generated/` | `OCL_ds_generated/` |
| PlantUML | `PlantUML_GoldenUMLmodelset_gentest/<model>/` | `PlantUML_GoldenUMLmodelset_gentest/<model>/` |
| Python 3 | `python_humanEval/` and `python_mbpp/` | `python_humanEval/` and `python_mbpp/` |
| SQL | `sql_spider_generated/` | `sql_spider_generated/` |
| Lua | `lua_multipl-e_generated/humanEval/` and `lua_multipl-e_generated/mbpp/` | `lua_multipl-e_generated/humanEval/` and `lua_multipl-e_generated/mbpp/` |
| R | `r_multipl-e_generated/humanEval/` and `r_multipl-e_generated/mbpp/` | `r_multipl-e_generated/humanEval/` and `r_multipl-e_generated/mbpp/` |
The framework is designed to work with any language that can be described by an ANTLR grammar. The supported languages cover a range of grammar styles — you can use any of them as a starting point:
| Grammar style | Example in this project |
|---|---|
| Single combined grammar | OCL (languages/ocl/parser/OCL.g4) |
| Split lexer + parser | Lua (LuaLexer.g4 + LuaParser.g4), SQL (MariaDBLexer.g4 + MariaDBParser.g4), PlantUML (PlantUMLLexer.g4 + PlantUML.g4), Python 3 (Python3Lexer.g4 + Python3Parser.g4) |
| Grammar with auxiliary filter | R (R.g4 + RFilter.g4) |
| External ANTLR meta-grammar | ANTLR itself (languages/antlr/parser/ANTLRv4Lexer.g4 + ANTLRv4Parser.g4) |
Steps to add a new language:
- Grammar: Place your `.g4` file(s) and the ANTLR-generated Python artifacts (`.py`, `.interp`, `.tokens`) under `languages/<your_lang>/parser/`. Thousands of ready-to-use grammars are available at the ANTLR Grammars repository.
- Analyzer: Create `languages/<your_lang>/analyzer.py` and `lib.py` based on an existing language (e.g. `languages/ocl/`). Adapt the root rule name and any language-specific parametrization (e.g., the initial rule of the grammar). You also need to adapt the visitors (`KeywordFrequencyVisitor.py`, `SequenceVisitor.py`, and `SyntacticAnalisisVisitor.py`) to your language's lexer and parser names.
- Scope: This step can be performed automatically by gpt-4o-mini, which requires the OpenAI API key mentioned above. Alternatively, define `excluded_keywords` and `grouped_keywords` manually, either inline in your run script or in JSON files under `languages/<your_lang>/keywords/`, to narrow the analysis to a relevant subset of the grammar.
- Run script: Create `run/<your_lang>.py` extending `RunAnalysis`. Point it to your dataset and implement `extract_data_from_grammar()` and `run_analysis()` (a minimal sketch follows this list). Once this step runs, you have everything needed to analyze your language; the next step only wires it into `experiment/main.py`.
- Configuration (optional): Add a new entry to `configurations/sle_paper.json` following the schema of the existing languages.
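The following is a minimal sketch of such a run script. The base-class import path, constructor, and attribute names are assumptions; copy an existing script under `run/` (e.g. the OCL one) and adapt it rather than using this verbatim.

```python
# run/<your_lang>.py: hypothetical skeleton; the names below are assumptions.
# Mirror the structure of an existing run script such as run/ocl.py.
from core.run_analysis import RunAnalysis  # actual module path may differ

class MyLangAnalysis(RunAnalysis):
    language = "mylang"
    grammar_dir = "languages/mylang/parser"
    dataset_dir = "datasets/mylang_generated"

    def extract_data_from_grammar(self):
        # Read keywords and production rules from the ANTLR grammar,
        # applying excluded_keywords / grouped_keywords if defined.
        ...

    def run_analysis(self):
        # Parse every file in the dataset with the generated lexer/parser,
        # run the visitors, and write the feature CSVs to results/.
        ...

if __name__ == "__main__":
    MyLangAnalysis().run_analysis()
```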
The framework is intentionally open-ended. Without major modifications, you can:
- Add new LLM models: edit `core/llm_factory.py` to route your model name to the appropriate provider, and add a provider class in `core/llm_prompting.py` if needed. OpenAI-compatible APIs (including local servers such as LM Studio) can be added by subclassing `LLMPrompting`; a hedged sketch follows this list. Then pass the new model name via `experiment/generate.py --models <your_model>`.
- Add new languages: follow the steps in Adding a New Language. Any language with an ANTLR grammar can be integrated.
- Add new datasets: add an entry to `configurations/sle_paper.json` and place the data file in `datasets/`. Supported formats: Parquet, JSON, CSV, CSV folder, or HuggingFace cache.
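As an illustration of the first point, the sketch below adds a provider class for an OpenAI-compatible local server. The `LLMPrompting` interface shown here (a single `prompt()` method) and the constructor signature are assumptions; align them with the real base class in `core/llm_prompting.py`, and remember to register the model name in `core/llm_factory.py`.

```python
# Hypothetical addition to core/llm_prompting.py; the base-class interface
# (a single prompt() method) is an assumption, so align it with the real one.
from openai import OpenAI

class LocalOpenAICompatiblePrompting(LLMPrompting):
    """Provider for OpenAI-compatible local servers such as LM Studio."""

    def __init__(self, model_name: str, base_url: str = "http://localhost:1234/v1"):
        self.model_name = model_name
        # Local OpenAI-compatible servers typically accept any placeholder key.
        self.client = OpenAI(base_url=base_url, api_key="not-needed")

    def prompt(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": text}],
        )
        return response.choices[0].message.content
```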
All LLM-generated code datasets used in the paper are included in the
datasets/ folder. The pre-cached HuggingFace datasets required for Python 3
analysis (datasets/hf_cache/) are also included so no internet access is
needed at run time.
For large or third-party baseline datasets (e.g. MultiPL-T), refer to their original sources if you need to reproduce the baseline analysis or extend it to your language.
[^1]: After generation, file names must be manually renamed to match the unified naming convention used across the project (e.g. `DeepSeek-V3.1_spider.csv` → `deepseek.csv`). This convention unifies model and dataset identifiers across all languages and is required for `main.py` to locate the files correctly.