This repository is the official artifact for the SLE 2026 paper "How much does an LLM know about my programming language?".
The framework systematically evaluates LLM proficiency in a programming language using ANTLR-based syntactic analysis: it extracts keywords and grammar rules from LLM-generated code and measures how much of the language's vocabulary and structure a model actually uses — even in code that fails to compile.
Key features:
- Granular metrics: Measures Keyword/Rule Coverage and Production Validity.
- Partial knowledge capture: Analyzes syntactically invalid code to identify where a model's knowledge fails.
- Extensible: Designed to work with any language that has an ANTLR grammar. → Adding a New Language
- Automatic reporting: Generates a complete language coverage report as a Jupyter Notebook.
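To make the partial-knowledge idea concrete, the sketch below shows the kind of ANTLR-based keyword extraction the framework builds on: an ANTLR-generated lexer tokenizes an LLM-generated snippet and counts which keywords it uses, even when the snippet would not parse. This is an illustrative sketch, not the framework's actual code; the import path and the `OCLLexer` class name are assumptions based on the artifacts under `languages/ocl/parser/`.

```python
from collections import Counter

from antlr4 import CommonTokenStream, InputStream, Token
# Assumed location of the ANTLR-generated OCL lexer in this repository.
from languages.ocl.parser.OCLLexer import OCLLexer

def keyword_usage(code: str, keywords: set[str]) -> Counter:
    """Count occurrences of the given keywords in `code` using only the lexer."""
    lexer = OCLLexer(InputStream(code))
    stream = CommonTokenStream(lexer)
    stream.fill()  # lexing works even when the snippet would fail to parse
    return Counter(
        tok.text
        for tok in stream.tokens
        if tok.type != Token.EOF and tok.text in keywords
    )

# Example: which OCL keywords does this (possibly invalid) completion use?
print(keyword_usage("context Person inv: self.age >= 0",
                    {"context", "inv", "self", "and", "implies"}))
```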
The following six languages are fully supported out of the box:
| Language | Grammar style |
|---|---|
| OCL | Single combined grammar (OCL.g4) |
| PlantUML | Split lexer + parser (PlantUMLLexer.g4 + PlantUML.g4) |
| Python 3 | Split lexer + parser (Python3Lexer.g4 + Python3Parser.g4) |
| R | Single grammar with filter (R.g4 + RFilter.g4) |
| Lua | Split lexer + parser (LuaLexer.g4 + LuaParser.g4) |
| SQL | Split lexer + parser (MariaDBLexer.g4 + MariaDBParser.g4) |
Want to add a new language or new LLM models? See Adding a New Language and Extending the Framework.
```
measuring-llm-knowledge/
├── experiment/
│   ├── main.py — analysis entry point (produces results + notebooks)
│   └── generate.py — LLM code generation entry point (requires API keys)
├── configurations/
│   └── sle_paper.json — experiment configuration
├── run/ — per-language analysis runners
├── core/ — shared analysis library
├── languages/ — ANTLR grammars + per-language analyzers + visitors and evaluation
├── datasets/ — input datasets (including LLM-generated code and pre-cached HuggingFace data)
├── results/ — output CSVs written at runtime
├── analysis/ — output Jupyter notebooks written at runtime
├── generation/ — LLM generation scripts
├── Dockerfile
├── docker-build.sh
├── docker-run.sh — full run (all languages)
└── docker-run-reduced.sh — kick-the-tires run (OCL only)
```
Docker is the recommended path for artifact evaluation. It requires no local Python setup and no internet access during the run.
- Docker Desktop ≥ 24.0 (download)
- ~8 GB free disk space (image + outputs)
- ~4 GB RAM
```bash
bash docker-build.sh
```

This builds the `sle2026-artifact` image locally. All dependencies and datasets are bundled inside; no network access is needed at run time.
```bash
bash docker-run-reduced.sh
```

Runs the full pipeline for OCL only: analysis → notebook generation → notebook execution. This confirms the artifact works end-to-end before committing to the full run. Outputs appear in `analysis/` and `results/` on your host.
```bash
bash docker-run.sh
```

Runs the full pipeline for all six languages, executes every per-language notebook, and then executes `analysis/_all_together.ipynb`, which produces the paper's main tables (RQ1 and RQ3).

Note: `_all_together.ipynb` requires results from all six languages; it is executed automatically only when running with `--language ALL`.
Both scripts mount two host directories so outputs are accessible after the run:
| Host path | Contents |
|---|---|
| `./results/` | Raw feature CSVs per language and dataset |
| `./analysis/` | Executed Jupyter notebooks + paper figures (`.pdf`) + paper tables (`.json`) |
The following files in analysis/ correspond directly to paper figures and
tables, generated when the full run completes:
| File | Content | Paper location |
|---|---|---|
| `analysis/4_usage_ocl.pdf` | Keyword usage heatmap — OCL | Figure 2 |
| `analysis/4_usage_python.pdf` | Keyword usage heatmap — Python 3 | Figure 3 |
| `analysis/rq1_table.json` | Coverage and correctness metrics — all languages, LLM-generated datasets | Table 2 |
| `analysis/rq3_table.json` | Coverage and correctness metrics — baseline (non-generated) datasets | Table 3 |
All other figures (coverage bar charts, frequency heatmaps, variance plots,
rule-usage heatmaps) are rendered inline inside the per-language notebooks
(analysis/ocl.ipynb, analysis/python3.ipynb, etc.).
Use this path to run individual languages, inspect intermediate results, or re-run only the notebooks after a previous analysis.
This path also allows you to extend the framework with new languages or models. → Extending the Framework
```bash
pip install -r requirements-docker.txt
```

`experiment/main.py` (analysis) requires no API keys; it only reads pre-included datasets.

`experiment/generate.py` (LLM code generation) requires API keys for cloud-backed models. Set them as environment variables before running:

```bash
# Required for gpt-4o-mini and gpt-4o
export OPENAI_API_KEY='your-openai-key-here'

# Required for DeepSeek-V3.1 and Llama-3.3-70B-Instruct (Azure AI)
export AZUREAI_ENDPOINT_URL='your-azure-endpoint-here'
export AZUREAI_API_KEY='your-azure-key-here'
```

Local models (`deepseek-coder:6.7b`, `llama3.1:latest`) are served via Ollama and require no API key. To adapt the prompting setup to your use case, edit `core/llm_prompting.py`.
Runs keyword and syntactic analysis for one or all languages, writes results
to results/, generates analysis notebooks in analysis/, and executes them.
```bash
python experiment/main.py [options]
```
| Argument | Description | Default |
|---|---|---|
| `--language LANG` | Language to run: `lua` \| `ocl` \| `plantuml` \| `python3` \| `r` \| `sql` \| `ALL` | (required unless `--all-together-notebook`) |
| `--config PATH` | Path to the JSON experiment configuration | `configurations/sle_paper.json` |
| `--notebook-only` | Skip analysis; only execute the notebook(s) for the specified language(s) | off |
| `--all-together-notebook` | Skip language analysis; only execute `analysis/_all_together.ipynb` | off |
Examples:
```bash
# Run analysis for OCL and execute its notebook
python experiment/main.py --language ocl

# Run all languages, then execute all notebooks including _all_together.ipynb
python experiment/main.py --language ALL

# Only re-execute a previously generated notebook
python experiment/main.py --language python3 --notebook-only

# Only re-execute the combined results notebook (requires all language results to exist)
python experiment/main.py --all-together-notebook
```

Must be run from the project root (e.g. `python experiment/main.py`).
Generates LLM code completions for any of the six supported languages and saves them to `generation/results/`. Results must be manually copied to the appropriate `datasets/` subfolder before running `main.py`.[^1]
To add support for additional LLM models, see Extending the Framework.
```bash
python experiment/generate.py [options]
```
| Argument | Description | Default |
|---|---|---|
| `--language LANG` | Language to generate for: `ocl` \| `plantuml` \| `python3` \| `sql` \| `lua` \| `r` \| `ALL` | (required) |
| `--models MODEL [MODEL ...]` | One or more models to use (space-separated) | All six models |
Supported models:
| Model | Provider |
|---|---|
| `gpt-4o-mini` | OpenAI (requires `OPENAI_API_KEY`) |
| `gpt-4o` | OpenAI (requires `OPENAI_API_KEY`) |
| `DeepSeek-V3.1` | Azure AI (requires `AZUREAI_API_KEY`) |
| `Llama-3.3-70B-Instruct` | Azure AI (requires `AZUREAI_API_KEY`) |
| `deepseek-coder:6.7b` | Ollama (local, no key required) |
| `llama3.1:latest` | Ollama (local, no key required) |
Examples:
```bash
# Generate OCL code with all models
python experiment/generate.py --language ocl

# Generate SQL code with two specific models
python experiment/generate.py --language sql --models gpt-4o-mini gpt-4o

# Generate for all languages
python experiment/generate.py --language ALL
```

Generation output paths and their `datasets/` targets:[^1]

| Language | Generated to (`generation/results/`) | Copy to (`datasets/`) |
|---|---|---|
| OCL | `OCL_ds_generated/` | `OCL_ds_generated/` |
| PlantUML | `PlantUML_GoldenUMLmodelset_gentest/<model>/` | `PlantUML_GoldenUMLmodelset_gentest/<model>/` |
| Python 3 | `python_humanEval/` and `python_mbpp/` | `python_humanEval/` and `python_mbpp/` |
| SQL | `sql_spider_generated/` | `sql_spider_generated/` |
| Lua | `lua_multipl-e_generated/humanEval/` and `lua_multipl-e_generated/mbpp/` | `lua_multipl-e_generated/humanEval/` and `lua_multipl-e_generated/mbpp/` |
| R | `r_multipl-e_generated/humanEval/` and `r_multipl-e_generated/mbpp/` | `r_multipl-e_generated/humanEval/` and `r_multipl-e_generated/mbpp/` |
The framework is designed to work with any language that can be described by an ANTLR grammar. The supported languages cover a range of grammar styles — you can use any of them as a starting point:
| Grammar style | Example in this project |
|---|---|
| Single combined grammar | OCL (languages/ocl/parser/OCL.g4) |
| Split lexer + parser | Lua (LuaLexer.g4 + LuaParser.g4), SQL (MariaDBLexer.g4 + MariaDBParser.g4), PlantUML (PlantUMLLexer.g4 + PlantUML.g4), Python 3 (Python3Lexer.g4 + Python3Parser.g4) |
| Grammar with auxiliary filter | R (R.g4 + RFilter.g4) |
| External ANTLR meta-grammar | ANTLR itself (languages/antlr/parser/ANTLRv4Lexer.g4 + ANTLRv4Parser.g4) |
Steps to add a new language:
- Grammar: Place your `.g4` file(s) and the ANTLR-generated Python artifacts (`.py`, `.interp`, `.tokens`) under `languages/<your_lang>/parser/`. Thousands of ready-to-use grammars are available at the ANTLR Grammars repository.
- Analyzer: Create `languages/<your_lang>/analyzer.py` and `lib.py` based on an existing language (e.g. `languages/ocl/`). Adapt the root rule name and any language-specific parametrization (e.g., the initial rule of the grammar). You also need to adapt the visitors (`KeywordFrequencyVisitor.py`, `SequenceVisitor.py`, and `SyntacticAnalisisVisitor.py`) to your language's lexer and parser names.
- Scope: This step can be performed automatically by gpt-4o-mini, which requires the OpenAI API key mentioned above. Alternatively, define `excluded_keywords` and `grouped_keywords` manually, either inline in your run script or in JSON files under `languages/<your_lang>/keywords/`, to narrow the analysis to a relevant subset of the grammar.
- Run script: Create `run/<your_lang>.py` extending `RunAnalysis`. Point it to your dataset and implement `extract_data_from_grammar()` and `run_analysis()` (a minimal sketch follows this list). Once this step runs, you have everything needed to analyze your language; the next step only wires it into `experiment/main.py`.
- Configuration (optional): Add a new entry to `configurations/sle_paper.json` following the schema of the existing languages.
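The following is a minimal sketch of such a run script. The base-class import path, constructor, and attribute names are assumptions; copy an existing script under `run/` (e.g. the OCL one) and adapt it rather than using this verbatim.

```python
# run/<your_lang>.py: hypothetical skeleton; the names below are assumptions.
# Mirror the structure of an existing run script such as run/ocl.py.
from core.run_analysis import RunAnalysis  # actual module path may differ

class MyLangAnalysis(RunAnalysis):
    language = "mylang"
    grammar_dir = "languages/mylang/parser"
    dataset_dir = "datasets/mylang_generated"

    def extract_data_from_grammar(self):
        # Read keywords and production rules from the ANTLR grammar,
        # applying excluded_keywords / grouped_keywords if defined.
        ...

    def run_analysis(self):
        # Parse every file in the dataset with the generated lexer/parser,
        # run the visitors, and write the feature CSVs to results/.
        ...

if __name__ == "__main__":
    MyLangAnalysis().run_analysis()
```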
The framework is intentionally open-ended. Without major modifications, you can:
- Add new LLM models: edit `core/llm_factory.py` to route your model name to the appropriate provider, and add a provider class in `core/llm_prompting.py` if needed. OpenAI-compatible APIs (including local servers such as LM Studio) can be added by subclassing `LLMPrompting`; a hedged sketch follows this list. Then pass the new model name via `experiment/generate.py --models <your_model>`.
- Add new languages: follow the steps in Adding a New Language. Any language with an ANTLR grammar can be integrated.
- Add new datasets: add an entry to `configurations/sle_paper.json` and place the data file in `datasets/`. Supported formats: Parquet, JSON, CSV, CSV folder, or HuggingFace cache.
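As an illustration of the first point, the sketch below adds a provider class for an OpenAI-compatible local server. The `LLMPrompting` interface shown here (a single `prompt()` method) and the constructor signature are assumptions; align them with the real base class in `core/llm_prompting.py`, and remember to register the model name in `core/llm_factory.py`.

```python
# Hypothetical addition to core/llm_prompting.py; the base-class interface
# (a single prompt() method) is an assumption, so align it with the real one.
from openai import OpenAI

class LocalOpenAICompatiblePrompting(LLMPrompting):
    """Provider for OpenAI-compatible local servers such as LM Studio."""

    def __init__(self, model_name: str, base_url: str = "http://localhost:1234/v1"):
        self.model_name = model_name
        # Local OpenAI-compatible servers typically accept any placeholder key.
        self.client = OpenAI(base_url=base_url, api_key="not-needed")

    def prompt(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": text}],
        )
        return response.choices[0].message.content
```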
All LLM-generated code datasets used in the paper are included in the
datasets/ folder. The pre-cached HuggingFace datasets required for Python 3
analysis (datasets/hf_cache/) are also included so no internet access is
needed at run time.
For large or third-party baseline datasets (e.g. MultiPL-T), refer to their original sources if you need to reproduce the baseline analysis or extend it to your language.
[^1]: After generation, file names must be manually renamed to match the unified naming convention used across the project (e.g. `DeepSeek-V3.1_spider.csv` → `deepseek.csv`). This convention unifies model and dataset identifiers across all languages and is required for `main.py` to locate the files correctly.