GlossAPI

GlossAPI is a GPU-ready document processing pipeline from GFOSS that turns academic PDFs into structured Markdown, cleans noisy text with Rust extensions, and optionally enriches math/code content.

Why GlossAPI

  • Handles download → extraction → cleaning → sectioning in one pipeline.
  • Ships safe PyPDFium extraction plus Docling/RapidOCR for high-throughput OCR.
  • Rust-powered cleaner/noise metrics keep Markdown quality predictable.
  • Greek-first metadata and section classification tuned for academic corpora.
  • Modular Corpus API lets you resume from any stage or plug into existing flows.

Quickstart (local repo)

git clone https://github.com/eellak/glossAPI.git
cd glossAPI
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Run the lightweight PDF corpus (no GPU/Docling required)
python - <<'PY'
from pathlib import Path
from glossapi import Corpus

input_dir = Path("samples/lightweight_pdf_corpus/pdfs")
output_dir = Path("artifacts/lightweight_pdf_run")
output_dir.mkdir(parents=True, exist_ok=True)

corpus = Corpus(input_dir, output_dir)
corpus.extract(input_format="pdf")  # Safe PyPDFium backend by default
PY
  • Compare the generated Markdown in artifacts/lightweight_pdf_run/markdown/ with samples/lightweight_pdf_corpus/expected_outputs.json for a fast smoke check (a scripted comparison follows this list).
  • Rebuild the corpus anytime with python samples/lightweight_pdf_corpus/generate_pdfs.py.
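
If you want the comparison scripted, something like the snippet below works. It assumes expected_outputs.json maps each PDF stem to its expected Markdown; adjust the lookup to the file's actual schema.

import json
from pathlib import Path

# Load the expected outputs shipped with the sample corpus.
expected = json.loads(Path("samples/lightweight_pdf_corpus/expected_outputs.json").read_text())

# Compare each generated Markdown file against its expected counterpart.
for md_file in sorted(Path("artifacts/lightweight_pdf_run/markdown").glob("*.md")):
    status = "matches" if expected.get(md_file.stem) == md_file.read_text() else "differs"
    print(f"{md_file.stem}: {status}")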

Corpus usage contract

Corpus is the organizing surface: keep contributions wired through the phase methods (download(), extract(), clean(), ocr(), section(), annotate(), and the export/jsonl*() helpers). The intended use is a short script chaining those calls, as in the sketch below; avoid bespoke monkeypatches or side channels so resumability and the artifact layout stay consistent.
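
A minimal sketch of that shape, reusing the sample corpus from the quickstart (the argument-free calls assume the phase defaults are acceptable for your corpus):

from pathlib import Path
from glossapi import Corpus

# Each phase reads the artifacts the previous phase wrote under the output directory.
corpus = Corpus(Path("samples/lightweight_pdf_corpus/pdfs"), Path("artifacts/contract_run"))
corpus.extract(input_format="pdf")  # Phase-1 extraction to Markdown
corpus.clean()                      # Rust-backed noise filtering
corpus.section()                    # build sections_for_annotation.parquet
corpus.annotate()                   # classify the extracted sections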

Automated Environment Profiles

Use dependency_setup/setup_glossapi.sh to provision a virtualenv with the right dependency stack for the three supported modes:

# Vanilla pipeline (no GPU OCR extras)
./dependency_setup/setup_glossapi.sh --mode vanilla --venv dependency_setup/.venvs/vanilla --run-tests

# Docling + RapidOCR mode
./dependency_setup/setup_glossapi.sh --mode rapidocr --venv dependency_setup/.venvs/rapidocr --run-tests

# DeepSeek OCR mode (requires weights under /path/to/deepseek-ocr/DeepSeek-OCR)
./dependency_setup/setup_glossapi.sh \
  --mode deepseek \
  --venv dependency_setup/.venvs/deepseek \
  --weights-dir /path/to/deepseek-ocr \
  --run-tests --smoke-test

Pass --download-deepseek if you need the script to fetch weights automatically; otherwise it looks for ${REPO_ROOT}/deepseek-ocr/DeepSeek-OCR unless you override the location with --weights-dir. Check dependency_setup/dependency_notes.md for the latest pins, caveats, and validation history. The script also installs the Rust extensions in editable mode so local changes are picked up immediately.

DeepSeek runtime checklist

  • Run python -m glossapi.ocr.deepseek.preflight (from your DeepSeek venv) to fail fast if the CLI would fall back to the stub.
  • Export these to force the real CLI and avoid silent stub output (a scripted version follows this checklist):
    • GLOSSAPI_DEEPSEEK_ALLOW_CLI=1
    • GLOSSAPI_DEEPSEEK_ALLOW_STUB=0
    • GLOSSAPI_DEEPSEEK_VLLM_SCRIPT=/path/to/deepseek-ocr/run_pdf_ocr_vllm.py
    • GLOSSAPI_DEEPSEEK_TEST_PYTHON=/path/to/deepseek/venv/bin/python
    • GLOSSAPI_DEEPSEEK_MODEL_DIR=/path/to/deepseek-ocr/DeepSeek-OCR
    • GLOSSAPI_DEEPSEEK_LD_LIBRARY_PATH=/path/to/libjpeg-turbo/lib
  • Ensure a CUDA toolkit with nvcc is available (the FlashInfer/vLLM JIT falls back poorly without it); set CUDA_HOME and prepend $CUDA_HOME/bin to PATH.
  • If FlashInfer is problematic, disable it with VLLM_USE_FLASHINFER=0 and FLASHINFER_DISABLE=1.
  • To avoid FP8 KV cache issues, export GLOSSAPI_DEEPSEEK_NO_FP8_KV=1 (propagates --no-fp8-kv).
  • Tune VRAM use via GLOSSAPI_DEEPSEEK_GPU_MEMORY_UTILIZATION=<0.5–0.9>.
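
To run the preflight with that environment applied consistently, a small driver like the following works (same placeholder paths as the checklist; substitute your real locations):

import os
import subprocess

# Same placeholder paths as above; point these at your actual checkout and venv.
deepseek_python = "/path/to/deepseek/venv/bin/python"
env = dict(
    os.environ,
    GLOSSAPI_DEEPSEEK_ALLOW_CLI="1",
    GLOSSAPI_DEEPSEEK_ALLOW_STUB="0",
    GLOSSAPI_DEEPSEEK_VLLM_SCRIPT="/path/to/deepseek-ocr/run_pdf_ocr_vllm.py",
    GLOSSAPI_DEEPSEEK_TEST_PYTHON=deepseek_python,
    GLOSSAPI_DEEPSEEK_MODEL_DIR="/path/to/deepseek-ocr/DeepSeek-OCR",
    GLOSSAPI_DEEPSEEK_LD_LIBRARY_PATH="/path/to/libjpeg-turbo/lib",
)
# check=True makes the driver fail fast if the preflight reports a stub fallback.
subprocess.run([deepseek_python, "-m", "glossapi.ocr.deepseek.preflight"], env=env, check=True)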

Choose Your Install Path

  • Pip users: pip install glossapi. Fast vanilla evaluation with minimal dependencies.
  • Mode automation (recommended): ./dependency_setup/setup_glossapi.sh --mode {vanilla|rapidocr|deepseek}. Creates an isolated venv per mode, installs the Rust crates, and can run the relevant pytest subset.
  • Manual editable install: pip install -e . after cloning. Keep this if you prefer to manage dependencies by hand.
  • Conda-based stacks: scripts/setup_conda.sh. Provisions a Python 3.10 env plus Rust and an editable install for Amazon Linux/SageMaker.

See the refreshed docs (docs/index.md) for detailed environment notes, CUDA/ORT combinations, and troubleshooting tips.

Repo Landmarks

  • docs/code_map.md: fast map from pipeline ideas to implementing classes and files.
  • docs/pipeline.md: stage contracts, key parameters, and artifact outputs.
  • samples/lightweight_pdf_corpus/: 20 one-page PDFs with manifest + expected Markdown.
  • src/glossapi/: Corpus pipeline, cleaners, and orchestration logic.
  • tests/test_pipeline_smoke.py: Minimal regression entry point (uses the lightweight corpus).
  • docs/: MkDocs site with onboarding, pipeline recipes, and configuration guides.

Pipeline map

Use this as the shortest path from a documentation concept to the public call that implements it.

  • Download, via Corpus.download(...). Key parameters: input_parquet, links_column, parallelize_by, downloader kwargs. Writes downloads/ and download_results/*.parquet.
  • Extract (Phase-1), via Corpus.extract(...). Key parameters: input_format, phase1_backend, force_ocr, use_gpus, export_doc_json, emit_formula_index. Writes markdown/<stem>.md, json/<stem>.docling.json(.zst), and json/metrics/*.json.
  • Clean, via Corpus.clean(...). Key parameters: threshold, drop_bad, empty_char_threshold, empty_min_pages. Writes clean_markdown/<stem>.md and updates the parquet metrics/flags.
  • OCR / math follow-up, via Corpus.ocr(...). Key parameters: mode, fix_bad, math_enhance, use_gpus, devices. Refreshes markdown/<stem>.md and optionally writes json/<stem>.latex_map.jsonl.
  • Section, via Corpus.section(). Uses the cleaner/parquet outputs to choose inputs. Writes sections/sections_for_annotation.parquet.
  • Annotate, via Corpus.annotate(...). Key parameters: annotation_type, fully_annotate. Writes classified_sections.parquet and fully_annotated_sections.parquet.
  • Triage math density, via Corpus.triage_math(). No required arguments. Updates the routing columns in download_results/*.parquet.
  • JSONL export, via Corpus.jsonl(...). Key parameter: output_path. Writes the merged training/export JSONL.
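
Because every stage persists its artifacts under the output directory, a later run can resume mid-pipeline. A sketch, assuming a previous run already produced extraction and cleaning artifacts under artifacts/contract_run:

from pathlib import Path
from glossapi import Corpus

# Point Corpus at the existing output directory and resume from sectioning onward.
corpus = Corpus(Path("samples/lightweight_pdf_corpus/pdfs"), Path("artifacts/contract_run"))
corpus.section()
corpus.annotate()
corpus.jsonl(output_path="artifacts/contract_run/export.jsonl")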

Contributing

  • Run pytest tests/test_pipeline_smoke.py for a fast end-to-end check.
  • Regenerate the lightweight corpus via generate_pdfs.py and commit the updated PDFs + manifest together.
  • Prefer uv or pip editable installs so Rust extensions rebuild locally.

Open an issue or PR if you spot drift between expected outputs and the pipeline, or if you have doc updates for the new Divio skeleton.

License

This project is licensed under the EUPL 1.2.