Validate that supporting text quotes in your data actually appear in their cited references.
This tool fetches scientific publications (currently PubMed/PMC) and verifies that quoted text (supporting_text) can be found in the referenced document using deterministic substring matching.
# Using uv (recommended)
uv pip install linkml-reference-validator
# Using pip
pip install linkml-reference-validator
# Validate a single quote against a reference
linkml-reference-validator validate text \
"protein functions in cell cycle regulation" \
PMID:12345678
# Validate a data file using LinkML validation
linkml-validate -s schema.yaml data.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin
Scientific data often includes claims supported by quotes from publications. But how do you know the quotes are accurate?
Before:
gene_function:
  gene: TP53
  function: "regulates cell cycle"
  evidence:
    reference: PMID:12345678
    supporting_text: "TP53 is critical for cell cycle regulation"  # Is this really in the paper?
After validation:
$ linkml-reference-validator validate text \
"TP53 is critical for cell cycle regulation" \
PMID:12345678
✓ Valid: True
✓ Supporting text validated successfully in PMID:12345678
Note: The CLI was restructured in v1.x to use nested commands (validate text, validate data, cache reference). The old hyphenated commands (validate-text, validate-data, cache-reference) still work for backward compatibility but are deprecated.
Validate a single quote against a reference without needing a schema.
linkml-reference-validator validate text <TEXT> <REFERENCE_ID> [OPTIONS]
Example:
# Basic validation
linkml-reference-validator validate text \
"protein functions in cell cycle regulation" \
PMID:12345678
# With editorial notes (ignored in matching)
linkml-reference-validator validate text \
"protein [X] functions in cell cycle regulation" \
PMID:12345678
# Multi-part quote with omitted text
linkml-reference-validator validate text \
"protein functions ... cell cycle regulation" \
PMID:12345678
Options:
- --cache-dir PATH - Directory for caching references (default: references_cache)
- --verbose - Show detailed validation information
- --help - Show help message
Exit Codes:
- 0 - Validation successful
- 1 - Validation failed
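The exit codes above make the CLI easy to drive from a script. The following is a minimal sketch, assuming the CLI is installed and on PATH; the helper names (`build_validate_text_command`, `exit_code_means_valid`, `quote_is_valid`) are hypothetical and not part of the package.

```python
import subprocess
from typing import List, Optional


def build_validate_text_command(
    text: str, reference_id: str, cache_dir: Optional[str] = None
) -> List[str]:
    """Build the argument list for `linkml-reference-validator validate text`."""
    command = ["linkml-reference-validator", "validate", "text", text, reference_id]
    if cache_dir is not None:
        # --cache-dir is the documented option for choosing the cache location.
        command += ["--cache-dir", cache_dir]
    return command


def exit_code_means_valid(returncode: int) -> bool:
    """Exit code 0 means validation succeeded; anything else is treated as failure."""
    return returncode == 0


def quote_is_valid(text: str, reference_id: str, cache_dir: Optional[str] = None) -> bool:
    """Run the CLI and report whether it exited with code 0."""
    completed = subprocess.run(build_validate_text_command(text, reference_id, cache_dir))
    return exit_code_means_valid(completed.returncode)
```

Passing the arguments as a list avoids shell quoting problems when the quote itself contains brackets or ellipses.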
Validate an entire data file using a LinkML schema.
linkml-reference-validator validate data <DATA_FILE> --schema <SCHEMA> [OPTIONS]
Example:
Schema (gene_schema.yaml):
id: https://example.org/genes
name: gene-schema
classes:
  GeneFunction:
    attributes:
      gene:
        range: string
      function:
        range: string
      evidence:
        range: Evidence
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference  # Marks this as a reference field
      supporting_text:
        range: string
        implements:
          - linkml:excerpt  # Marks this as text to validate
  Reference:
    attributes:
      id:
        identifier: true
        range: string
      title:
        range: string
Data (gene_data.yaml):
gene: TP53
function: "regulates cell cycle"
evidence:
  reference:
    id: PMID:12345678
    title: "TP53 in cell cycle control"
  supporting_text: "TP53 protein functions in cell cycle regulation"
Validation:
linkml-reference-validator validate data \
gene_data.yaml \
--schema gene_schema.yaml
# Output:
Validating gene_data.yaml against schema gene_schema.yaml
Cache directory: references_cache
✓ All validations passed!
Options:
- --schema PATH (required) - Path to LinkML schema
- --target-class CLASS - Specific class to validate
- --cache-dir PATH - Directory for caching references
- --verbose - Show detailed output
- --help - Show help message
Automatically fix or flag supporting text validation errors based on confidence thresholds.
# Repair a single quote
linkml-reference-validator repair text <TEXT> <REFERENCE_ID> [OPTIONS]
# Repair a data file (dry run by default)
linkml-reference-validator repair data <DATA_FILE> --schema <SCHEMA> [OPTIONS]
Example - Single Quote:
# Try to repair a quote with ASCII subscript
linkml-reference-validator repair text "CO2 levels were measured" PMID:12345678
# Output:
# ✓ Repaired successfully
# Original: CO2 levels were measured
# Repaired: CO₂ levels were measured
# Action: CHARACTER_NORMALIZATION
# Confidence: HIGH
Example - Data File:
# Dry run - show what would be changed
linkml-reference-validator repair data disease.yaml \
--schema schema.yaml \
--dry-run
# Apply auto-fixes (creates backup)
linkml-reference-validator repair data disease.yaml \
--schema schema.yaml \
--no-dry-run
# Custom output file
linkml-reference-validator repair data disease.yaml \
--schema schema.yaml \
--no-dry-run \
--output repaired.yaml
Repair Report Output:
============================================================
Repair Report
============================================================
HIGH CONFIDENCE FIXES (auto-applicable):
PMID:12345678 at evidence[0]:
Character normalization fix
'CO2 levels...' → 'CO₂ levels...'
SUGGESTED FIXES (review recommended):
PMID:23456789 at evidence[1]:
Inserted ellipsis between non-contiguous parts
RECOMMENDED REMOVALS (low confidence):
PMID:34567890 at evidence[2]:
Similarity: 8%
Snippet: 'Fabricated text that...'
------------------------------------------------------------
Summary:
Total items: 5
Already valid: 2
Auto-fixes: 1
Suggestions: 1
Removals: 1
Unverifiable: 0
Repair Strategies:
| Strategy | Confidence | Description |
|---|---|---|
| Character Normalization | HIGH | Fix Unicode/symbol differences (CO2→CO₂, +/-→±) |
| Ellipsis Insertion | MEDIUM | Insert ... between non-contiguous text parts |
| Fuzzy Correction | VARIES | Suggest closest matching text from reference |
| Removal | VERY_LOW | Flag fabricated/not-found text for manual removal |
Options:
- --dry-run / --no-dry-run - Show changes without applying (default: dry-run)
- --auto-fix-threshold FLOAT - Minimum similarity for auto-fixes (default: 0.95)
- --output PATH - Output file path (default: overwrite with backup)
- --config PATH - Path to repair configuration file
- --cache-dir PATH - Directory for caching references
- --verbose - Show detailed output
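To illustrate how a similarity threshold can separate auto-fixes from suggestions and removals, here is a minimal sketch. It assumes similarity is a ratio between 0 and 1 and uses Python's `difflib.SequenceMatcher` as a stand-in; the tool's actual similarity measure is not documented here, and the `removal_threshold` parameter and its value of 0.5 are hypothetical. Only the 0.95 default comes from the options above.

```python
from difflib import SequenceMatcher


def similarity(quote: str, candidate: str) -> float:
    """Return a similarity ratio in [0, 1] (stand-in for the tool's own measure)."""
    return SequenceMatcher(None, quote, candidate).ratio()


def classify_repair(
    score: float,
    auto_fix_threshold: float = 0.95,  # documented default of --auto-fix-threshold
    removal_threshold: float = 0.5,    # hypothetical cut-off, for illustration only
) -> str:
    """Map a similarity score to one of the three report buckets."""
    if score >= auto_fix_threshold:
        return "auto-fix"
    if score < removal_threshold:
        return "removal"
    return "suggestion"
```

With these assumed cut-offs, the 8% similarity shown in the sample report would land in the removal bucket.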
Download and cache references for offline use.
linkml-reference-validator cache reference <REFERENCE_ID> [OPTIONS]
Example:
# Cache a single reference
linkml-reference-validator cache reference PMID:12345678
# Output:
Fetching PMID:12345678...
✓ Successfully cached PMID:12345678
Title: TP53 in cell cycle control
Authors: Smith J, Doe A, Johnson K
Content type: full_text_xml
Content length: 45231 characters
Use Cases:
- Pre-fetch references before validation
- Build offline reference library
- Verify reference availability
The recommended way to use this tool is as a LinkML validation plugin with the standard linkml-validate command.
1. Install both packages:
uv pip install linkml linkml-reference-validator
2. Create your schema with interface markers:
# my_schema.yaml
id: https://example.org/my-schema
name: my-schema
prefixes:
  linkml: https://w3id.org/linkml/
classes:
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference  # <-- This marks it as a reference
      supporting_text:
        range: string
        implements:
          - linkml:excerpt  # <-- This marks it as text to validate
3. Validate using linkml-validate:
linkml-validate \
--schema my_schema.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
my_data.yaml
✅ Integrated validation - Combines schema validation + reference validation in one command
✅ Standard LinkML workflow - Uses familiar LinkML tools
✅ Flexible schema design - Works with any schema using the interface pattern
✅ Rich error reporting - Shows exactly where validation fails in your data
reference:
  id: PMID:12345678
supporting_text: "protein functions in cells"
Fetches:
- Abstract (always)
- Full text from PMC (when available)
- Metadata (title, authors, journal, year, DOI)
ID Formats:
- PMID:12345678
- 12345678 (assumes PMID)
- DOI - DOI:10.1038/nature12345
- URLs - Web pages and online documents
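The ID formats listed above can be told apart with a few string checks. This is a minimal sketch under stated assumptions: the function name `parse_reference_id` and the `"URL"` label are hypothetical, and the real parser may treat case and whitespace differently.

```python
from typing import Tuple


def parse_reference_id(reference_id: str) -> Tuple[str, str]:
    """Split a reference ID into (prefix, local_id)."""
    value = reference_id.strip()
    # URLs contain a colon too, so they must be recognised before splitting.
    if value.lower().startswith(("http://", "https://")):
        return ("URL", value)
    # A bare number is assumed to be a PMID, as described above.
    if value.isdigit():
        return ("PMID", value)
    prefix, separator, local_id = value.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"Unrecognised reference ID: {reference_id!r}")
    return (prefix, local_id)
```

Splitting on the first colon only keeps DOIs such as `DOI:10.1038/nature12345` intact.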
For the validator to work, your LinkML schema must:
Use implements: [linkml:authoritative_reference] on slots that contain references:
classes:
  Evidence:
    attributes:
      reference:  # Can be nested object
        range: Reference
        implements:
          - linkml:authoritative_reference
OR use a flat structure:
classes:
  Evidence:
    attributes:
      reference_id:  # Can be flat string
        range: string
        implements:
          - linkml:authoritative_reference
Use implements: [linkml:excerpt] on slots containing quoted text:
classes:
  Evidence:
    attributes:
      supporting_text:  # The quote to validate
        range: string
        implements:
          - linkml:excerpt
If using nested references, define the Reference class:
classes:
  Reference:
    attributes:
      id:
        identifier: true
        range: string
      title:  # Optional: validates if provided
        range: string

evidence:
  reference:
    id: PMID:12345678
    title: "Study of Protein X"
  supporting_text: "protein functions in cell cycle regulation"

evidence:
  reference_id: PMID:12345678
  supporting_text: "protein functions in cell cycle regulation"

statement:
  text: "Protein X has multiple functions"
  evidence:
    - reference:
        id: PMID:11111111
      supporting_text: "protein functions in cell cycle"
    - reference:
        id: PMID:22222222
      supporting_text: "protein regulates DNA repair"

Use square brackets for editorial insertions that should be ignored during matching:
supporting_text: "protein [X] functions in cell cycle regulation"
# Matches: "protein functions in cell cycle regulation"
# Ignores: "X"
Use cases:
- [sic] - Original spelling
- [emphasis added] - Added emphasis
- [gene name] - Clarifications
- [...] - Omitted content markers
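Removing bracketed notes before matching can be sketched with a single regular expression. This is an illustration only, assuming notes are not nested; `strip_editorial_notes` is a hypothetical helper name, not the package's API.

```python
import re

# Matches one bracketed note such as [X] or [emphasis added]; nesting is not handled.
_EDITORIAL_NOTE = re.compile(r"\[[^\]]*\]")


def strip_editorial_notes(text: str) -> str:
    """Drop bracketed editorial notes and collapse the whitespace left behind."""
    without_notes = _EDITORIAL_NOTE.sub(" ", text)
    return " ".join(without_notes.split())
```

A quote that consists only of a bracketed note comes back empty, which leaves nothing to match against the reference.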
Use ellipsis for gaps in quoted text:
supporting_text: "protein functions ... in cell cycle regulation"
# Matches both parts independently:
# - "protein functions"
# - "in cell cycle regulation"
Requirements:
- Both parts must appear in the reference (order independent)
- Each part must be a substring match after normalization
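The requirements above amount to splitting the quote on the ellipsis and checking each part as a substring. Here is a minimal sketch; the helper names are hypothetical, and `simple_normalize` is a simplified stand-in for the tool's normalization (lowercasing, punctuation removal, whitespace collapsing only).

```python
import re
from typing import List


def split_on_ellipsis(text: str) -> List[str]:
    """Split a quote on '...' (or the single-character ellipsis) into non-empty parts."""
    parts = re.split(r"\.{3,}|…", text)
    return [part.strip() for part in parts if part.strip()]


def simple_normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punctuation = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(no_punctuation.split())


def all_parts_found(supporting_text: str, reference_content: str) -> bool:
    """True when every part of the quote is a substring of the normalized reference."""
    haystack = simple_normalize(reference_content)
    parts = [simple_normalize(part) for part in split_on_ellipsis(supporting_text)]
    parts = [part for part in parts if part]
    # Parts are checked independently, so their order in the reference does not matter.
    return bool(parts) and all(part in haystack for part in parts)
```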
Before matching, text is normalized:
- Greek letters spelled out (α→alpha, β→beta, etc.)
- Lowercased
- Punctuation removed
- Extra whitespace collapsed
Examples:
"T-Cell Receptor" → "t cell receptor"
"TP53 (p53) protein" → "tp53 p53 protein"
"α-catenin" → "alpha catenin"
"β-actin" → "beta actin"
"γ-tubulin" → "gamma tubulin"
Greek Letter Support:
All Greek letters (both uppercase and lowercase) are converted to their spelled-out English equivalents. This ensures:
- Bidirectional matching: "α-catenin" in a query matches "alpha-catenin" in the reference, and vice versa
- Preserved distinctions: "α-catenin" and "β-catenin" remain distinct (not collapsed to just "catenin")
- Consistent behavior: Works with any Greek letter commonly used in biomedical nomenclature
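The normalization steps described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name `normalize` and the order of the steps are assumptions, and symbols other than Greek letters (such as subscripts) are left to the punctuation rule.

```python
import re

# Spelled-out names for U+03B1 (alpha) through U+03C9 (omega).
# Index 17 is the word-final sigma, which is also spelled "sigma".
_GREEK_NAMES = [
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
]
_GREEK = {chr(0x3B1 + index): name for index, name in enumerate(_GREEK_NAMES)}


def normalize(text: str) -> str:
    """Lowercase, spell out Greek letters, drop punctuation, collapse whitespace."""
    lowered = text.lower()  # also maps uppercase Greek letters to lowercase
    spelled_out = "".join(_GREEK.get(char, char) for char in lowered)
    no_punctuation = re.sub(r"[^\w\s]", " ", spelled_out)
    return " ".join(no_punctuation.split())
```

Because both sides go through the same function, "α-catenin" and "alpha-catenin" normalize to the same string while "α-catenin" and "β-catenin" stay distinct.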
References are automatically cached to disk to:
- Speed up repeated validations
- Reduce API calls to PubMed
- Enable offline validation
references_cache/
├── PMID_12345678.md
├── PMID_98765432.md
└── PMC_7654321.md
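Judging from the listing above, a cache file name looks like the reference ID with the colon replaced by an underscore plus a `.md` extension. The sketch below encodes only that observation; `cache_path_for` is a hypothetical helper, and how the tool names files for IDs containing other characters (for example the slash in a DOI) is not covered.

```python
from pathlib import Path


def cache_path_for(reference_id: str, cache_dir: str = "references_cache") -> Path:
    """Guess the cache file path for a reference ID such as 'PMID:12345678'."""
    # Assumption: ':' becomes '_' and the file uses the Markdown extension.
    safe_name = reference_id.replace(":", "_")
    return Path(cache_dir) / f"{safe_name}.md"
```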
Cache files are stored as Markdown with YAML frontmatter for easy readability and compatibility:
---
reference_id: PMID:12345678
title: TP53 in cell cycle control
authors:
- Smith J
- Doe A
- Johnson K
journal: Nature
year: '2024'
doi: 10.1038/nature12345
content_type: full_text_xml
---
# TP53 in cell cycle control
**Authors:** Smith J, Doe A, Johnson K
**Journal:** Nature (2024)
**DOI:** [10.1038/nature12345](https://doi.org/10.1038/nature12345)
## Content
[Full text content follows...]
Note: The validator still supports reading legacy .txt format cache files for backward compatibility.
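Because cache files are Markdown with a frontmatter block delimited by `---` lines, they can be inspected with a few lines of standard-library Python. This is a minimal sketch with hypothetical helper names; it reads only simple `key: value` lines, so list-valued keys such as `authors` need a real YAML parser.

```python
from typing import Optional, Tuple


def split_frontmatter(markdown_text: str) -> Tuple[str, str]:
    """Return (frontmatter, body); frontmatter is empty if no '---' block is found."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ("", markdown_text)
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return (frontmatter, body)
    return ("", markdown_text)


def read_scalar(frontmatter: str, key: str) -> Optional[str]:
    """Read a top-level 'key: value' line, stripping surrounding quotes."""
    prefix = f"{key}:"
    for line in frontmatter.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip().strip("'\"")
    return None
```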
# Use custom cache directory
linkml-reference-validator validate text \
"quote" PMID:123 \
--cache-dir /path/to/cache
# Pre-cache references
linkml-reference-validator cache reference PMID:12345678
# Force re-fetch (bypass cache)
linkml-reference-validator cache reference PMID:12345678 --force
Schema (gene.yaml):
id: https://example.org/genes
name: gene-schema
classes:
  GeneFunctionStatement:
    tree_root: true
    attributes:
      gene_symbol:
        range: string
      function_description:
        range: string
      evidence:
        range: Evidence
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference
      supporting_text:
        range: string
        implements:
          - linkml:excerpt
  Reference:
    attributes:
      id:
        identifier: true
Data (tp53.yaml):
gene_symbol: TP53
function_description: "tumor suppressor"
evidence:
  reference:
    id: PMID:12345678
  supporting_text: "TP53 functions as a tumor suppressor"
Validation:
linkml-validate \
--schema gene.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
tp53.yaml
# Check if a quote is in a paper
linkml-reference-validator validate text \
"protein kinase activity regulates cell proliferation" \
PMID:12345678
# With editorial note
linkml-reference-validator validate text \
"protein kinase [PKA] activity regulates cell proliferation" \
PMID:12345678
# Multi-part quote
linkml-reference-validator validate text \
"protein kinase activity ... regulates cell proliferation" \
PMID:12345678
Data (gene_annotations.yaml):
- gene_symbol: BRCA1
  annotations:
    - function: "DNA repair"
      evidence:
        reference:
          id: PMID:11111111
        supporting_text: "BRCA1 plays a critical role in DNA repair"
    - function: "tumor suppressor"
      evidence:
        reference:
          id: PMID:22222222
        supporting_text: "BRCA1 functions as a tumor suppressor"
- gene_symbol: TP53
  annotations:
    - function: "cell cycle regulation"
      evidence:
        reference:
          id: PMID:33333333
        supporting_text: "TP53 regulates cell cycle checkpoints"
Validation:
linkml-reference-validator validate data \
gene_annotations.yaml \
--schema gene_schema.yaml \
--verbose
# Output shows validation for each reference:
# ✓ PMID:11111111 - "BRCA1 plays a critical role in DNA repair"
# ✓ PMID:22222222 - "BRCA1 functions as a tumor suppressor"
# ✓ PMID:33333333 - "TP53 regulates cell cycle checkpoints"
✅ Exact substring match (after normalization)
supporting_text: "protein functions in cells"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - exact substring found
✅ Multi-part match
supporting_text: "protein functions ... during mitosis"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - both parts found
✅ Editorial notes ignored
supporting_text: "protein [X] functions"
reference_content: "The protein functions in cells."
# ✓ PASS - [X] ignored in matching
✅ Case and punctuation normalized
supporting_text: "T-Cell Receptor"
reference_content: "The t cell receptor binds antigens."
# ✓ PASS - normalized to "t cell receptor"
❌ Text not in reference
supporting_text: "protein inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - "inhibits apoptosis" not found
❌ Partial multi-part match
supporting_text: "protein functions ... inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - second part not found
❌ Reference not accessible
supporting_text: "any quote"
reference_id: PMID:99999999
# ✗ FAIL - reference doesn't exist or can't be fetched
❌ Title mismatch (when title provided)
reference:
  id: PMID:12345678
  title: "Wrong Title"
supporting_text: "correct quote"
# ✗ FAIL - title doesn't match fetched reference
You can create a .linkml-reference-validator.yaml file in your project root to configure validation behavior:
validation:
  cache_dir: references_cache
  rate_limit_delay: 0.5
  # Skip validation for specific prefixes (useful for unsupported reference types)
  skip_prefixes:
    - SRA  # Sequence Read Archive
    - MGNIFY  # MGnify database
    - BIOPROJECT  # NCBI BioProject (currently has API issues)
  # Control severity for unfetchable references
  unknown_prefix_severity: WARNING  # Options: ERROR, WARNING, INFO
  # Map alternate prefixes to canonical ones
  reference_prefix_map:
    geo: GEO
    NCBIGeo: GEO
skip_prefixes is a list of reference prefixes to skip during validation. References with these prefixes return is_valid=True with INFO severity, so validation passes without blocking your workflow.
Use cases:
- Unsupported reference types (SRA, MGnify, etc.)
- References that are temporarily unavailable
- Third-party databases without registered handlers
Example:
validation:
  skip_prefixes:
    - SRA
    - MGNIFY
    - BIOPROJECT
With this configuration:
# These will pass validation with INFO severity
linkml-reference-validator validate text "some text" SRA:PRJNA290729
# ✓ Valid: True (INFO) - Skipping validation for reference with prefix 'SRA'
linkml-reference-validator validate text "some text" MGNIFY:MGYS00000596
# ✓ Valid: True (INFO) - Skipping validation for reference with prefix 'MGNIFY'
unknown_prefix_severity controls the severity level for references that cannot be fetched (unsupported prefix, network error, etc.). Default: ERROR
Options:
- ERROR (default) - Validation fails, blocking workflow
- WARNING - Validation fails but with lower severity
- INFO - Validation fails but logged as informational
Note: skip_prefixes takes precedence over unknown_prefix_severity. If a prefix is in skip_prefixes, it will return is_valid=True with INFO severity regardless of this setting.
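The precedence rule in the note can be written as a small decision function. This is a sketch only: `outcome_without_fetch` is a hypothetical name, and exact, case-sensitive prefix comparison is an assumption. It describes the result for a reference that is either skipped or cannot be fetched.

```python
from typing import Iterable, Tuple


def outcome_without_fetch(
    reference_id: str,
    skip_prefixes: Iterable[str],
    unknown_prefix_severity: str = "ERROR",
) -> Tuple[bool, str]:
    """Return (is_valid, severity) for a skipped or unfetchable reference."""
    prefix = reference_id.partition(":")[0]
    # skip_prefixes wins: the reference passes with INFO regardless of the severity setting.
    if prefix in set(skip_prefixes):
        return (True, "INFO")
    # Otherwise validation fails, reported at the configured severity.
    return (False, unknown_prefix_severity)
```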
Example:
validation:
  skip_prefixes:
    - SRA  # These will be skipped (is_valid=True, INFO)
  unknown_prefix_severity: WARNING  # Other unfetchable refs get WARNING
With this configuration:
# SRA is skipped (from skip_prefixes)
linkml-reference-validator validate text "text" SRA:PRJNA290729
# ✓ Valid: True (INFO) - Skipping validation
# UNKNOWN prefix gets WARNING severity
linkml-reference-validator validate text "text" UNKNOWN:12345
# ✗ Valid: False (WARNING) - Could not fetch reference
source_extra_fields captures additional fields from reference API responses and appends them to cached content so they are included in validation. Keys are source prefixes (e.g. clinicaltrials, PMID, DOI, GEO); values map a field name to a JSONPath expression into the raw API response. Prefer paths to a single value (string/number). If the path selects a list, its elements are converted to strings and joined with spaces. If it selects an object or other type, its string representation is used.
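The conversion rules for selected values can be illustrated in a few lines. This sketch covers only the value-to-text step described above; evaluating the JSONPath expression itself is left out, and `extra_field_to_text` is a hypothetical helper name.

```python
from typing import Any


def extra_field_to_text(value: Any) -> str:
    """Turn a value selected from an API response into text for the cache."""
    if isinstance(value, list):
        # List elements are converted to strings and joined with spaces.
        return " ".join(str(item) for item in value)
    # Strings pass through unchanged; numbers, objects and other types use str().
    return str(value)
```

This is why a path that selects a single string is the safest choice: lists lose their structure and objects end up as their raw string representation.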
Example:
Save as my-config.yaml:
validation:
  source_extra_fields:
    clinicaltrials:
      eligibility: "$.protocolSection.eligibilityModule.eligibilityCriteria"
      outcomes: "$.protocolSection.outcomesModule.primaryOutcomes"
Pass this config when fetching so the cache includes these sections: use --config my-config.yaml with cache reference or validate. Captured field names are stored in extra_fields_captured in the cache frontmatter.
# Fetch and cache a trial with extra fields (eligibility, outcomes)
linkml-reference-validator cache reference clinicaltrials:NCT00000001 --config my-config.yaml
# Validate text against the cached content (including extra sections)
linkml-reference-validator validate text "Inclusion: age >= 18" clinicaltrials:NCT00001372 --config my-config.yaml
Default cache directory: references_cache/ in the current directory
# Custom cache location
export REFERENCE_CACHE_DIR=/path/to/cache
linkml-reference-validator validate text "quote" PMID:123
# Or use CLI option
linkml-reference-validator validate text "quote" PMID:123 \
--cache-dir /path/to/cache
The tool respects NCBI API rate limits (3 requests/second without API key).
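One common way to stay under a requests-per-second limit is to enforce a minimum delay between calls, which is what a setting like rate_limit_delay suggests. The class below is a minimal sketch of that idea, not the tool's implementation; `MinIntervalLimiter` is a hypothetical name, and the injectable clock and sleep functions exist only to make it testable.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the interval has elapsed; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept
```

Calling `wait()` before each request with `min_interval=0.5` yields at most two requests per second, which stays below the limit quoted above.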
Optional: Set email for NCBI Entrez (recommended):
export NCBI_EMAIL="your.email@example.com"
Optional: Use NCBI API key for higher rate limits:
export NCBI_API_KEY="your_api_key_here"
Causes:
- PMID doesn't exist
- Network connectivity issues
- NCBI API temporarily unavailable
Solutions:
# Verify PMID exists on PubMed
# Check network connection
# Try again later (NCBI may be down)
Causes:
- Abstract not available
- Article behind paywall (no PMC access)
- Retracted article
Solutions:
# Check if article has abstract on PubMed
# Look for PMC full text availability
# Try a different reference
Causes:
- Quote is incorrect or paraphrased
- Text only in figures/tables (not extracted)
- Text uses different terminology
- Unicode characters normalized out
Solutions:
# Verify exact quote from PDF/HTML
# Try shorter, more specific quote
# Check if text is in figure caption
# Use editorial notes for differences: "protein [X] functions"
Cause:
- Entire supporting_text is in brackets: "[editorial note]"
Solution:
# Include actual quote text
supporting_text: "protein functions [in cells]"
# Not just: "[editorial note]"
- First validation: ~2-3 seconds (includes fetch + cache)
- Cached validation: ~10-50ms
- Batch validation: ~50ms per reference (cached)
- Pre-cache references:
  # Cache all references before validation
  for pmid in PMID:111 PMID:222 PMID:333; do
    linkml-reference-validator cache reference "$pmid"
  done
- Reuse cache directory:
  # Share cache across projects
  export REFERENCE_CACHE_DIR=~/.reference_cache
- Use verbose mode to see what's slow:
  linkml-reference-validator validate data data.yaml \
    --schema schema.yaml \
    --verbose
# Clone repository
git clone https://github.com/linkml/linkml-reference-validator
cd linkml-reference-validator
# Install with dev dependencies
uv sync --group dev
# Run tests
just test
# Run specific test
uv run pytest tests/test_cli.py::test_validate_text_command_success
linkml-reference-validator/
├── src/linkml_reference_validator/
│ ├── cli.py # CLI commands
│ ├── models.py # Data models
│ ├── validation/
│ │ └── supporting_text_validator.py # Core validation logic
│ ├── etl/
│ │ └── reference_fetcher.py # Reference fetching
│ └── plugins/
│ └── reference_validation_plugin.py # LinkML plugin
├── tests/
│ ├── fixtures/ # Test reference files
│ ├── test_cli.py # CLI tests
│ ├── test_e2e_integration.py # End-to-end tests
│ └── ...
├── justfile # Development commands
└── pyproject.toml # Project configuration
# All tests
just test
# Just pytest
just pytest
# With coverage
uv run pytest --cov=src/linkml_reference_validator
# Specific test file
uv run pytest tests/test_cli.py
# Doctests
just doctest
While the CLI is recommended, you can also use the Python API:
from linkml_reference_validator.validation.supporting_text_validator import (
    SupportingTextValidator
)
from linkml_reference_validator.models import ReferenceValidationConfig

# Create validator
config = ReferenceValidationConfig(cache_dir="my_cache")
validator = SupportingTextValidator(config)

# Validate text
result = validator.validate(
    supporting_text="protein functions in cell cycle regulation",
    reference_id="PMID:12345678",
)
print(result.is_valid)  # True/False
print(result.message)  # Validation message
- PubMed only - Currently only supports PMID references (DOI and URLs coming soon)
- Text extraction - Only extracts text from abstracts and main article text (not figures, tables, or supplementary materials)
- Unicode normalization - Greek letters are spelled out during normalization (e.g., α → alpha, β → beta); other special symbols may be removed
- No fuzzy matching - Uses deterministic substring matching only (intentional design choice)
- English-centric - Text normalization assumes English text
- Greek letters: "α-catenin" matches "alpha catenin" (but not "β-catenin" or plain "catenin")
- Chemical formulas: "H₂O" becomes "h o" or "h2o"
- Hyphens: "T-cell" matches "t cell"
- Abbreviations: Must match exactly as they appear (normalized)
| Manual | linkml-reference-validator |
|---|---|
| ❌ Time consuming | ✅ Automated |
| ❌ Error prone | ✅ Consistent |
| ❌ Not scalable | ✅ Validates 100s of quotes |
| ❌ Not reproducible | ✅ Cached, versioned |
linkml-reference-validator uses deterministic substring matching, not fuzzy matching:
✅ Predictable - Same input always gives same result
✅ Explainable - Easy to understand why validation passed/failed
✅ No false positives - Won't accept paraphrased text
✅ Fast - No complex similarity calculations
Contributions welcome! See CONTRIBUTING.md for guidelines.
- DOI support
- URL/webpage support
- Better Unicode handling
- Performance improvements for large batches
- More comprehensive error messages
If you use this tool in your research, please cite:
@software{linkml_reference_validator,
  title = {linkml-reference-validator: Validation of supporting text from references},
  author = {Mungall, Chris},
  year = {2024},
  url = {https://github.com/linkml/linkml-reference-validator}
}
Apache 2.0 - see LICENSE
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Full Documentation
- LinkML - Modeling language for linked data
- linkml-validator - Core LinkML validation
- ai-gene-reviews - Inspiration for this project