feat: add logs-to-training pipeline prototype (PII-safe ingest, SFT/DPO export, validation, split integrity) #4

Open
theUtkarshRaj wants to merge 5 commits into OpenAgriNet:main from theUtkarshRaj:feat/logs-to-training-pipeline


Conversation

@theUtkarshRaj

Summary

This PR adds an implementation-first prototype for transforming production QA + agentic logs into privacy-safe, trajectory-aware training datasets.

Included

  • Canonical Langfuse/Pydantic log schema
  • PII redaction with stable placeholders
  • Session segmentation
  • Complexity tagging
  • Tool consistency validation
  • LoRA-ready SFT export
  • DPO candidate generation + hard negatives
  • Split integrity safeguards (disjoint SFT/DPO/eval)
  • Gold examples
  • Validation CLI
  • CI + tests
  • Apache-2.0 licensing
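
To illustrate what "stable placeholders" means in the redaction item above, here is a minimal sketch rather than the code in this PR. It assumes placeholders of the form `<TYPE_N>` and uses two toy regex patterns; `StableRedactor` and `PATTERNS` are hypothetical names introduced only for this example.

```python
import re
from collections import defaultdict

# Hypothetical toy patterns for illustration; the PR's real recognizers differ.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"(?<!\d)\d{10}(?!\d)"),
}


class StableRedactor:
    """Maps each distinct PII value to the same placeholder every time it appears."""

    def __init__(self):
        self._seen = {}                  # (entity_type, value) -> placeholder
        self._counts = defaultdict(int)  # entity_type -> highest index issued

    def _placeholder(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._seen:
            self._counts[entity_type] += 1
            self._seen[key] = f"<{entity_type}_{self._counts[entity_type]}>"
        return self._seen[key]

    def redact(self, text):
        for entity_type, pattern in PATTERNS.items():
            # Bind entity_type as a default argument so each lambda keeps its own type.
            text = pattern.sub(
                lambda m, t=entity_type: self._placeholder(t, m.group(0)), text
            )
        return text


redactor = StableRedactor()
print(redactor.redact("Call 9876543210 or mail ravi@example.com"))
# Call <PHONE_NUMBER_1> or mail <EMAIL_1>
print(redactor.redact("9876543210 again, and 9123456780"))
# <PHONE_NUMBER_1> again, and <PHONE_NUMBER_2>
```

Reusing one redactor instance across the turns of a session keeps references to the same entity consistent, so the trajectory remains readable after redaction.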

Motivation

This aligns directly with the DMP 2026 project goal of converting production logs into safe, trajectory-aware datasets suitable for SFT (LoRA) and DPO workflows, with emphasis on data quality, privacy, and downstream training readiness.

Validation

  • pytest passing
  • ruff clean
  • GitHub Actions CI included

Notes

This is designed as a modular prototype aligned with midpoint milestone goals. If preferred, I’m happy to split this into smaller PRs (core pipeline / validation / DPO extensions) for easier review.

theUtkarshRaj and others added 3 commits May 3, 2026 03:03
…te CLI)

Adds logs_to_training package: Langfuse-style ingest, redaction, complexity tags, tool consistency checks, LoRA-ready SFT JSONL, DPO candidates with hard-negative templates, disjoint split export, gold_examples fixtures, and pytest coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@theUtkarshRaj
Author

I built this as a full implementation-first prototype of the complete DMP midpoint slice. Beyond baseline ingest/export, it puts particular emphasis on the downstream governance layers:

  • split integrity (disjoint SFT / DPO / eval)
  • validation CLI
  • hard-negative DPO generation
  • gold example fixtures
  • CI / packaging

My intent was to complement core prototype directions with stronger training-readiness, quality safeguards, and reviewer tooling. Happy to modularize or split advanced layers if narrower review boundaries are preferred.
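
For reviewers who want a concrete picture of the split-integrity idea, the sketch below shows one common way to keep SFT / DPO / eval disjoint: assign whole sessions to a split by hashing the session ID, then verify that no ID appears in two splits. It is an illustration under assumptions, not the code in this PR; the `session_id` field, the 80/10/10 ratios, and the function names are hypothetical.

```python
import hashlib

# Hypothetical split ratios, chosen only for this example.
SPLITS = (("sft", 0.8), ("dpo", 0.1), ("eval", 0.1))


def assign_split(session_id: str) -> str:
    """Deterministically map a session ID to exactly one split."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if bucket < cumulative:
            return name
    return SPLITS[-1][0]  # guards against floating-point rounding in the sum


def split_records(records):
    """Group records by split; all records of a session land in the same split."""
    out = {name: [] for name, _ in SPLITS}
    for rec in records:
        out[assign_split(rec["session_id"])].append(rec)
    return out


def assert_disjoint(splits):
    """Raise ValueError if any session ID appears in more than one split."""
    owner = {}
    for name, recs in splits.items():
        for rec in recs:
            sid = rec["session_id"]
            if owner.setdefault(sid, name) != name:
                raise ValueError(f"session {sid!r} is in both {owner[sid]} and {name}")


records = [{"session_id": f"s{i % 50}", "turn": i} for i in range(200)]
splits = split_records(records)
assert_disjoint(splits)  # passes: assignment is by session, not by record
```

Hashing the session ID, rather than shuffling individual records, means a session cannot leak turns into eval, and the assignment stays the same across reruns.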

…on; fix placeholder double-redaction

- Custom PatternRecognizer for +91/0091 and bare 10-digit Indian mobiles
  (scores 0.90 / 0.75) so it wins over US_DRIVER_LICENSE and UK_NHS
- _ENTITY_TYPE_NORM expanded: US_DRIVER_LICENSE, UK_NINO, UK_NHS → PHONE_NUMBER;
  IN_PAN, IN_AADHAAR → GOVT_ID — wrong region labels never reach placeholders
- _apply_regex_redactions split on _PLACEHOLDER_RE so the regex pass never
  re-enters already-inserted placeholder tokens (fixes <X_1>UMBER_1>BER_1> corruption)
- _presidio_redact deduplicates overlapping Presidio spans (keep highest score)
  to prevent position-based string corruption from multi-recognizer same-span hits
- _ORPHANED_CC_RE strips stranded +CC prefixes left when Presidio redacts only
  the digit span (e.g. '+91 <PHONE_NUMBER_1>' → '<PHONE_NUMBER_1>')
- New Presidio-specific tests use pytest.importorskip so CI passes without Presidio
- .claude/ added to .gitignore
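
Two of the fixes listed above can be illustrated with a short sketch: keeping only the highest-scoring span when recognizers overlap, and running a regex pass only on the text between existing placeholders. This is a simplified illustration, not the code from the commit. The placeholder shape `<TYPE_N>`, the `(start, end, entity_type, score)` tuple layout, the greedy selection strategy, and all names below are assumptions.

```python
import re

# Assumed placeholder shape, e.g. <PHONE_NUMBER_1>. The capture group makes
# re.split() keep the placeholders in the result (at odd indices).
PLACEHOLDER_RE = re.compile(r"(<[A-Z_]+_\d+>)")


def sub_outside_placeholders(text, pattern, replacement):
    """Apply pattern.sub only to text that is not already a placeholder."""
    parts = PLACEHOLDER_RE.split(text)
    for i in range(0, len(parts), 2):  # even indices hold ordinary text
        parts[i] = pattern.sub(replacement, parts[i])
    return "".join(parts)


def dedupe_spans(spans):
    """Greedily keep the highest-scoring span from each overlapping group.

    spans: iterable of (start, end, entity_type, score) with end exclusive.
    """
    kept = []
    for span in sorted(spans, key=lambda s: (-s[3], s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s[0])


def apply_spans(text, spans):
    """Replace deduplicated spans right to left so earlier offsets stay valid."""
    for start, end, entity_type, _ in reversed(dedupe_spans(spans)):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


digits = re.compile(r"\d+")
print(digits.sub("#", "id 42 <PHONE_NUMBER_1> ok"))
# id # <PHONE_NUMBER_#> ok   (naive pass corrupts the placeholder)
print(sub_outside_placeholders("id 42 <PHONE_NUMBER_1> ok", digits, "#"))
# id # <PHONE_NUMBER_1> ok

hits = [(5, 15, "US_DRIVER_LICENSE", 0.4), (5, 15, "PHONE_NUMBER", 0.9)]
print(apply_spans("call 9876543210 now", hits))
# call <PHONE_NUMBER> now
```

Without deduplication, applying both same-span hits by position would splice the second replacement into the output of the first, which is the kind of string corruption the commit message describes.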
theUtkarshRaj force-pushed the feat/logs-to-training-pipeline branch from 7dad578 to d60d201 on May 26, 2026 13:21