feat: add logs-to-training pipeline prototype (PII-safe ingest, SFT/DPO export, validation, split integrity) by theUtkarshRaj · Pull Request #4 · OpenAgriNet/training_setup_logs

theUtkarshRaj · 2026-05-02T21:45:31Z

Summary

This PR adds an implementation-first prototype for transforming production QA + agentic logs into privacy-safe, trajectory-aware training datasets.

Included

Canonical Langfuse/Pydantic log schema
PII redaction with stable placeholders
Session segmentation
Complexity tagging
Tool consistency validation
LoRA-ready SFT export
DPO candidate generation + hard negatives
Split integrity safeguards (disjoint SFT/DPO/eval)
Gold examples
Validation CLI
CI + tests
Apache-2.0 licensing
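
To make "PII redaction with stable placeholders" concrete for reviewers, here is a minimal sketch, not the code in this PR. It assumes that "stable" means the same raw value always maps to the same placeholder within a shared mapping. The `PATTERNS` table and the `redact_stable` helper are hypothetical names chosen for illustration.

```python
import re

# Hypothetical patterns for illustration only; the PR's real recognizers
# are not shown in this description.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}


def redact_stable(text: str, mapping: dict[tuple[str, str], str]) -> str:
    """Replace PII with placeholders, reusing the placeholder for a repeated value.

    `mapping` is keyed by (entity_type, raw_value) and is mutated in place so
    that later calls (for example, later turns in a session) stay consistent.
    """
    for entity_type, pattern in PATTERNS.items():

        def _sub(match: re.Match, entity_type: str = entity_type) -> str:
            key = (entity_type, match.group(0))
            if key not in mapping:
                # Next index for this entity type: count existing entries + 1.
                index = sum(1 for seen_type, _ in mapping if seen_type == entity_type) + 1
                mapping[key] = f"<{entity_type}_{index}>"
            return mapping[key]

        text = pattern.sub(_sub, text)
    return text
```

Sharing one `mapping` across all turns of a session keeps references consistent, so a phone number mentioned twice stays `<PHONE_NUMBER_1>` in both places.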

Motivation

This aligns directly with the DMP 2026 project goal of converting production logs into safe, trajectory-aware datasets suitable for SFT (LoRA) and DPO workflows, with emphasis on data quality, privacy, and downstream training readiness.

Validation

pytest passing
ruff clean
GitHub Actions CI included

Notes

This is designed as a modular prototype aligned with midpoint milestone goals. If preferred, I’m happy to split this into smaller PRs (core pipeline / validation / DPO extensions) for easier review.

…te CLI)

Adds logs_to_training package: Langfuse-style ingest, redaction, complexity tags, tool consistency checks, LoRA-ready SFT JSONL, DPO candidates with hard-negative templates, disjoint split export, gold_examples fixtures, and pytest coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

theUtkarshRaj · 2026-05-02T22:20:24Z

I built this as an implementation-first prototype covering the full DMP midpoint slice. Beyond baseline ingest/export, it puts particular emphasis on the downstream governance layers:

split integrity (disjoint SFT / DPO / eval)
validation CLI
hard-negative DPO generation
gold example fixtures
CI / packaging

My intent was to complement core prototype directions with stronger training-readiness, quality safeguards, and reviewer tooling. Happy to modularize or split advanced layers if narrower review boundaries are preferred.
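
For readers unfamiliar with the split integrity idea, the following is a rough sketch under stated assumptions rather than the PR's implementation. It assumes splits are assigned per session by hashing a session identifier, and that "disjoint" means no session ID appears in more than one of SFT, DPO, and eval. The function names `assign_split` and `check_disjoint` and the 80/10/10 ratios are hypothetical.

```python
import hashlib


def assign_split(session_id: str, sft: float = 0.8, dpo: float = 0.1) -> str:
    """Deterministically map a session to 'sft', 'dpo', or 'eval'.

    Hashing the session ID (instead of sampling per record) keeps every
    record of a session in the same split across reruns.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < sft:
        return "sft"
    if bucket < sft + dpo:
        return "dpo"
    return "eval"


def check_disjoint(splits: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (split_a, split_b, shared_ids) for every pair that overlaps."""
    names = sorted(splits)
    leaks = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = splits[a] & splits[b]
            if overlap:
                leaks.append((a, b, overlap))
    return leaks
```

A validation step could fail the export whenever `check_disjoint` returns a non-empty list, which catches leakage introduced by manual edits or merged datasets.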

…on; fix placeholder double-redaction

- Custom PatternRecognizer for +91/0091 and bare 10-digit Indian mobiles (scores 0.90 / 0.75) so it wins over US_DRIVER_LICENSE and UK_NHS
- _ENTITY_TYPE_NORM expanded: US_DRIVER_LICENSE, UK_NINO, UK_NHS → PHONE_NUMBER; IN_PAN, IN_AADHAAR → GOVT_ID, so wrong region labels never reach placeholders
- _apply_regex_redactions split on _PLACEHOLDER_RE so the regex pass never re-enters already-inserted placeholder tokens (fixes <X_1>UMBER_1>BER_1> corruption)
- _presidio_redact deduplicates overlapping Presidio spans (keep highest score) to prevent position-based string corruption from multi-recognizer same-span hits
- _ORPHANED_CC_RE strips stranded +CC prefixes left when Presidio redacts only the digit span (e.g. '+91 <PHONE_NUMBER_1>' → '<PHONE_NUMBER_1>')
- New Presidio-specific tests use pytest.importorskip so CI passes without Presidio
- .claude/ added to .gitignore
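
Two of the fixes in this commit message, overlapping-span deduplication and the placeholder-aware regex pass, can be illustrated with a small sketch. This is an assumption-laden illustration, not the PR's code: the `Span` type, the function names, and the exact placeholder regex are hypothetical, and plain tuples stand in for Presidio results so the example runs without Presidio installed.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    entity_type: str
    score: float


def dedupe_overlapping(spans: list[Span]) -> list[Span]:
    """Keep the highest-scoring span among any that overlap."""
    kept: list[Span] = []
    # Visit best score first; ties go to the longer span.
    for span in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


# Assumed placeholder shape, e.g. <PHONE_NUMBER_1>. The capturing group makes
# re.split return the placeholders themselves at odd indices.
PLACEHOLDER_RE = re.compile(r"(<[A-Z_]+_\d+>)")


def redact_outside_placeholders(text: str, pattern: re.Pattern, replacement: str) -> str:
    """Apply `pattern` only to text between placeholders, never inside them."""
    parts = PLACEHOLDER_RE.split(text)
    return "".join(
        part if i % 2 else pattern.sub(replacement, part)
        for i, part in enumerate(parts)
    )
```

Applying replacements to deduplicated spans avoids rewriting the same character range twice, and splitting on the placeholder pattern keeps a later regex pass from matching the digits or letters inside an existing token.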


theUtkarshRaj and others added 3 commits May 3, 2026 03:03

chore: add Apache-2.0 LICENSE, GitHub Actions CI, ruff fix

705c594

Co-authored-by: Cursor <cursoragent@cursor.com>

docs: add Quick start to root README

9be60bd

Co-authored-by: Cursor <cursoragent@cursor.com>

theUtkarshRaj mentioned this pull request May 2, 2026

[DMP 2026]: Logs-to-training pipeline for agentic setups #1

Open

5 tasks

theUtkarshRaj force-pushed the feat/logs-to-training-pipeline branch from 7dad578 to d60d201 May 26, 2026 13:21

test: add coverage for segmenter, complexity, DPO export, hard negatives, synthetic hooks

54abf2c

theUtkarshRaj wants to merge 5 commits into OpenAgriNet:main from theUtkarshRaj:feat/logs-to-training-pipeline
