Skip to content

Processor durability and processor parallelism #144

Merged
kamir merged 3 commits into
KafScale:mainfrom
novatechflow:processor-durability
May 21, 2026
Merged

Processor durability and processor parallelism #144
kamir merged 3 commits into
KafScale:mainfrom
novatechflow:processor-durability

Conversation

@novatechflow
Copy link
Copy Markdown
Collaborator

Summary

This implements #113 by removing an unnecessary .index download from the processor decode path instead of parallelizing two S3 reads where only one was actually used.

It also fixes a PITR bug by making restore truncate the final segment at the requested cutoff and regenerate a matching .index, rather than only restoring whole segments.

While tracing that path, I also tightened PITR/restore behavior in storage so the final recovered segment can be truncated at the requested cutoff and a matching .index is regenerated.

Changes

  • stop downloading .index in the SQL and Iceberg processor decoders
  • keep .kfs/.index pairing validation in discovery
  • add regression coverage for the decoder behavior
  • add exact segment truncation for PITR/restore
  • rebuild the final restored .index after truncation
  • add recovery and CLI regression coverage with real batch fixtures

Testing

  • go test ./pkg/storage ./cmd/kafscale-cli -count=1 -timeout=120s
  • go test ./... -count=1 -timeout=120s in addons/processors/sql-processor
  • go test ./... -count=1 -timeout=120s in addons/processors/iceberg-processor

Checklist

  • Added/updated unit tests for new logic
  • Added/updated e2e coverage for bug fixes
  • Added license headers to new files

@novatechflow novatechflow force-pushed the processor-durability branch from 52803fe to 9e49edc Compare May 18, 2026 06:23
@novatechflow novatechflow self-assigned this May 18, 2026
@novatechflow novatechflow requested a review from klaudworks May 18, 2026 06:26
@novatechflow novatechflow linked an issue May 18, 2026 that may be closed by this pull request
@novatechflow novatechflow requested a review from kamir May 18, 2026 08:00
@novatechflow
Copy link
Copy Markdown
Collaborator Author

@klaudworks @kamir - pls review so we can merge and prep for a new release soon.

Copy link
Copy Markdown
Collaborator

@kamir kamir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

High: pkg/storage/recovery.go:178-210 can skip valid records. The PR only truncates the last segment with CreatedAt <= RestoreTo, but segment CreatedAt is flush time (pkg/storage/log.go:327), not necessarily the first record timestamp. If records before the cutoff are flushed into a segment after the cutoff, exact PITR never inspects that segment and silently drops eligible data. Add a test where segment.CreatedAt > RestoreTo but the first record timestamp is before RestoreTo, then include at least the first post-cutoff segment as the truncation candidate or base candidate selection on record timestamps.

Medium: pkg/storage/recovery_exact.go:159-168 allocates recordLen bytes from untrusted segment data before checking it fits in the remaining reader. A corrupt S3 segment can cause a huge allocation/panic instead of a controlled restore error. Same hardening should be applied to index metadata parsing around pkg/storage/index.go:109-122 for negative/oversized counts.

Medium/docs: docs/operations.md:240-254 still documents restore as segment-granular using segment creation time. PR #144 now claims exact final-segment truncation, and pkg/storage/recovery_exact.go:112-115 intentionally fails on compressed intersecting batches. The operator docs should describe the new semantics and the compression limitation before release.

Checks Run

On PR head 9e49edc:

git diff --check upstream/main...upstream/pr-144 passed.
go test ./pkg/storage passed.
go test ./cmd/kafscale-cli passed outside sandbox due local etcd TCP binding.
go test ./internal/decoder passed in both SQL and Iceberg processor modules.

@novatechflow novatechflow requested a review from kamir May 19, 2026 16:14
@kamir kamir merged commit 076a649 into KafScale:main May 21, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: parallelize S3 .index + .kfs downloads in processors

2 participants