Processor durability and processor parallelism #144
Conversation
52803fe to
9e49edc
Compare
|
@klaudworks @kamir - pls review so we can merge and prep for a new release soon. |
kamir
left a comment
There was a problem hiding this comment.
Findings
High: pkg/storage/recovery.go:178-210 can skip valid records. The PR only truncates the last segment with CreatedAt <= RestoreTo, but segment CreatedAt is flush time (pkg/storage/log.go:327), not necessarily the first record timestamp. If records before the cutoff are flushed into a segment after the cutoff, exact PITR never inspects that segment and silently drops eligible data. Add a test where segment.CreatedAt > RestoreTo but the first record timestamp is before RestoreTo, then include at least the first post-cutoff segment as the truncation candidate or base candidate selection on record timestamps.
Medium: pkg/storage/recovery_exact.go:159-168 allocates recordLen bytes from untrusted segment data before checking it fits in the remaining reader. A corrupt S3 segment can cause a huge allocation/panic instead of a controlled restore error. Same hardening should be applied to index metadata parsing around pkg/storage/index.go:109-122 for negative/oversized counts.
Medium/docs: docs/operations.md:240-254 still documents restore as segment-granular using segment creation time. PR #144 now claims exact final-segment truncation, and pkg/storage/recovery_exact.go:112-115 intentionally fails on compressed intersecting batches. The operator docs should describe the new semantics and the compression limitation before release.
Checks Run
On PR head 9e49edc:
git diff --check upstream/main...upstream/pr-144 passed.
go test ./pkg/storage passed.
go test ./cmd/kafscale-cli passed outside sandbox due local etcd TCP binding.
go test ./internal/decoder passed in both SQL and Iceberg processor modules.
Summary
This implements #113 by removing an unnecessary
.indexdownload from the processor decode path instead of parallelizing two S3 reads where only one was actually used.It also fixes a PITR bug by making restore truncate the final segment at the requested cutoff and regenerate a matching
.index, rather than only restoring whole segments.While tracing that path, I also tightened PITR/restore behavior in storage so the final recovered segment can be truncated at the requested cutoff and a matching
.indexis regenerated.Changes
.indexin the SQL and Iceberg processor decoders.kfs/.indexpairing validation in discovery.indexafter truncationTesting
go test ./pkg/storage ./cmd/kafscale-cli -count=1 -timeout=120sgo test ./... -count=1 -timeout=120sinaddons/processors/sql-processorgo test ./... -count=1 -timeout=120sinaddons/processors/iceberg-processorChecklist