feat: Add file cleanup for expire snapshots#592
feat: Add file cleanup for expire snapshots#592shangxinli wants to merge 1 commit intoapache:mainfrom
Conversation
0bb8356 to
596c5d2
Compare
Implement the file cleanup logic that was missing from the expire
snapshots feature (the original PR noted "TODO: File recycling will
be added in a followup PR").
Port the "reachable file cleanup" strategy from Java's
ReachableFileCleanup, following the same phased approach:
Phase 1: Collect manifest paths from expired and retained snapshots
Phase 2: Prune manifests still referenced by retained snapshots
Phase 3: Find data files only in manifests being deleted, subtract
files still reachable from retained manifests (kAll only)
Phase 4: Delete orphaned manifest files
Phase 5: Delete manifest lists from expired snapshots
Phase 6: Delete expired statistics and partition statistics files
Key design decisions matching Java parity:
- Best-effort deletion: suppress errors on individual file deletions
to avoid blocking metadata updates (Java suppressFailureWhenFinished)
- Branch/tag awareness: retained snapshot set includes all snapshots
reachable from any ref (branch or tag), preventing false-positive
deletions of files still referenced by non-main branches
- Data file safety: only delete data files from manifests that are
themselves being deleted, then subtract any files still reachable
from retained manifests (two-pass approach from ReachableFileCleanup)
- Respect CleanupLevel: kNone skips all, kMetadataOnly skips data
files, kAll cleans everything
- FileIO abstraction: uses FileIO::DeleteFile for filesystem
compatibility (S3, HDFS, local), with custom DeleteWith() override
- Statistics cleanup via snapshot ID membership in retained set
TODOs for follow-up:
- Multi-threaded file deletion (Java uses Tasks.foreach with executor)
- IncrementalFileCleanup strategy for linear ancestry optimization
(Java uses this when no branches/cherry-picks involved)
596c5d2 to
d3df6e3
Compare
| // Find the ManifestFile for this path by scanning expired snapshots | ||
| for (const auto& snapshot : metadata.snapshots) { | ||
| if (!snapshot) continue; | ||
| SnapshotCache snapshot_cache(snapshot.get()); |
There was a problem hiding this comment.
Performance / Logic Issue: Instantiating SnapshotCache locally inside the nested loop causes O(M*S) I/O complexity. The cache is not preserved across loop iterations, meaning for every manifest path being processed, the manifest list of every snapshot is repeatedly downloaded and parsed from storage.
Suggestion: Consider collecting ManifestFile objects in Phase 1 (similar to Java's Set<ManifestFile> deletionCandidates) or pre-caching them in a map to avoid redundant I/O.
| manifest, file_io, schema_result.value(), spec_result.value()); | ||
| if (!reader_result.has_value()) continue; | ||
|
|
||
| auto entries_result = reader_result.value()->Entries(); |
There was a problem hiding this comment.
Parity / Logic Issue: C++ uses Entries() which includes DELETED entries, whereas Java uses ManifestFiles.readPaths which delegates to liveEntries() (only ADDED and EXISTING).
Details: If a data file is added in an expired snapshot and deleted in a retained snapshot, using Entries() will read the DELETED entry in Phase 2 and subtract the data file from data_files_to_delete. This prevents the physical file from being deleted, resulting in a storage leak.
Suggestion: Change reader_result.value()->Entries() to reader_result.value()->LiveEntries() to strictly match Java's behavior.
|
|
||
| for (const auto& snapshot : metadata.snapshots) { | ||
| if (!snapshot) continue; | ||
| SnapshotCache snapshot_cache(snapshot.get()); |
There was a problem hiding this comment.
Performance / Logic Issue: Similar to line 371, instantiating SnapshotCache here inside the loop repeats the O(M*S) I/O complexity.
| manifest, file_io, schema_result.value(), spec_result.value()); | ||
| if (!reader_result.has_value()) continue; | ||
|
|
||
| auto entries_result = reader_result.value()->Entries(); |
There was a problem hiding this comment.
Parity / Logic Issue: Similar to line 387, use LiveEntries() here instead of Entries() to avoid a potential storage leak.
| // Since Finalize runs before table_ is updated, "after" is base() minus expired. | ||
| std::unordered_set<int64_t> retained_stats_snapshots(retained_snapshot_ids); | ||
| for (const auto& stat_file : metadata.statistics) { | ||
| if (stat_file && !retained_stats_snapshots.contains(stat_file->snapshot_id)) { |
There was a problem hiding this comment.
Inconsistent Behavior: C++ deletes statistics files by checking if the snapshot_id of the StatisticsFile is in the retained_stats_snapshots set. Java computes the set difference of actual file paths. If a statistics file path is referenced by multiple snapshots, deleting purely based on the expiration of a specific snapshot_id could erroneously delete a physical file that is still referenced by a newer, retained snapshot. Diff paths, not just snapshot IDs.
|
|
||
| auto reader_result = ManifestReader::Make( | ||
| manifest, file_io, schema_result.value(), spec_result.value()); | ||
| if (!reader_result.has_value()) continue; |
There was a problem hiding this comment.
Inconsistent Behavior (Data Loss Risk): If reading a retained manifest fails, the C++ implementation silently ignores it (continue). Java uses .retry(3) and .throwFailureWhenFinished(). If we silently ignore a read failure here, we will fail to subtract its live data files from data_files_to_delete, resulting in accidental data loss (deleting a physical file that is still actively used). Failures here should abort the deletion of those specific data files.
| data_files_to_delete.insert(entry.data_file->file_path); | ||
| } | ||
| } | ||
| goto next_manifest; // Found and processed this manifest, move to next |
There was a problem hiding this comment.
C++ Style Issue: The use of goto next_manifest; to break out of nested loops is a non-idiomatic C++ anti-pattern. Consider moving this manifest lookup logic into a helper function (e.g., std::optional<ManifestFile> GetManifestByPath(path)).
| Status ExpireSnapshots::FindDataFilesToDelete( | ||
| const std::unordered_set<std::string>& manifests_to_delete, | ||
| const std::unordered_set<std::string>& retained_manifests, | ||
| std::unordered_set<std::string>& data_files_to_delete) { |
There was a problem hiding this comment.
C++ Style Issue: Using mutable out-parameters (std::unordered_set<std::string>& data_files_to_delete) over Result is a less modern C++ pattern. Consider returning Result<std::unordered_set<std::string>>.
Summary
ReachableFileCleanup