Skip to content

feat: Add file cleanup for expire snapshots#592

Open
shangxinli wants to merge 1 commit intoapache:mainfrom
shangxinli:expire-snapshots-file-cleanup
Open

feat: Add file cleanup for expire snapshots#592
shangxinli wants to merge 1 commit intoapache:mainfrom
shangxinli:expire-snapshots-file-cleanup

Conversation

@shangxinli
Copy link
Contributor

@shangxinli shangxinli commented Mar 16, 2026

Summary

  • Implement the file cleanup logic missing from expire snapshots (feat: support expire snapshots #490 noted "TODO: File recycling will be added in a followup PR")
  • Port the "reachable file cleanup" strategy from Java's ReachableFileCleanup
  • Single-threaded implementation; multi-threaded and incremental cleanup as TODOs

@shangxinli shangxinli force-pushed the expire-snapshots-file-cleanup branch from 0bb8356 to 596c5d2 Compare March 16, 2026 16:20
Implement the file cleanup logic that was missing from the expire
snapshots feature (the original PR noted "TODO: File recycling will
be added in a followup PR").

Port the "reachable file cleanup" strategy from Java's
ReachableFileCleanup, following the same phased approach:

Phase 1: Collect manifest paths from expired and retained snapshots
Phase 2: Prune manifests still referenced by retained snapshots
Phase 3: Find data files only in manifests being deleted, subtract
         files still reachable from retained manifests (kAll only)
Phase 4: Delete orphaned manifest files
Phase 5: Delete manifest lists from expired snapshots
Phase 6: Delete expired statistics and partition statistics files

Key design decisions matching Java parity:
- Best-effort deletion: suppress errors on individual file deletions
  to avoid blocking metadata updates (Java suppressFailureWhenFinished)
- Branch/tag awareness: retained snapshot set includes all snapshots
  reachable from any ref (branch or tag), preventing false-positive
  deletions of files still referenced by non-main branches
- Data file safety: only delete data files from manifests that are
  themselves being deleted, then subtract any files still reachable
  from retained manifests (two-pass approach from ReachableFileCleanup)
- Respect CleanupLevel: kNone skips all, kMetadataOnly skips data
  files, kAll cleans everything
- FileIO abstraction: uses FileIO::DeleteFile for filesystem
  compatibility (S3, HDFS, local), with custom DeleteWith() override
- Statistics cleanup via snapshot ID membership in retained set

TODOs for follow-up:
- Multi-threaded file deletion (Java uses Tasks.foreach with executor)
- IncrementalFileCleanup strategy for linear ancestry optimization
  (Java uses this when no branches/cherry-picks involved)
@shangxinli shangxinli force-pushed the expire-snapshots-file-cleanup branch from 596c5d2 to d3df6e3 Compare March 16, 2026 18:26
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was generated by Gemini.

// Find the ManifestFile for this path by scanning expired snapshots
for (const auto& snapshot : metadata.snapshots) {
if (!snapshot) continue;
SnapshotCache snapshot_cache(snapshot.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance / Logic Issue: Instantiating SnapshotCache locally inside the nested loop causes O(M*S) I/O complexity. The cache is not preserved across loop iterations, meaning for every manifest path being processed, the manifest list of every snapshot is repeatedly downloaded and parsed from storage.

Suggestion: Consider collecting ManifestFile objects in Phase 1 (similar to Java's Set<ManifestFile> deletionCandidates) or pre-caching them in a map to avoid redundant I/O.

manifest, file_io, schema_result.value(), spec_result.value());
if (!reader_result.has_value()) continue;

auto entries_result = reader_result.value()->Entries();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parity / Logic Issue: C++ uses Entries() which includes DELETED entries, whereas Java uses ManifestFiles.readPaths which delegates to liveEntries() (only ADDED and EXISTING).

Details: If a data file is added in an expired snapshot and deleted in a retained snapshot, using Entries() will read the DELETED entry in Phase 2 and subtract the data file from data_files_to_delete. This prevents the physical file from being deleted, resulting in a storage leak.

Suggestion: Change reader_result.value()->Entries() to reader_result.value()->LiveEntries() to strictly match Java's behavior.


for (const auto& snapshot : metadata.snapshots) {
if (!snapshot) continue;
SnapshotCache snapshot_cache(snapshot.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance / Logic Issue: Similar to line 371, instantiating SnapshotCache here inside the loop repeats the O(M*S) I/O complexity.

manifest, file_io, schema_result.value(), spec_result.value());
if (!reader_result.has_value()) continue;

auto entries_result = reader_result.value()->Entries();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parity / Logic Issue: Similar to line 387, use LiveEntries() here instead of Entries() to avoid a potential storage leak.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was generated by Gemini.

// Since Finalize runs before table_ is updated, "after" is base() minus expired.
std::unordered_set<int64_t> retained_stats_snapshots(retained_snapshot_ids);
for (const auto& stat_file : metadata.statistics) {
if (stat_file && !retained_stats_snapshots.contains(stat_file->snapshot_id)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent Behavior: C++ deletes statistics files by checking if the snapshot_id of the StatisticsFile is in the retained_stats_snapshots set. Java computes the set difference of actual file paths. If a statistics file path is referenced by multiple snapshots, deleting purely based on the expiration of a specific snapshot_id could erroneously delete a physical file that is still referenced by a newer, retained snapshot. Diff paths, not just snapshot IDs.


auto reader_result = ManifestReader::Make(
manifest, file_io, schema_result.value(), spec_result.value());
if (!reader_result.has_value()) continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent Behavior (Data Loss Risk): If reading a retained manifest fails, the C++ implementation silently ignores it (continue). Java uses .retry(3) and .throwFailureWhenFinished(). If we silently ignore a read failure here, we will fail to subtract its live data files from data_files_to_delete, resulting in accidental data loss (deleting a physical file that is still actively used). Failures here should abort the deletion of those specific data files.

data_files_to_delete.insert(entry.data_file->file_path);
}
}
goto next_manifest; // Found and processed this manifest, move to next
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ Style Issue: The use of goto next_manifest; to break out of nested loops is a non-idiomatic C++ anti-pattern. Consider moving this manifest lookup logic into a helper function (e.g., std::optional<ManifestFile> GetManifestByPath(path)).

Status ExpireSnapshots::FindDataFilesToDelete(
const std::unordered_set<std::string>& manifests_to_delete,
const std::unordered_set<std::string>& retained_manifests,
std::unordered_set<std::string>& data_files_to_delete) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ Style Issue: Using mutable out-parameters (std::unordered_set<std::string>& data_files_to_delete) over Result is a less modern C++ pattern. Consider returning Result<std::unordered_set<std::string>>.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants