feat: Add table.maintenance.compact() for full-table data file compaction #3124

Open

qzyu999 wants to merge 7 commits into apache:main from qzyu999:feat-compaction-issue-1092

Conversation

@qzyu999 commented Mar 6, 2026

Closes #1092

Rationale for this change

This introduces a simplified, whole-table compaction strategy via the MaintenanceTable API (table.maintenance.compact()).

Key implementation details:

  • Reads the entire table state into memory via .to_arrow().
    • Note: This initial implementation uses an in-memory Arrow-based rewrite strategy. Future iterations can extend this to support streaming or distributed rewrites for larger-than-memory datasets.
  • Uses table.overwrite() to rewrite data, leveraging PyIceberg's target file bin-packing (write.target-file-size-bytes) natively.
  • Ensures atomicity by executing within a table transaction.
  • Explicitly sets snapshot-type: replace and replace-operation: compaction to ensure correct metadata history for downstream engines.
  • Includes a guard to safely ignore compaction requests on empty tables.
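
The control flow described above can be sketched with plain-Python stand-ins (the table, transaction, and write machinery here are hypothetical placeholders, not PyIceberg's real classes):

```python
def compact(rows, write_fn):
    """Hedged sketch of the described flow: read everything, guard
    against empty tables, then rewrite atomically with REPLACE metadata."""
    if not rows:  # guard: safely ignore compaction requests on empty tables
        return None
    snapshot_properties = {
        "snapshot-type": "replace",
        "replace-operation": "compaction",
    }
    # In PyIceberg this step would run inside a transaction, roughly:
    #   with tbl.transaction() as txn:
    #       txn.overwrite(arrow_table, snapshot_properties=snapshot_properties)
    return write_fn(rows, snapshot_properties)

result = compact([{"id": 1}], lambda rows, props: (len(rows), props["snapshot-type"]))
# result is (1, "replace"); an empty input short-circuits to None
```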

Are these changes tested?

Includes full Pytest coverage in tests/table/test_maintenance.py.

Are there any user-facing changes?

Yes. This PR adds a new compact() method to the TableMaintenance API, allowing users to perform file compaction on existing Iceberg tables.

Example usage:

table = catalog.load_table("default.my_table")
# Merges small files into larger ones based on table properties
table.maintenance.compact()

Edit: It looks like I'm not able to add the changelog label, hopefully someone with permissions can do so.

qzyu999 added 2 commits March 5, 2026 21:32
This introduces a simplified, whole-table compaction strategy via the
MaintenanceTable API (`table.maintenance.compact()`).

Key implementation details:
- Reads the entire table state into memory via `.to_arrow()`.
- Uses `table.overwrite()` to rewrite data, leveraging PyIceberg's
  target file bin-packing (`write.target-file-size-bytes`) natively.
- Ensures atomicity by executing within a table transaction.
- Explicitly sets `snapshot-type: replace` and `replace-operation: compaction`
  to ensure correct metadata history for downstream engines.
- Includes a guard to safely ignore compaction requests on empty tables.

Includes full Pytest coverage in `tests/table/test_maintenance.py`.
Closes apache#1092

# Overwrite the table atomically (REPLACE operation)
with self.tbl.transaction() as txn:
    txn.overwrite(
        arrow_table,
        snapshot_properties={"snapshot-type": "replace", "replace-operation": "compaction"},
    )
Contributor (kevinjqliu):

I think we should have a REPLACE operation instead:
https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/DataOperations.html#REPLACE

We might want to create the .replace() first.

Author (qzyu999):

Hi @kevinjqliu, thanks for the insight. I agree with building a .replace() rather than just reusing overwrite. I've refactored the compaction run to properly use a .replace() API, following the design of the Java Iceberg implementation.

The approach is to create a new _RewriteFiles producer in pyiceberg/table/update/snapshot.py, which uses the new Operation.REPLACE. _RewriteFiles effectively mimics the _OverwriteFiles operation, except that it commits Operation.REPLACE instead of Operation.OVERWRITE. This allows MaintenanceTable.compact() to perform a proper txn.replace() rather than reuse txn.overwrite().

I also think it's worth noting that by adding Operation.REPLACE, we make room for the needed rewrite manifests (#270) and delete orphan files functionality (#1200).

after_files = list(table.scan().plan_files())
assert len(after_files) == 3 # Should be 1 optimized data file per partition
assert table.scan().to_arrow().num_rows == 120

Contributor (kevinjqliu):

Since it's a small result set, we should verify the data is the same too.

Author (qzyu999):

Hi @kevinjqliu, made a change in 6420027 to check that the columns and the primary keys remain the same before/after compaction.
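
One way such a check can look (a sketch with plain row dicts, not the actual test code): sort both sides by the primary key so the new file layout cannot affect the comparison.

```python
def canonical(rows, key="id"):
    # Sort by the primary key so row order (which changes when files
    # are rewritten) does not affect the equality check.
    return sorted(rows, key=lambda r: r[key])

before = [{"id": 2, "val": "b"}, {"id": 1, "val": "a"}]
after = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]  # same rows, fewer files

assert canonical(before) == canonical(after)
```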

return ExpireSnapshots(transaction=Transaction(self.tbl, autocommit=True))

def compact(self) -> None:
"""Compact the table's data files by reading and overwriting the entire table.
Contributor (kevinjqliu):

This should be data and delete files, but generally it compacts the entire table.

Author (qzyu999), Mar 6, 2026:

Hi @kevinjqliu, made the update to the docstring here: 9fd51a8.

qzyu999 added 5 commits March 6, 2026 12:22
Formats the [compact](iceberg-python/pyiceberg/table/maintenance.py) method docstring to ensure the summary line does not wrap and correctly ends with a period, satisfying pydocstyle D205 and D400 rules.
Replaces the use of .overwrite() in MaintenanceTable.compact() with a new .replace() API backed by a _RewriteFiles producer. This ensures compaction now generates an Operation.REPLACE snapshot instead of Operation.OVERWRITE, preserving logical table state for downstream consumers.

Fixes apache#1092
for data_file in data_files:
    append_files.append_data_file(data_file)

def replace(
Contributor (kevinjqliu):

Let's add replace on its own, since it's a pretty significant change, and follow up with table compaction.

I think there are a few more things we need to add to the replace operation; it would be a good idea to look into the Java side. For example, how can we ensure that the table's data remains the same? REPLACE means no data change. If we cannot guarantee that the data remains the same, maybe we should not expose a replace function that takes a df as a parameter.

Author (qzyu999):

Hi @kevinjqliu, I created an issue (#3130) and a corresponding PR (#3131) to address the need to create a separate PR for replace. When approved, we can use that to build and complete this current PR for compaction. We can move this discussion to there and come back when finished.

@EnyMan commented Mar 19, 2026

I have been working on similar functionality for a while as part of my upsert optimization efforts: https://github.com/EnyMan/iceberg-python/blob/rewrite-data-files/pyiceberg/table/maintenance.py#L47. We have used it extensively in our production environment (10K+ rewrites). It is basically a clone of the Java version. I was planning on creating a PR but never got to it until now, and now I see there is already some work being done on it. Note that I use Operation.OVERWRITE instead of REPLACE, though.

@qzyu999 (Author) commented Mar 19, 2026

> I have been working on similar functionality for a while as part of my upsert optimization efforts: https://github.com/EnyMan/iceberg-python/blob/rewrite-data-files/pyiceberg/table/maintenance.py#L47. We have used it extensively in our production environment (10K+ rewrites). It is basically a clone of the Java version. I was planning on creating a PR but never got to it until now, and now I see there is already some work being done on it. Note that I use Operation.OVERWRITE instead of REPLACE, though.

Hi @EnyMan, thanks for sharing your work! I took a look at your code; IIUC, it adds the new files and deletes the old ones, i.e. an Operation.OVERWRITE, as you mentioned. I had done something similar in the beginning, but I now believe there is a flaw in that approach from the Java perspective:

  • OVERWRITE means new data is added to overwrite existing data
  • REPLACE means files are moved and replaced without changing the data in the table

This has impacts for time travel and conflict resolution.

  • If a snapshot is marked as REPLACE, the reader knows that the underlying files were strictly restructured (e.g., compacted from 10 small files to 1 large file) but no new logical records were inserted, updated, or deleted. The reader can safely ignore this snapshot.
  • If you use OVERWRITE for a compaction job, downstream processes may incorrectly perceive the compacted files as new data, potentially leading to duplicate processing.
  • During optimistic concurrency control, Iceberg uses the operation type to determine if two concurrent commits conflict. Because REPLACE strictly promises no logical changes, Iceberg's commit protocol can often safely re-apply a REPLACE operation alongside other concurrent data modifications (provided the specific files being replaced haven't been deleted).
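
The downstream effect described above can be sketched in a few lines (snapshots are modeled here as plain dicts, not PyIceberg's snapshot objects):

```python
# An incremental consumer walks the snapshot history and can skip
# REPLACE snapshots outright, since they promise no logical data change.
snapshots = [
    {"id": 1, "operation": "append"},
    {"id": 2, "operation": "replace"},  # compaction: files restructured only
    {"id": 3, "operation": "append"},
]
to_process = [s["id"] for s in snapshots if s["operation"] != "replace"]
# to_process is [1, 3]: the compaction snapshot is safely ignored
```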

For reasons I believe are related to the examples above, @kevinjqliu requested that we first implement the Operation.REPLACE functionality (#3130, #3131) and then come back to this issue/PR to complete the redesign. Your code appears to include many of the additional features that exist in Java's compaction function. As mentioned in #1092, the initial version of PyIceberg's compaction can start with a basic harness and iterate toward the level of completeness your implementation has in future issues/PRs. Following this logic, once #3130 and #1092 are completed, your code would be quite valuable for quickly implementing compaction and adding those additional features to PyIceberg.

  • Insights were assisted with AI


Development

Successfully merging this pull request may close these issues.

Support data files compaction

3 participants