Introduce a background worker that batches fdatasync calls across all backends, inspired by the Aether paper's flush pipelining. Under N concurrent committing backends, this reduces disk flushes from N to 1.

Key changes:
- New undo_flush.c/h: flush daemon with ConditionVariable-based group commit. Backends register their highest UndoRecPtr and wait; the daemon wakes, calls pg_fdatasync() once per active log file, then broadcasts.
- Replace pg_fsync() with pg_fdatasync() in UndoLogSync() and fd cache eviction -- UNDO files are pre-allocated via ftruncate(), so metadata sync is unnecessary.
- Track per-backend undo_max_write_ptr in UndoLogWrite() for flush coordination.
- Fallback: if the daemon is not running, backends call UndoLogSync() directly (unchanged behavior).
- Register the daemon in InternalBGWorkers[] and via the ShmemCallbacks request_fn phase (before BackgroundWorkerShmemInit).
The single-segment UNDO log design fails when bulk DML fills the 1GB segment faster than the background discard worker can reclaim space, producing a hard ERROR. Replace this with a multi-segment lifecycle:

    FREE → ACTIVE → SEALED → DISCARDABLE → (deleted)

At most one log is ACTIVE at a time. When it reaches 85% capacity (or at checkpoint if >50% full), it is sealed and a fresh segment activated. At 95% capacity, the allocating backend performs synchronous inline discard and applies graduated backpressure sleep instead of erroring.

New subsystem components:
- UndoLogSealAndRotate(): seals the active log, creates a new segment, and WAL-logs an XLOG_UNDO_ROTATE record for crash recovery
- UndoLogTryPressureDiscard(): synchronous inline discard at 95%
- UndoLogDeleteSegmentFile(): removes fully-discarded segment files
- WakeUndoDiscardWorker(): latch-based wakeup for the discard worker
- Enhanced discard worker with two-phase lifecycle transitions and adaptive sleep (200ms when sealed logs are pending, else the configured nap)
- CheckPointUndoLog() now triggers rotation at checkpoint boundaries
- pg_undo_force_discard(bool) SQL function for manual rotation/discard
- pg_stat_get_undo_logs() now reports a lifecycle state column
- WAL ROTATE record with redo support for crash recovery
- UNDO_WORKER_MAIN wait event for monitoring

All 251 regression subtests and 55 recovery tests pass.
When the executor estimates that a ModifyTable node will process more than 1000 rows, it now signals the table AM via a new begin_bulk_insert callback. The heap AM uses this to switch from per-row UNDO recording (create/insert/free an UndoRecordSet per tuple) to batched mode: a single persistent UndoRecordSet accumulates records and flushes them in batches of up to 1000 records or 256KB, whichever comes first. This amortizes UndoLogAllocate, WAL insert, and UndoLogWrite overhead across many rows. At 100K rows, measured UNDO overhead drops from ~100% to 5-10% compared to plain heap operations.

Key details:
- New TableAmRoutine callback: begin_bulk_insert(rel, options, nrows)
- Heap AM: HeapBeginBulkUndo/HeapEndBulkUndo/HeapBulkUndoFlush manage the persistent UndoRecordSet and batch lifecycle
- heap_insert, heap_delete, heap_update check HeapBulkUndoIsActive() and route UNDO records through the batch accumulator
- Abort safety: a MemoryContextCallback on the UndoRecordSet's context automatically resets the static bulk state on transaction abort, preventing use-after-free in the next transaction
- Executor: ExecInitModifyTable calls table_begin_bulk_insert when subplan->plan_rows > 1000; ExecEndModifyTable calls table_finish_bulk_insert to flush remaining records

All 251 regression subtests and 55 recovery tests pass.
Introduce the FILEOPS deferred-operations infrastructure following the Berkeley DB fileops.src model. Each filesystem operation is a composable unit with its own WAL record type, redo handler, and descriptor. This commit provides the core machinery only -- no specific operations:
- PendingFileOp linked list for deferred operations
- FileOpsDoPendingOps() executor at transaction commit/abort
- Subtransaction support (AtSubCommit/AtSubAbort/PostPrepare)
- WAL resource manager shell (RM_FILEOPS_ID)
- Platform portability layer (fsync_parent, FileOpsSync)
- GUC: enable_transactional_fileops
- Transaction lifecycle hooks in xact.c

Individual operations (CREATE, DELETE, RENAME, WRITE, TRUNCATE, etc.) are added in subsequent commits.
Implement transactional file creation (BDB: __fop_create). Files are created immediately so they can be used within the transaction. If register_delete is true, the file is automatically deleted on abort.

API: FileOpsCreate(path, flags, mode, register_delete) -> fd
WAL: XLOG_FILEOPS_CREATE with idempotent redo (creates parent dirs if missing on standbys).
Implement deferred file deletion (BDB: __fop_remove). Deletion is scheduled for transaction commit or abort, not executed immediately.

API: FileOpsDelete(path, at_commit) -> void
WAL: XLOG_FILEOPS_DELETE (intentional no-op during redo; deletion is driven by XACT commit/abort records).

On Windows: uses pgunlink() with retry on EACCES.
Implement deferred file rename (BDB: __fop_rename). The rename is scheduled for commit time using durable_rename(), which handles fsync ordering on Unix and MoveFileEx with retry on Windows.

API: FileOpsRename(oldpath, newpath) -> int
WAL: XLOG_FILEOPS_RENAME (intentional no-op during redo).
Implement WAL-logged file write at offset (BDB: __fop_write). Data is written immediately using pwrite() and fsynced for durability.

API: FileOpsWrite(path, offset, data, len) -> int
WAL: XLOG_FILEOPS_WRITE with redo that replays the write.

On Windows: uses SetFilePointerEx + WriteFile via pg_pwrite.
Implement WAL-logged file truncation. Executed immediately, with XLogFlush before the irreversible operation (following the SMGR_TRUNCATE pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.

API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.
Implement WAL-logged file metadata operations.

CHMOD: chmod() on POSIX; _chmod() on Windows with limited mode bits (only _S_IREAD/_S_IWRITE; no group/other support).
CHOWN: chown() on POSIX; no-op with WARNING on Windows (Windows uses ACLs for ownership, not uid/gid).

Both execute immediately and are WAL-logged for crash recovery.
MKDIR: Immediate execution using MakePGDirectory(). Registers rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir() (no mode parameter; permissions are inherited from the parent).
RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on POSIX, _rmdir() on Windows.
SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink() (NTFS junction points) on Windows. Registers delete-on-abort.
LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA() on Windows (NTFS only). Registers delete-on-abort.

Both create links idempotently during redo (unlink first if the target exists).
Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.
FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:
- Linux: <sys/xattr.h> setxattr/removexattr
- macOS: <sys/xattr.h> with extra options parameter
- FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
- Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
- Fallback: returns ENOTSUP (the WAL record is still emitted, but applying
it is a no-op on unsupported platforms, keeping the WAL stream portable)
Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
…ctional_fileops GUC

FileOpsTruncate, FileOpsChmod, FileOpsSetXattr, and FileOpsRemoveXattr previously relied solely on UNDO records for rollback, but UNDO application is deferred to the background worker (due to BumpContext limitations during abort). This left these operations without immediate rollback on ABORT or ROLLBACK TO SAVEPOINT.

Fix by registering abort-time PendingFileOp entries that restore the original file state synchronously at transaction/subtransaction abort, matching the pattern already used by FileOpsLink (PENDING_FILEOP_DELETE with at_commit=false).

Changes:
- Add PENDING_FILEOP_TRUNCATE/CHMOD/SETXATTR/REMOVEXATTR handlers in FileOpsDoPendingOps()
- Extend the PendingFileOp struct with data/data_len fields for xattr values
- Register abort-time ops in FileOpsTruncate, FileOpsChmod, FileOpsSetXattr, FileOpsRemoveXattr
- FileOpsSetXattr returns -1 for EPERM/EACCES/ENOTSUP (soft failure)
- Remove the enable_transactional_fileops GUC (fileops are always enabled)
- Wire DDL callers: dbcommands.c and tablespace.c use the FileOps API
- Revert tablespace rmdir to immediate (XLOG_TBLSPC_DROP is sufficient)
- Add a test_fileops module exercising all UNDO rollback paths
- Add 063_fileops_undo.pl TAP test for DDL-level UNDO rollback
- Update regression expected output (guc.out, sysviews.out, fileops.out)
Adds opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.
How heap uses UNDO:
INSERT operations:
- Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
- Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
- On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan
UPDATE operations:
- Write UNDO record with complete old tuple version before update
- On abort: UndoReplay() restores old tuple version from UNDO
DELETE operations:
- Write UNDO record with complete deleted tuple data
- On abort: UndoReplay() resurrects tuple from UNDO record
MVCC visibility:
- Tuples reference UNDO chain via xmin/xmax
- HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
- Enables reconstructing tuple state as of any snapshot
Configuration:
CREATE TABLE t (...) WITH (enable_undo=on);
The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.
Value proposition:
1. Faster rollback: No heap scan required, UNDO chains are sequential
- Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
- UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)
2. Cleaner abort handling: UNDO records are self-contained
- No need to track which heap pages were modified
- Works across crashes (UNDO is WAL-logged)
3. Foundation for future features:
- Multi-version concurrency control without bloat
- Faster VACUUM (can discard entire UNDO segments)
- Point-in-time recovery improvements
Trade-offs:
Costs:
- Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
- UNDO log space: Requires space for UNDO records until no longer visible
- Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed
Benefits:
- Primarily valuable for workloads with:
  - Frequent aborts (e.g., speculative execution, deadlocks)
  - Long-running transactions needing old snapshots
  - Hot UPDATE workloads benefiting from cleaner rollback
Not recommended for:
- Bulk load workloads (COPY: 2x write amplification without abort benefit)
- Append-only tables (rare aborts mean cost without benefit)
- Space-constrained systems (UNDO retention increases storage)
When beneficial:
- OLTP with high abort rates (>5%)
- Systems with aggressive pruning needs (frequent VACUUM)
- Workloads requiring historical visibility (audit, time-travel queries)
Integration points:
- heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
- Heap pruning respects undo_retention to avoid discarding needed UNDO
- pg_upgrade compatibility: UNDO disabled for upgraded tables
Background workers:
- Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
- Rollback itself is synchronous (via UndoReplay() during transaction abort)
- Workers periodically trim UNDO logs based on undo_retention and snapshot visibility
This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.
Implement the UNDO resource manager for B-tree indexes, with a regression test. When a transaction aborts, provisionally inserted index entries are marked LP_DEAD. Includes a zero_vacuum test verifying that aborted inserts leave no dead tuples, and checking index consistency via bt_index_check().
Add src/include/lib/skiplist.h (a lock-free skip list built on PostgreSQL's pg_atomic_* primitives), src/backend/lib/sparsemap.c with its headers (a compressed bitmap for sparse OID/block tracking), and corresponding test modules under src/test/modules/.
Replace two of the three shared-memory hash tables in sLog (SLogTxnHash and SLogXidHash) with a shared-memory skip list and a compressed sparse bitmap (sparsemap), keeping the third (SLogTupleHash) unchanged.

Transaction sLog changes:
- A skip list keyed by (xid, reloid) replaces SLogTxnHash. Entries are ordered by xid then reloid, enabling O(log n) lookups and efficient xid-range operations (all entries for an xid are contiguous).
- A sparsemap replaces SLogXidHash for O(1) SLogXidIsPresent() checks. The hot-path presence check uses only a SpinLock-protected bitmap, avoiding the heavier LWLock entirely.
- A single LWLock replaces the 4 partition locks. The skip list is lock-free by design but is used in SKIPLIST_SINGLE_THREADED mode (no C11 stdatomic dependency) because the pool allocator and sparsemap require external synchronization. A single lock suffices since sLog is only modified on transaction abort.
- Pool allocator in shared memory: a contiguous slab of 260 slots (256 entries + 2 sentinels + 2 margin) with an index-based free-list. The EBR retire callback redirects node deallocation to the pool free-list instead of pfree() (which would crash on shared memory).
- SKIPLIST_MAX_HEIGHT reduced to 16 (supports 65K entries; the pool holds at most 256).

Public API signatures are preserved; callers are unaffected.
…rhead

Consolidate deferred optimization changes across the UNDO subsystem:
- Switch UndoLogWrite to buffer-managed pages with WAL-logged page writes
- Add an XactUndoContext API for transaction-level UNDO record management
- Integrate HeapBulkUndo batching in nbtree UNDO for reduced overhead
- Add UndoWalBatchFlush calls before UNDO log rotation points
- Remove the duplicate UndoRecordAddPayloadParts function
- Add UNDO flush daemon sync integration at commit time
Route all UNDO record generation through WAL: DML operations embed UNDO data directly in their WAL records via the Tier 2 buffer, while batch operations use dedicated XLOG_UNDO_BATCH records.

Includes all production-hardening work: TOAST UNDO support, WAL retention for UNDO replay, dead-backend cleanup, an AM-agnostic Tier 2 buffer API (UndoBuffer*), InRecovery gating, performance GUCs (undo_instant_abort_threshold), benchmark infrastructure, and comprehensive test coverage.
…stic

Remove the per-relation UNDO (RELUNDO) fork entirely -- cluster-wide UNDO-in-WAL supersedes it. Add robust subtransaction integration via SubXactCallback with a dynamically-grown tracking array (no fixed depth limit). Make the Tier 2 buffer AM-agnostic (renamed from HeapUndoBuffer* to UndoBuffer*) so any table AM can use it. Fix chain_prev linking for multi-table transactions. Revert the FORKNAMECHARS and test_relpath() changes that were RELUNDO artifacts.
Add hash_undo.c implementing an UNDO resource manager for hash indexes, mirroring the existing nbtree_undo.c pattern. On transaction abort, provisionally inserted hash index entries are marked LP_DEAD immediately during rollback, eliminating the need for VACUUM to clean up after aborted transactions.

Key changes:
- New hash_undo.c: UNDO record writing (INSERT subtype), apply callback (marks the entry LP_DEAD), CLR generation for crash recovery, desc callback
- hashinsert.c: Write an UNDO record after successful hash insertion when the parent heap table has UNDO enabled (gated by UndoBufferIsActive)
- undormgr.h: Add UNDO_RMID_HASH = 4
- hash.h: Add HashUndoRmgrInit/HashUndoLogInsert declarations
- undo.c: Register HashUndoRmgrInit at startup

The implementation follows the same pattern as nbtree UNDO:
- Gated by UndoBufferIsActive(heapRel) — zero overhead when UNDO is off
- Uses UndoBufferAddRecordParts to piggyback on the heap UNDO batch
- Falls back to a standalone UndoRecordSet when no buffer is active
- Generates CLR WAL records for crash safety
- Handles relation-dropped and block-beyond-EOF cases gracefully
The heap AM no longer participates in the UNDO subsystem. Benchmarks showed 33-64% DML overhead on every committed transaction while only benefiting the abort path (0.1-5% of operations). The correct architecture is a dedicated table AM (RECNO) that does in-place updates with UNDO for version reconstruction.

Heap UNDO removal:
- Delete heapam_undo.c (1,111 lines) and all UNDO calls from heapam.c
- Remove RelationHasUndo()/RelationHasFullUndo() from heapam_handler.c
- Set am_supports_undo = false in heapam_methods
- Remove the enable_undo GUC (the UNDO infrastructure always initializes)
- Remove the enable_undo per-table reloption and the StdRdOptUndoMode enum
- Remove UNDO_RMID_HEAP; renumber the remaining RMIDs
- Remove UNDO code from pruneheap.c, vacuumlazy.c, tablecmds.c, utility.c
- Remove heap-specific regression and recovery tests
- Update index UNDO gating: nbtree/hash now use RelationAmSupportsUndo()

The UNDO infrastructure remains always-active (FILEOPS needs it, and the future RECNO AM will need it). Index AMs (nbtree, hash) activate UNDO automatically when the parent table AM declares am_supports_undo = true.

Adversarial crash tests:
- Add 10 injection points across undoapply.c, undoinsert.c, undoworker.c, xactundo.c, and fileops_undo.c
- New TAP test (065_undo_adversarial_crash.pl) with 6 crash scenarios: crash during abort, during batch WAL insert, during worker discard, during FILEOPS UNDO apply, deep subtransaction chains, and repeated crashes for idempotency verification
Add the RECNO table access method, which implements in-place updates with UNDO-based version reconstruction (HLC timestamp MVCC). The sLog (secondary log) is a shared-memory partitioned hash table that tracks in-progress DML operations, replacing t_xmin/t_xmax/MultiXact in the tuple header.

Key change from the original RECNO patches: replace the recno_slog_entries_per_backend GUC with a dynamic auto-sizing formula (MaxBackends * 256, clamped to [1024, 1048576]) and emergency eviction of committed entries on hash exhaustion. This eliminates the need for manual tuning -- the sLog scales automatically with the connection pool and handles overflow gracefully by evicting entries whose transactions have already committed (safe because committed tuples are visible regardless of sLog presence). The eviction path acquires all partition locks exclusively for the bounded scan (max 128 entries), which is acceptable because eviction is a rare emergency triggered only when every hash slot is occupied by in-progress transactions.

Also includes:
- UNDO_RMID_RECNO (4) resource manager for RECNO rollback dispatch
- index_prune.c: UNDO-informed index pruning infrastructure
- pageinspect RECNO page inspection functions
- Comprehensive regression, recovery, and benchmark test suites
Two bugs caused standby crashes during RECNO WAL replay:

1. Cross-page defrag PANIC (test 003): The redo handler unconditionally PANICs if PageAddItem() fails on the destination page. Without a full-page image, the page may have different free space than expected during redo. Fix: force an FPI on the destination buffer (REGBUF_FORCE_IMAGE) so redo restores the page from the image and never executes PageAddItem(). Add a defensive WARNING + skip as belt-and-suspenders.

2. WAL consistency check failure (test 005): RecnoClearUncommittedFlags() modifies t_flags via MarkBufferDirtyHint() at commit time with no WAL record, causing primary/standby divergence. The recno_mask() function did not mask t_flags or t_commit_ts, so wal_consistency_checking detected the mismatch. Fix: mask RECNO_TUPLE_UNCOMMITTED, RECNO_TUPLE_LOCKED, and t_commit_ts in recno_mask(), following the same pattern as heap's hint-bit masking.
The xl_recno_overflow_write ovf_xlrecs[] array was declared inside the "if (overflow_buffers != NULL)" block scope, but XLogRegisterBufData() only stores pointers to the registered data — it does not copy it. XLogInsert() is called AFTER that block ends, at which point the stack-allocated array has gone out of scope and its memory may be reused, causing garbage data in the WAL record.

On replay, the standby interprets the garbage bytes as overflow headers, reading impossibly large data_len values (e.g., 12MB), which triggers:

    PANIC: RECNO INSERT redo: corrupt overflow data on block 0

Fix: move ovf_xlrecs[] to function scope so it persists until XLogInsert() reads the registered pointers.
The previous mask only cleared the RECNO_TUPLE_UNCOMMITTED and RECNO_TUPLE_LOCKED bits, but other flags (DELETED, UPDATED, SPECULATIVE, HAS_INLINE_DIFF) can also be set on the primary without dedicated WAL records — particularly during VACUUM/DEFRAG operations and commit-time flag clearing in RecnoClearUncommittedFlags().

Zero the entire t_flags field in the mask function so that wal_consistency_checking ignores all flag differences between the primary's backup image and the redo result. This fixes test 005 (wal_consistency), which was failing with "inconsistent page found" during RECNO/VACUUM redo.
The VACUUM path removes dead tuples (ItemIdSetUnused) and then compacts the page before emitting a DEFRAG WAL record. However, the DEFRAG record only encodes the compaction operation, not the tuple removals. During redo, the standby calls PageRepairFragmentation() on a page that still has the dead tuples present, producing a structurally different page.

Fix by using REGBUF_FORCE_IMAGE for the DEFRAG buffer registration. This ensures a full-page image is always included, so redo restores the exact post-VACUUM page state from the primary. The wal_consistency_checking comparison is also skipped (since XLogRecBlockImageApply returns true), eliminating the FATAL "inconsistent page found" error.

DEFRAG WAL records occur only during VACUUM and page-full compaction, so the extra WAL volume is negligible in practice.
…y bug

Merge the RECNO-specific per-tuple sLog (recno_slog.c) into the shared sLog infrastructure (src/backend/access/undo/slog.c), eliminating code duplication and establishing a single AM-agnostic tuple tracking layer.

Fix a critical standby visibility bug: on hot standbys, per-tuple sLog entries were never populated during WAL replay, causing RecnoTupleVisibleHLC() to incorrectly make aborted tuples visible. When RECNO_TUPLE_UNCOMMITTED was set and SLogTupleLookupFiltered() returned 0 (always the case on a standby), the code assumed the inserter had committed. The fix populates the per-tuple sLog during INSERT redo via the new SLogTupleInsertRecovery() function, a recovery-safe variant that silently handles hash-full conditions without ERROR/PANIC. This allows the existing visibility logic to correctly identify aborted transactions on standbys.

All 5 RECNO TAP tests now pass (108 subtests total):
- 001_basic_operations (25 subtests)
- 002_crash_recovery (27 subtests)
- 003_replication (27 subtests)
- 004_concurrent_access (16 subtests)
- 005_wal_consistency (13 subtests)
The meson build already defines USE_RECNO unconditionally. Add the same to the autoconf build system via Makefile.global.in so that RECNO is always compiled on the undo branch regardless of the build system used.

- Add USE_RECNO=1 and -DUSE_RECNO to Makefile.global.in CPPFLAGS
- Add recnodesc.o to rmgrdesc/Makefile (conditional on USE_RECNO)
- Guard the PG_RMGR entry in rmgrlist.h with #ifdef USE_RECNO