Search This Blog

Sunday, October 10, 2021

pbctf 2021 Nightclub Writeup: More Fun with Linux Kernel Heap Notes!

This weekend, I played with DiceGang in pbctf 2021 and we ended up placing 1st. There were so many cool (and absolutely insane) challenges, but school was really busy so I was unable to play as much as I liked. However, I still worked on some challenges and solved the kernel pwnable NightClub by IKEA with noopnoop, and thought it would be nice to make a writeup about it. Feel free to let me know if anything is unclear or incorrect.

In some of our initial analysis, we notice that SMAP, SMEP, and KASLR are turned on, and we are on version 5.14.1. In terms of important kernel hardening options, the SLUB allocator is used without randomization or hardening, and usercopy is not hardened; userfaultfd was disabled as well (but didn't really matter as unprivileged userfaultfd has been disabled by default since 5.11). The nightclub driver itself was a generic character device, and can be interfaced with via ioctls. 

In this driver, there are two main structs: a user request struct and the note struct itself. The note struct in kernel land itself was 0x80 in size, placing it under the kmalloc-128 slab, and a lot of useless padding was placed throughout just to make it more difficult for players to exploit.

Four functions are also provided. If you make an ioctl request with 0xcafeb001, an allocation is made. You specify the size of data to copy in (which is capped up to 0x20), and an offset and 0x10 bytes of “special” data to copy in (both of which is never used again for some reason). The kernel also generates 4 random bytes for the key to store in the note, and returns it to the user. After setting up the struct, it is linked into a doubly linked list with the head located in the module data section with the symbol “master_list." When copying data over, alloc sets a null byte at note->data[size], leading to a trivial null byte poisoning bug.

An ioctl request with 0xcafeb002 triggers a deletion. The user specifies a key to delete a specified note from the list. The driver unlinks it and sets the pointers of the note itself to poison constants before freeing. Edit allows you to specify a key, data, size, and a offset (which is capped at 0x10). This leads to a trivial heap overflow and also has a poison null byte bug afterwards.

Note that in all the functions above, the note pointers during linked list traversal or kmalloc return are checked to be in range of a few constants. Noopnoop and I didn't dig too deeply and aren't completely sure, but just assumed those acted as a pseudo "address sanitizer."

Lastly, 0xcafeb004 is a “leak” function. It just leaks the difference between the addresses of edit_ioctl in the module and __kmalloc from the kernel text itself. While information from this paper and page about Linux memory mappings show that kernel text and modules have known regions (and the first only has 9 bits of entropy with KASLR and the latter only has 10 bits), the difference still doesn't provide us with enough information to avoid bruteforcing, which isn't feasible with remote POW.

Without SLAB randomization or hardening, this challenge should be pretty trivial with the overflow right?

Sadly, the answer was no. Not only was there no easy way to achieve leaks in this driver, but in recent Linux kernel versions, the freelist pointer in SLUB chunks have been moved into the middle of chunks (so a 0x10 byte overflow cannot reach it in kmalloc-128 chunks). In our case, it is at offset 0x40 upon freeing. We could also attempt to attack via msg_msg structure, but arb write would be quite difficult without userfaultfd, and we cannot hit the size and next fields (located at offset 0x18 and 0x20). There is also lack of well documented structs for exploitation in kmalloc-128. subprocess_info used to be in this slab, but has since moved to kmalloc-96; I went through both of these papers (paper 1, paper 2) and was not able to find a useful struct given our overflow limits.

Our first breakthrough was noopnoop's heap massage. He massaged the heap in the following manner to achieve a UAF chunk in the linked list.

First, we want to create a linked list chain of consecutive chunks in the following order (only the arrow in the direction of traversal is showed for simplicity's purposes, but keep in mind that it is a doubly linked list).

Then we poison null byte the next pointer of the second chunk, causing the next pointer to now point to chunk III.

  
Upon freeing the third, we will have UAF capabiltiies.

Even with this massage, there is not much you can do due to the nature of where we can write. Double freeing would impose new issues because of the poison pointers set from unlinking (we do not have leaks to overwrite them with valid pointers). A brief idea we had was to replace the UAF chunk with a msg_msg object, and link it into another slab; then by adding more msg_msg objects of different sizes, we can have the driver's linked list traversal end up in another slab. However, this was unfeasible at this point because we cannot apply this to overwrite other chunks in larger slabs, so the only option was to target kmalloc-64 with this technique (we did not know that kmalloc-96 existed at this point), but the writeable offsets combined with the chunk alignment made it difficult to find a structure to target properly.

Our second breakthrough came once I realized that kmalloc-96 existed (we somehow both thought it didn't exist); I guess the moral of the story here is to always check /proc/slabinfo. With this knowledge, we managed to solve pretty quickly via the following methodology.

First ,we performed the same heap massage mechanism to achieve a freed chunk in a linked list. Then, we replaced that chunk with a msg_msg object in the kmalloc-128 slab, and added two more msg_msg objects in the kmalloc-96 slab. This way, you get msg_msg objects (along with the msg_queue in kmalloc-64) linked into your list.


Since we control the contents of the msg_msg object over where the key is stored in the note structs, we can easily lead the driver to find the location we want. Due to the misaligned nature between 0x60 sized objects and 0x80 objects, we can perfectly align the OOB writes with offsets to change the size field of the second msg_msg object in kmalloc-96 as I discussed in my last post. Now, we can allocate a subprocess_info struct by triggering call_usermodehelper_setup from module autoloading, which as many previous research posts have been discussed, can just be triggered with socket(22, AF_INET, 0);

Take note that the offset write is really important. If you just do a generic overflow, you will end up corrupting msg_msg pointers. This kernel did not have the checkpoint_restore option compiled in, so MSG_COPY will not work, and any attempt to unlink without known kernel addresses will lead to a panic. Now, if we message receive from our queue, we can achieve kernel leaks!


Now, with a leak, the most direct approach is to achieve a UAF overlap as done previously, create a double free, and freelist poison to modprobe_path to bypass SMAP. I sprayed a lot more chunks to help me reach a cleaner region, and then redid the same unlinking approach. However, after freeing chunk 3, I used chunk 2 to overflow into chunk 3 and fix its pointers to go back to the master_list. This way, upon the next free, we will not have a kernel panic. The address for master_list can be trivially calculated after a kernel leak combined with the 4th ioctl option.

We now freed another chunk in kmalloc-128 (I saved a msg_msg object in another msg_queue for this), then freed the UAF'd note in our linked list again. With the LIFO behavior of the kernel allocator, I allocated a new msg_msg object, overwrote the freelist pointer with modprobe_path, allocated 2 more messages, and finally allocated a new msg_msg object to then overwrite modprobe. Any attempt to execute an invalid executable will trigger the kernel execution of the program specified in modprobe_path, effectively giving us RCE with root privileges.

Here's the final exploit with comments:


And here's our success on remote!


Afterwards, I talked to the author and another player (pql) and discovered that noopnoop and I completely unintended this challenge. We just assumed that a limited OOB combined with a poison null byte was enough. The sections of the code that we brushed aside (and labeled a pseudo ASAN) actually provided a primitive to bruteforce and probe for both kernel and module base via partial overwrites on a given chunk's prev address.

Overall, I had a lot of fun with pbctf and working with noopnoop and the rest of DiceGang. I am definitely looking forwards to pbctf 2022, and hope I have more time next year to play even more challenges! 


Thursday, August 26, 2021

corCTF 2021 Fire of Salvation Writeup: Utilizing msg_msg Objects for Arbitrary Read and Arbitrary Write in the Linux Kernel

In corCTF 2021, D3v17 and I wrote two kernel challenges utilizing a technique that is novel at least to our knowledge to gain arb read and arb write in kernel land: Fire of Salvation and Wall of Perdition. A famous kernel object often abused for heap sprays is the msg_msg struct, which is an elastic kernel object meant for IPC purposes (System V message queues) that has a size ranging from the kmalloc 64 to the kmalloc 4k. There was also a recent CVE exploit writeup by Linux kernel developer and security researcher Alexander Popov in which he abused msg_msg for arb read in his exploit for CVE-2021-26708. D3v17 and I read this, and posed the question to ourselves, is it possible to achieve arbitrary write in this across any valid slab for msg_msg? After a week or two of digging around, not only did we discover a way to achieve arb write on kmalloc 4k slabs, but we also discovered a way to do this for any valid msg_msg slab. In this post, I'll detail the Fire of Salvation writeup, which covers arb write on kmalloc 4k slabs. I'll also provide a tldr with the insights for any valid msg_msg arb write with a summary of my approach for the second challenge, but D3v17 will detail that out in his post for Wall of Perdition. Since this writeup is quite long, feel free to let me know of any unclear explanations or mistakes.


In this challenge, the following key protections were enabled on a 5.8 kernel: FG-KASLR, SLAB_RANDOM, SLAB_HARDENED, and STATIC_USERMODE_HELPER. The SLAB allocator was also being used, with a corresponding kernel.config file provided with all the extra other tidbits and miscellaneous hardening options (such as enabling the userfaultfd syscall, hardened_usercopy, CHECKPOINT_RESTORE, etc.). SMAP, SMEP, and KPTI being on was a given. Also, since our goal was to introduce players to a novel exploitation technique, we didn't really care much about the reversing procedure to find a bug, and didn't want to complexify the bug. In our discussions, we decided to just make the bug a pretty obvious UAF that limits them to around 0x28 to 0x30 of UAF write. (no UAF read). This was the source we provided to all the players (we were also nice enough to give out a vmlinux with debug symbols and structs):


To summarize, a simple firewall driver was created, with a separate array for inbound rules and outbound rules (confined to ipv4). From userland, people can interact with the netfilter hooks via a misc device, using the user_rule_t struct, which gets transfered to the rule_t struct to go into the kmalloc-4k slab. Each rule is allocated in the array via kzalloc, and contains information about rule name, interface name, IP address, port, the action to take, the protocol (TCP or UDP), a duplication flag, and a large buffer for description which one cannot modify after allocating it. Most of these fields have validity checks (actions are only limited to DROP or ACCEPT), and add, edit, and delete rule are both safe. The only bug is in duplicate, which duplicates a rule from inbound to outbound or vice versa; an obvious UAF scenario occurs when you duplicate a rule from array 1 to array 2, and then delete it from one array without deleting it from the other.

Exploitation wise, there are a few serious roadblocks that would prevent common exploit paths and necessitate the need for good arb read and arb write primitivies. The fact that it is using the SLAB allocator means that no freelist pointer will be on the chunks themselves (and even if they were, they probably won't be within the 0x30 UAF region as the Linux kernel have moved them down for certain slabs). FG-KASLR will complicate the ability to overwrite function pointers (such as the one on the sk_buff struct's destructor arg callback in the CVE writeup), as most gadgets not in the earlier parts of .text will be affected; ROP is still possible, but I believe that would entail first arb reading the ksymtab for the function for whichever the gadget is relative to. Lastly, with STATIC_USERMODE_HELPER (and its path set to “”), the classic SMAP bypasses of targeting modprobe_path or core_pattern no longer work. The path itself is now located in a read only section of the kernel according to readelf. At this point, the most direct way to then bypass SMAP is to probably arb read the doubly linked list of task structures to find the current task, and overwrite the cred pointer to one that would give us root privileges. A physmap spray would be another common approach, but that's just painful.

Do note again that the vulnerability only gives a small window of UAF write, without UAF read. Once you allocate a few chunks to help smooth out the SLAB shuffling on the current slab, we can begin the exploitation procedure. Let's first take a detour into msg_msg (these manpages can be quite helpful). I do consistently use the IPC_NOWAIT option to avoid hangs and a msgtyp of zero to pull from the front of the msg_queue. For reference, here is the msg_msg struct:


In our case, the security pointer will always be null, since there is no SELinux.

Looking at do_msgsnd:


Before enqueing this msg onto the specified message queue, load_msg is called to usercopy data into the chunk.


Note how it calls alloc_msg to allocate space on the kernel heap for the incoming message beforehand.


msg_msgseg is simply a struct that has a next pointer with data following immediately afterwards. Both DATALEN_MSG and DATALEN_SEG are basically page size minus the size of their respective structs, so their maximum size fits exactly in the kmalloc 4k slabs. Once you send in larger messages, a linked list is created. The maximum size of a message is determined by /proc/sys/kernel/msgmax, and its default is 8192 bytes (so you can get a linked list of 3 elements at most).

Now, let's take a look at do_msgrcv:

If MSG_COPY flag is set (which is availble with CONFIG_CHECKPOINT_RESTORE), it calls prepare_copy, which is just a wrapper around load_msg to prepare a copy (interestingly enough, doesn't using that make it also usercopy whatever is in the userland buffer for recieving into the allocated copy first?). Without that flag, it simply traverses the queue and will unlink the msg after finding it with find_msg. Otherwise an unlink doesn't happen. copy_msg is called, and data is transferred to the copy via a memcpy (note the memcpy, this is very important!).


If no errors have happened yet, it reaches the msg_handler call, which is actually do_msg_fill if you trace the code path from the call to do_msgrcv. do_msg_fill is basically a wrapper around  store_msg, which does the following:


Basically, it traverses the linked list structure of the msg_msg object, and uses usercopy to bring the desired message back to userspace. Then, do_msgrcv calls free_msg, which frees the linked list structure in order (starting from head, then next, etc.).

Now, let us think about abusing a UAF over these elastic objects for arb read and arb write. By modifying the next pointer or the size, arb read via msg_msg should be quite trivial, except for the fact that unlinking it from the queue (unless you can somehow skip modifying the first few qwords which you can't in this challenge) would destroy it. You can try to modify size so maybe it can leak more data from the next segment of the msg_msg object, but hardened usercopy would stop you dead in your tracks. However, this is where MSG_COPY comes into play. Not only does it not unlink your message, but it also uses memcpy for the initial copying of data! So now, we can happily modify the next pointer and change the m_ts field. This technique has already been documented in Popov's CVE writeup. The only restriction is that your next segment has to start with a null qword to avoid kernel panics or having it go somewhere you do not want it to go to.

How would we approach arb write then? This is where every Linux kernel exploit developer's good friend userfaultfd comes back (rip to the new unprivileged userfaultfd settings from 5.11 and forwards). During the msgsnd process, if you manage to have a UAF over the first part of the msg_msg object, you can have it copy over data for a message request that requires more than just one allocation. Then, if you abuse userfaultfd to hang the copy right before it pulls the value of the next pointer (such as when it's a few bytes away from copying everything into the first chunk), you can use the UAF to change this next pointer, and you can achieve arbitrary write once you release the hang! Of course, just like arb read, this requires the target region to start with a null qword. To make this clearer, take a look at the following diagram:


Now with an understanding of both primitives, what will the exploitation path be from where we left off? Well, as I discussed in my hashbrown writeup from diceCTF, to get a kernel base leak with FG-KASLR on, we need to rely on pointers into kernel data as those are not affected. It's as simple as just spraying a ton of shm_file_data objects in kmalloc-32 (which has pointers to kernel data such as in the init_ipc_ns field), and allocate a msg_msg such that it will take up a 4k slab and a kmalloc-32 slab as its next segment that is under our UAF's control. Then we can expand the size (without causing it to traverse more than once), and use MSG_COPY to get a OOB read in the kmalloc 32 slabs and achieve leaks. Of course, before we do that, we have to make sure our data sent via the ioctl follows the format (ip, netmask, etc.), but that is trivial to implement. 


Then, we can re-abuse this arb read to read the task_struct linked list starting from init_task. Since our exploit is probably the latest process running on a generally idle qemu system, we can just walk from prev and hit our task_struct pretty quickly. While a normal trick to find offsets in task_struct is to use prctl SET_NAME to set the comm member (as ptr-yudai detailed in his Google CTF Quals 2021 Fullchain writeup), we provided vmlinux with debugging symbols and structs. The task doubly linked list is located at an offset of 0x298, with a consistently null qword beforehand, the pid is at offset 0x398 (which is close enough for this region to be treated as a single msg segment), while the real_cred and cred pointers are at offset 0x538 and 0x540, also luckily with a nice null qword beforehand. 


Once we find the current process based on the pid in task structs, we can just arb write to the real_cred and cred pointer, replacing them with init_cred and effectively pwning the system.


Here is my exploit for reference:


With this exploit, you can reliably pop a root shell and allow us to read the flag:

For some reason, musl wasn't working well with threads on the qemu, so I had to use gcc static. Luckily, upx packer managed to cut the size down by more than 60%, though that is still much larger than a musl compiled exploit. Congratulations to Maher for taking the unofficial post-CTF first blood on this challenge. A shout-out goes to team SuperGuesser as well, I heard they were quite close to solving during the CTF as well.

As for the Wall of Perdition challenge, I will only provide a brief summary of my solution, which I believe might slightly differ from D3v17's. His post on this part will be much more detailed, with many more diagrams to come.

In this second part, the sizes are limited to kmalloc-64; I wasn't aware of any commonly known abusable structures in this range. While arb read is still quite trivial thanks to MSG_COPY, using it to get a kernel base leak with FG-KASLR is not as easy. Arb write becomes even harder as well. 

To get the same shm_file_data kernel leak, I first allocated two message queues, and I would like to find the address of one of them (I will call this one front queue). Finding its address is easy, as we can just spray msg_msg structs in the same queue (beware that there is a limit set by /proc/sys/kernel/msgmnb that defaults to 16384 bytes total per queue), and abuse OOB read via MSG_COPY to check for known msg_msg contents from the spray. 

This leak will be useful to avoid crashes and prematurely stop the traversal in the msg_msg queue linked list during do_msgrcv when MSG_COPY isn't set. Then I sprayed more msg_queue allocations, and sent in a message for each, as I am trying to look for a message and queue I can reach with the OOB read from the front queue in kmalloc 64 slabs. From here, I can send in a msg_msg object that chains a 4k chunk with a kmalloc 32 chunk into this target queue, and leak its address based on OOB reading the linked list structures of its previous kmalloc 64 msg_msg object. 

At this point, we have to pray that the qword before that 4k chunk is null. If so, replacing the next pointer of the UAF'd chunk in the front queue to that location will allow us to get the address of the kmalloc 32 chunk. After spraying many more shm_file_data objects, we can just expand the msg_msg struct under our UAF's control to a much larger size, replace this very object's next pointer to the address of the kmalloc 32 chunk, and just dump huge portions of the kmalloc 32 with MSG_COPY for a kernel base leak. From here on out, arb reading the task struct to find the current struct is pretty much the same as in the previous exploit.

Now, for arb write, I reused the target queue earlier, cleared the large msg out from it, and replaced it with a msg_msg object that has two 4k chunks in its chain (this is getting quite close to the default msg_msg size limit). I can abuse the previous technique to then leak the address of this new 4k msg_msg object along with the address of its msg segment. Then, I freed this large object with msgrcv (and we will get them back in the order of leaked segment address and then the leaked msg_msg object due to LIFO). I msgsnd again a size of a message that requires two 4k chunks, and hang it with userfaultfd on load_msg, and quickly arb free its segment via the msg_msg under UAF control in the front queue via msgrcv. No crashes will occur since I fixed its pointers to go right back to the front queue itself.

Upon this arb free primitive, I send another message in another message queue that was previously allocated; it will also be hanged when reading in userland data. The object will just be of enough size to cover a 4k chunk (to get back the last freed chunk due to LIFO) chained with a small segment linked along. I let the data transfer from the original hang to continue, which gives me the ability to overwrite the next pointer of the currently allocated msg_msg object, thereby giving me arb write once I let this second hang continue and finish off. This might seem quite insane, but I promise you that D3v17's blogpost will make this quite clear with his diagrams.

Here is my final exploit, which only had about a 50% success rate.

Running this remotely should eventually give us root privs and the flag:

Congratulations to jass93 of RPISEC for taking the unofficial post-CTF first blood on this challenge. The player didn't exactly go down the intended route of abusing an arb free primitive, but rather went on to abuse the unlinking mechanism in msgrcv, which seems quite powerful too. Overall, D3v17 and I learned a lot from developing these challenges, and believe that these primitives will be quite applicable on most kernel versions given a somewhat sizeable UAF. The changing in default unpriviledged userfaultfd permissions definitely deprecate these primitives, and the potential idea of a SLAB quarantine mechanism would render this technique pretty much useless for non 4k slab sizes due to its reliance on LIFO behavior. Another really interesting kernel hardening measure we accidentally stumbled upon was usercopy whitelisting, in which slab regions have whitelists on what usercopy can interact with. Luckily, fallback is enabled by default and on most Linux distros, but if it was enabled, our SMAP bypass methodology would have failed. Thankfully modprobe_path and core_pattern still exists, but we wonder what other SMAP bypass techniques would be just as simple and elegant. Hopefully, you would have learned a lot from this writeup, and make sure to read D3v17's in depth and amazing writeup for Wall of Perdition!

corCTF 2021 vmquack writeup: Writing a Custom Binary Ninja Architecture Plugin to Devirtualize a Crackme

vmquack was a CISC VM reversing challenge I wrote for corCTF 2021 loosely based on x86_64 as I have been expanding my skillset beyond just pwn recently. It utilized a few anti-debugging tricks inspired by this pdf, and just for fun, I decided to play around with Binary Ninja's API to write a disassembler for the vmcode with graph view, and even lift it to have decompilation capabilities. Since this is a reversing challenge (and as an author, there is a ton of authorship bias since I'm basically reversing something I wrote source for), I will do my best to present the important pieces of information and try to present it in a way to replicate how someone would approach this challenge.

Initial analysis lets us know that this is a stripped static ELF. Though often seen as an annoyance in reversing challenges, I pretty much had to link it statically to get the crackme to work in a stable manner across different distros. Regardless, my challenge made very little use of library functions and main is trivial to get to from the entry point. Running the binary shows the following:


Trying to run it in gdb results in the following:

Running in strace results in the following:

Already, there is some anti-debug, but these are pretty trivial to bypass - they are in the constructors before main which you can find addresses for in the .init_array. The first one is just setting up another trap signal handler, and since gdb won't use it, the program goes down an alternative execution path that leads to an exit. The second one is just a classic fork, PTRACE_ATTACH PTRACE_ATTACH PTRACE_TRACEME PTRACE_TRACEME anti-debug. If any of these calls return an unexpected value (such as -1 on the first call or 0 on the second call of the same ptrace option), it sends a sigkill signal to the parent thread (the prctl(PR_SET_TRACER, PR_SET_PTRACER_ANY) was needed to ignore the default ptrace_scope values in yama LSM). These mechanisms are quite easy to bypass, but I also revirtualized the second anti-debug technique with 2 more anti-debug/patching techniques hidden in the virtualized code region, which are seen as quack_quack_data at 0x13370000 and quack_quack_text at 0x31337000 in memory and elfparsers.

Looking at main at 0x402d01, another function that is commonly used is 0x402dc7, which is what writes out the program strings (all the program strings were encrypted by an 8 byte xor key and stored in a struct containing the length, the 8 byte xor key, and a pointer to the encrypted string). 0x40299e is the beginning of the vm, and afterwards, the program checks the return value. If it is 0, it is considered a success, else a failure.


Now, we can start reversing the vm handler itself. I'll just highlight a few important parts and emphasize important parts of the logic rather than going through every single handler. It's important to note that the vm handlers were all written with handwritten inline assembly, so a decompiler won't be of much use here.

Looking at the vm start:


As you might have seen in main, rdi carries the location of the vm_context struct, with rsi acting as a pointer to the vmcode, rdx being a pointer to the mmap'd region (which acts as the vm stack), with the next argument representing the vm stack size. As mentioned earlier, this VM is based on x86_64, so the offsets in the context struct correspond to the vm registers that represent their x86_64 equivalent (even the rflags register is included). Although unclear for now, offset 0 of the context struct is a spot for immediate operands.

The rsi register is used as the vm instruction pointer, r15 points to the dispatch table of handlers, and rdi points to the vm context. The following information will become apparent later on as I go through more vm handlers, but rax is utilized as an idx for jump tables and sometimes as a temporary variable, rbx is used to hold information about instruction operating size and addressing modes, r8 and r9 are generally used for destination and source for vm opcode handling, and r10 to r14 are utilized as temporary variables. vmquack supports immediate to register, immediate to memory (where memory address is in a register), register to register, register to memory, memory to register, and memory to memory addressing modes, which is different from x86_64. Operands must match in size, with supported sizes being BYTE, WORD, DWORD, and QWORD.

Now, let's look at the main dispatch handler:


It uses the lodsb instruction to automatically adjust rsi and store the instruction opcode into rax (effectively adjusting our vm rip for us). Notice how it does check the vm opcode is within the dispatch table; if it fails, it jumps into 0x401c71, which is just a ud2 instruction. This jump to ud2 theme is repeated throughout to help guide reversers to understand valid instruction formats.

Now, let's look through the instruction handlers. I'll start by detailing some of the special handlers.

Opcode 0:


At 0x402a14, it just restores all the pushed registers in the vm starting handler, and then returns. This represents an halt opcode, and allows the main program to know the return value of the vm based on the virtual RAX register.

Opcode 1:


This is a syscall handler. This can also be a great way to help us identify and confirm our hypothesis about which offset in the vm context matches with which x86_64 register (the entire VM is basically baby x86_64 with some differences in terms of what operands are expected and valid addressing modes). The return value of the syscall is also stored, before jumping back to the main dispatch handler.

Opcode 3:


This function moves sets the real registers accordingly as if it's going to make a function call according to SYSVABI convention. It also reads a byte after the instruction opcode and treats it as an index. The stack is changed before being restored back later on (for additional arguments that go beyond the number of given registers for function call arguments, although alignment isn't guaranteed here). It also makes a call based on a function table at 0x4b0be0; this is basically a call of the real program's functions. Going through the function table based on known constants and strings, index 0 is for a function that sha256sum verifies portions of the binary against patching, index 1 is malloc, index 2 is free, and index 3 invovles some AES decryption (as you will see later, this is called with your input as arguments when the flag checker passes to give you the flag). Invalid indices do not go to a ud2 in this case, but rather just returns -1 to the vm.

The last special opcode is opcode 32. This does nothing but jumps back to the main_handler, hence it is just a nop.

Before I continue looking at other opcodes, there are a few operand handlers that I would like to discuss.

At 0x401dfc, we see the start of the following function (zoomed in as the graph view is very large):


From this instruction, we can determine that the byte proceeding after most opcodes contain information about the addressing modes (as well as size which you will see later on). Take note how addresses are constantly pushed, then a jump to elsewhere happens. At most of those places, it will end at a ret to return back to that pushed location in the previous handler. I wasn't sure if there was terminology for this, but Asphyxia mentioned to me while working on this that this is obfuscation with defer-like control flow.

We notice that if the first bit is set in the second byte, it jumps to a region that looks like this:


Based on the bits set in the previously consumed byte, it either reads in a byte, a word, a dword, or a qword (as the vm instruction length is variadic). This is the vm consuming some intermediate value. It also moves the value into the r9 register. The value is stored in the immediate register offset in the vm context when it returns to 0x401d7f and r9 will hold the pointer to that location (this will help with the consistency for operating between immediate/register/memory for instruction handlers). If it didn't go into this immediate branch, it goes and consumes a register, which it determines by consuming the next byte, decrementing, and comparing (the correct offset's address will then be loaded into rax, which then gets transferred to r9):


Do note that before the register selection, it also determines whether this register will be used as just a register, or as a pointer to memory. It does so in the push of the address onto the stack which the register cases then return too.


In this case, 0x401d9e represents the case of a plain register, while 0x401dba represents a memory operand based on the register (their disassembly makes that very clear).

Lastly, notice how at the beginning of this handler, the push will make it return 0x401e32, which just sets r8 equal to r9 (r9 held the value of intermediate/register/memory), before returning to the opcode handler that called it. This entire handler is just for determining the addressing modes used for unary operands.

Another commonly used operand handler is at 0x401e3c. It basically does similar things, but only handles registers (and registers pointing to memory) with the 3rd byte in the instruction, and stores results into the r8 register. It's initial push however causes it to return to 0x401e63, which behaves the same as unary handler on the 4th byte (and allows for immediates as well). However, this time the values in r9 aren't moved into r8, and then after these handlers finish, it goes back to the opcode handlers that called it. Overall, this is a binary opcode operand handler. As you will see later on, r8 holds the destination, while r9 holds the source equivalent in x86_64 during the instruction handlers.

The last used operand handler for instructions is at 0x401e9c which does the same thing as the first part in the binary opcode handler, but then goes into 0x401ec7, which just lodsb a byte and moves it into rcx, and then returns into handlers. As one will see later (or can perhaps even guess based on x86_64 behavior), this is a shift opcode handler (but only 1 byte immediates are allowed for source operand).

From the above conditionals for register selection and size, we can determine the following information. The second byte holds addressing mode and size information. The 0th bit represents the use of an immediate source, the 1st bit represents the use of a register source, the 2nd bit represents the use of a memory source, the 3rd bit represents the use of a register destination, the 4th bit represents the use of a memory destination, the 5th bit represents byte sized operations, the 6th bit represents word sized operations, the 7th bit represents dword sized operations, and the default operation size is for qwords. Moreover, when a byte is used to determine the register used for register or memory referencing based on register value, the following values are used: 0 is RAX, 1 is RBX, 2 is RCX, 3 is RDX, 4 is RSI, 5 is RDI, 6 is R8, 7 is R9, 8 is R10, 9 is R11, 10 is R12, 11 is R13, 12 is R114, 13 is R115, 14 is RBP, 15 is RSP, and 16 is RFLAGS.

With these operand handlers for opcodes finished, we can go back to the instruction handlers.

Opcode 2:


This opcode calls the unary operand handler, and only accepts either a qword register or memory operand, or a byte or dword operand. When taking a register or memory operand, it directly sets the vm rip to it. In the other case, it adds it after sign extending. Also note that other cases lead to a ud2, and that stack offset is decremented by 8, with the current vm rip opcode (this will be pointing to the next vm instruction from the current one as rsi is adjusted from lodsb) stored in it. This is a vm call instruction. Opcode 7 is a vm jmp, and does the same thing (except for the push of the next instruction address.

Opcodes 8 to 13 are conditional jumps, and only take relative immediate byte or dword offsets from the next vm instruction. It also stores the real program's flags, before using the vm's flags register to make the determination of the jump before restoring the original flags.


  Opcode 4 is another important one:

If you remember opcode 2, this is popping from the vm stack and restoring the vm rip to it, making this a ret instruction.

Opcode 5 and 6 are quite important. Here is opcode 5:

        
Notice the stack adjustment, and how it stores the result from the unary handler onto the vm stack. If the operating size is a qword, then immediates aren't allowed. Otherwise, only immediates are allowed and they are sign extended. This is the vm push instruction

Opcode 6 you can probably guess:

     
Only the qword operand size is allowed, the stack is deremented with the value stored in the specified unary operand. This is a vm pop. Technically, popping into the immediates is allowed by this logic, but I just never did that in my vm code :^)

Opcode 30 is another key handler.

 
This is the vm mov instruction, which the vmcode itself ended using a lot.

Lastly, I will go over 2 more instruction handlers, one unary instruction and one binary instruction.


This is a vm dec instruction. Notice how the vm flags are set accordingly by switching the program's rflags register to the vm's rflags register, and then restored back accordingly. Also take note of the extra branching for when the case the operand size is dword and the destination operand is a register, as it behaves the same as x86_64 and zeroes out the upper 32 bits.

Opcode 22 is an example of a binary operation.


This is a vm xor, with the source being in r9 and the destination being in r8. Notice how it sets the vm flags and handles register destination of size dwords similarly to the previous unary operation.

The rest of the vm handlers are pretty similar, as knowing what they do is quite obvious (since the instruction is in the handler). A few noteworthy exceptions is that vm_shl and vm_shr use the shift operand handlers as mentioned earlier, and that mul, imul, div, and idiv handle the vm behavior just like how it would handle it in x86_64 (with the exception that now imul and idiv only have the one operand form). There are probably a few bugs throughout the vm handlers (such as in idiv) because writing a sizable vm reversing challenge in manual assembly probably isn't the best thing to do a few days before a CTF starts.

At this point, the handlers should be marked accordingly:


Now, we can write a disassembler to help us figure out what the vmcode is doing. For this section, I decided to experiment with Binary Ninja's API and wrote a crude plugin (source linked at the end of this post); my teammate hgarrereyn on DiceGang always makes Binary Ninja plugins for vm reversing challenges, so this was where I took some my inspiration from. To make my plugin work seamlessly on the dump of vmdata and vmcode data I made, I just broke it into a file format starting with the string vmquack, followed by addresses and respective lengths of each vm section. Disassembly, control flow graphs, binary view, and section permissions are all detailed with this wonderful post from Binary Ninja. As for the rest, you pretty much have to rely on other Binary Ninja plugins (such as their Z80 plugin and a community riscv plugin) with their python api doc. For most of the LLIL lift, you can rely on this page, and just append LLIL instructions to the given IL array for the get_instruction_low_level_il method in the custom architecture class. When implementing the lift, I found it helpful to break each operand into its own LLIL expression, before applying the overall LLIL expression on top. One thing to note is about flags lifting. I didn't do this part very accurately (just enough to generally work), but Binary Ninja has some very basic documentation for it in their dev branch of release. You can create your own (which I didn't) by setting a SpecialFlagRole and overriding behavior in get_flag_write_low_level_il, but they also do handling for some common flags for very few instructions (add, sub, neg). Applying calling convention is another important task for HLIL, but the docs for those are quite self-explanatory; I just set a default calling convention and one for syscalls.

Now with the plugin, I'll provide a brief overview of the VM (graph view and the ability to look between LLIL, MLIL, and decomp can make reversing much faster, especially with all the extra normal features of Binary Ninja such as xrefs and compiler optimization, despite this already being a relatively simple vm program).


The vm starts off at the entry point calling main. The next few following functions are all syscall handlers. 0x31337008 is exit, 0x31337012 is kill, 0x31337026 is write, 0x31337051 is read, and so forth. There is one special case at 0x31337030, which just writes a newline.


Then a series of simple string/memory functions are defined.


HLIL is turning out to be decently accurate! Just to show case graph view as well, here is what strlen looks like:


Binary Ninja's graph view also works in all other modes including LLIL, MLIL, and decomp.

malloc and free are the two functions following strlen (as evidenced by their extern call). Technically, the behavior of this vm's malloc is really calloc. Following those two functions is a nanosleep function and a prctl function to ignore ptrace_scope rules that the vm utilizes throughout with the virtualized anti debug i mentioned earlier.

Before continuing, there are 2 more important functions to address: 0x31337644 and 0x31337724.
The first is just a simple xor decryption routine, given the previously mentioned encrypted message struct in the program itself:


The second function is just a wrapper around this, that also zeroes it out post decryption and frees it.


At this point, there are also 3 anti-debug functions that get utilized throughout.


This calls the real program's verification function to ensure that patching didn't occur.


This above (the entire thing isn't shown) is just a virtualized version of the basic PTRACE_ATTACH pair with PTRACE_TRACEME pair anti debug in the constructors. The other anti-debug (at 0x3133718e) is more special. While it does a similar thing, do take a careful look at the very end


Notice how it's retrieving the regs of the parent process, modifying the R15 register, and then setting the registers of the parent process. Back in the vm syscall handler, you might have noticed how rsi was stored into r15 and then restored back after the syscall. In this case, this anti debug measure is modifying the control flow of the parent process once it's waiting for the child process to exit via wait4 by incrementing the vm rip to skip right to the return rather than decrypting another failure message and exiting.

Before we discuss main now, I will discuss some of the functions used for flag checking. At 0x31337930, the vm initializes some values in the vmdata. I'm calling these state1, state2, and rounds for reasons that will become clear soon.


The following function then uses those global variables:


This is basically an xorshift128p prng, that has an added factor of a rounds variable of byte size to keep it looping based on the results from the pseudo random number generated.


If you xor decrypt with the given constants on the variables in the function stack frame, a quick search should lead you to the fnv1a64 hash function, which is a non cryptographic hash function.

The following is the primary function that is used to transform given input.


It iniitalizes the seed, and then increments a counter by 8 64 times, with our input being the first argument to the function. Each time in the loop, the following happens:


It xors an 8 byte value from an array at 0x13370520 with 0xf33dbeef1337beef, and divides it by 0x7baaadd15ea5e5. It does this for two consecutive 8 byte values and then builds a 2 byte character array by treating the results of the previous arithmetic operations as indices to send to the fnv1a64 hash function, storing the 8 byte result in a uint64_t array at 0x13370420, and then xors that result with the return value from the prng. If one were to apply the “decryption” algorithm on the “index” array, one may realize that the values only range from 0 to 63 but are shuffled.

Anyways, all that is left is now is the main function.


This first part of main is quite self explanatory. Notice how it ensures that your strlen is 0x40 (which matches up with the observation of the “decrypted” indices from earlier on).


Then it takes the transformed input array and compares it to a resultant array (after xoring each 8 bytes with 0x2d969360de674bbb). If all the comparison is good, it calls the flag giver function from the real program and ends.

By extracting our knowledge from this vmcode and data from the vmdata section, we can come up with the following function to generate the correct input:


This results in [DUCK.IO]::QUACK::TO_HUMAN("QUACKITY QUACKME MOTHERQUACKERS!!");

Entering this passes the checks and finally gets us the flag!


Overall, the solvers had fun in this challenge, and I definitely enjoyed seeing just how powerful the Binary Ninja API can be (I have utilized it a few times before, most recently on the 2k vm challenge from redpwn ctf 2021). The link to the rough plugin (that is not completely accurate) is here for you to play with. A huge shoutout must also go to my teammate st4ckh0und for giving me the inspiration for a CISC VM reversing challenge and helping me become a better reverser. If you haven't already, you should check out some other reversing challenges from corCTF by other authors. vmstack was a windows VM reversing challenge that had a CISC abstracted to RISC abstracted to subleq vm interpreter running a vm which ran another subleq interpreter, AliceInSleeperio showcased some public tooling utilizing WinLicense Unpacker and Oreans Unvirtualizer, AliceInCeptionland was a C# video game crackme, generic-obfuscator was a classic algorithms reversing challenge, and Circus was a rust task (our team name is Crusaders of Rust after all).