Search This Blog

Thursday, August 26, 2021

corCTF 2021 vmquack writeup: Writing a Custom Binary Ninja Architecture Plugin to Devirtualize a Crackme

vmquack was a CISC VM reversing challenge I wrote for corCTF 2021 loosely based on x86_64 as I have been expanding my skillset beyond just pwn recently. It utilized a few anti-debugging tricks inspired by this pdf, and just for fun, I decided to play around with Binary Ninja's API to write a disassembler for the vmcode with graph view, and even lift it to have decompilation capabilities. Since this is a reversing challenge (and as an author, there is a ton of authorship bias since I'm basically reversing something I wrote source for), I will do my best to present the important pieces of information and try to present it in a way to replicate how someone would approach this challenge.

Initial analysis lets us know that this is a stripped static ELF. Though often seen as an annoyance in reversing challenges, I pretty much had to link it statically to get the crackme to work in a stable manner across different distros. Regardless, my challenge made very little use of library functions and main is trivial to get to from the entry point. Running the binary shows the following:

Trying to run it in gdb results in the following:

Running in strace results in the following:

Already, there is some anti-debug, but these are pretty trivial to bypass - they are in the constructors before main which you can find addresses for in the .init_array. The first one is just setting up another trap signal handler, and since gdb won't use it, the program goes down an alternative execution path that leads to an exit. The second one is just a classic fork, PTRACE_ATTACH PTRACE_ATTACH PTRACE_TRACEME PTRACE_TRACEME anti-debug. If any of these calls return an unexpected value (such as -1 on the first call or 0 on the second call of the same ptrace option), it sends a sigkill signal to the parent thread (the prctl(PR_SET_TRACER, PR_SET_PTRACER_ANY) was needed to ignore the default ptrace_scope values in yama LSM). These mechanisms are quite easy to bypass, but I also revirtualized the second anti-debug technique with 2 more anti-debug/patching techniques hidden in the virtualized code region, which are seen as quack_quack_data at 0x13370000 and quack_quack_text at 0x31337000 in memory and elfparsers.

Looking at main at 0x402d01, another function that is commonly used is 0x402dc7, which is what writes out the program strings (all the program strings were encrypted by an 8 byte xor key and stored in a struct containing the length, the 8 byte xor key, and a pointer to the encrypted string). 0x40299e is the beginning of the vm, and afterwards, the program checks the return value. If it is 0, it is considered a success, else a failure.

Now, we can start reversing the vm handler itself. I'll just highlight a few important parts and emphasize important parts of the logic rather than going through every single handler. It's important to note that the vm handlers were all written with handwritten inline assembly, so a decompiler won't be of much use here.

Looking at the vm start:

As you might have seen in main, rdi carries the location of the vm_context struct, with rsi acting as a pointer to the vmcode, rdx being a pointer to the mmap'd region (which acts as the vm stack), with the next argument representing the vm stack size. As mentioned earlier, this VM is based on x86_64, so the offsets in the context struct correspond to the vm registers that represent their x86_64 equivalent (even the rflags register is included). Although unclear for now, offset 0 of the context struct is a spot for immediate operands.

The rsi register is used as the vm instruction pointer, r15 points to the dispatch table of handlers, and rdi points to the vm context. The following information will become apparent later on as I go through more vm handlers, but rax is utilized as an idx for jump tables and sometimes as a temporary variable, rbx is used to hold information about instruction operating size and addressing modes, r8 and r9 are generally used for destination and source for vm opcode handling, and r10 to r14 are utilized as temporary variables. vmquack supports immediate to register, immediate to memory (where memory address is in a register), register to register, register to memory, memory to register, and memory to memory addressing modes, which is different from x86_64. Operands must match in size, with supported sizes being BYTE, WORD, DWORD, and QWORD.

Now, let's look at the main dispatch handler:

It uses the lodsb instruction to automatically adjust rsi and store the instruction opcode into rax (effectively adjusting our vm rip for us). Notice how it does check the vm opcode is within the dispatch table; if it fails, it jumps into 0x401c71, which is just a ud2 instruction. This jump to ud2 theme is repeated throughout to help guide reversers to understand valid instruction formats.

Now, let's look through the instruction handlers. I'll start by detailing some of the special handlers.

Opcode 0:

At 0x402a14, it just restores all the pushed registers in the vm starting handler, and then returns. This represents an halt opcode, and allows the main program to know the return value of the vm based on the virtual RAX register.

Opcode 1:

This is a syscall handler. This can also be a great way to help us identify and confirm our hypothesis about which offset in the vm context matches with which x86_64 register (the entire VM is basically baby x86_64 with some differences in terms of what operands are expected and valid addressing modes). The return value of the syscall is also stored, before jumping back to the main dispatch handler.

Opcode 3:

This function moves sets the real registers accordingly as if it's going to make a function call according to SYSVABI convention. It also reads a byte after the instruction opcode and treats it as an index. The stack is changed before being restored back later on (for additional arguments that go beyond the number of given registers for function call arguments, although alignment isn't guaranteed here). It also makes a call based on a function table at 0x4b0be0; this is basically a call of the real program's functions. Going through the function table based on known constants and strings, index 0 is for a function that sha256sum verifies portions of the binary against patching, index 1 is malloc, index 2 is free, and index 3 invovles some AES decryption (as you will see later, this is called with your input as arguments when the flag checker passes to give you the flag). Invalid indices do not go to a ud2 in this case, but rather just returns -1 to the vm.

The last special opcode is opcode 32. This does nothing but jumps back to the main_handler, hence it is just a nop.

Before I continue looking at other opcodes, there are a few operand handlers that I would like to discuss.

At 0x401dfc, we see the start of the following function (zoomed in as the graph view is very large):

From this instruction, we can determine that the byte proceeding after most opcodes contain information about the addressing modes (as well as size which you will see later on). Take note how addresses are constantly pushed, then a jump to elsewhere happens. At most of those places, it will end at a ret to return back to that pushed location in the previous handler. I wasn't sure if there was terminology for this, but Asphyxia mentioned to me while working on this that this is obfuscation with defer-like control flow.

We notice that if the first bit is set in the second byte, it jumps to a region that looks like this:

Based on the bits set in the previously consumed byte, it either reads in a byte, a word, a dword, or a qword (as the vm instruction length is variadic). This is the vm consuming some intermediate value. It also moves the value into the r9 register. The value is stored in the immediate register offset in the vm context when it returns to 0x401d7f and r9 will hold the pointer to that location (this will help with the consistency for operating between immediate/register/memory for instruction handlers). If it didn't go into this immediate branch, it goes and consumes a register, which it determines by consuming the next byte, decrementing, and comparing (the correct offset's address will then be loaded into rax, which then gets transferred to r9):

Do note that before the register selection, it also determines whether this register will be used as just a register, or as a pointer to memory. It does so in the push of the address onto the stack which the register cases then return too.

In this case, 0x401d9e represents the case of a plain register, while 0x401dba represents a memory operand based on the register (their disassembly makes that very clear).

Lastly, notice how at the beginning of this handler, the push will make it return 0x401e32, which just sets r8 equal to r9 (r9 held the value of intermediate/register/memory), before returning to the opcode handler that called it. This entire handler is just for determining the addressing modes used for unary operands.

Another commonly used operand handler is at 0x401e3c. It basically does similar things, but only handles registers (and registers pointing to memory) with the 3rd byte in the instruction, and stores results into the r8 register. It's initial push however causes it to return to 0x401e63, which behaves the same as unary handler on the 4th byte (and allows for immediates as well). However, this time the values in r9 aren't moved into r8, and then after these handlers finish, it goes back to the opcode handlers that called it. Overall, this is a binary opcode operand handler. As you will see later on, r8 holds the destination, while r9 holds the source equivalent in x86_64 during the instruction handlers.

The last used operand handler for instructions is at 0x401e9c which does the same thing as the first part in the binary opcode handler, but then goes into 0x401ec7, which just lodsb a byte and moves it into rcx, and then returns into handlers. As one will see later (or can perhaps even guess based on x86_64 behavior), this is a shift opcode handler (but only 1 byte immediates are allowed for source operand).

From the above conditionals for register selection and size, we can determine the following information. The second byte holds addressing mode and size information. The 0th bit represents the use of an immediate source, the 1st bit represents the use of a register source, the 2nd bit represents the use of a memory source, the 3rd bit represents the use of a register destination, the 4th bit represents the use of a memory destination, the 5th bit represents byte sized operations, the 6th bit represents word sized operations, the 7th bit represents dword sized operations, and the default operation size is for qwords. Moreover, when a byte is used to determine the register used for register or memory referencing based on register value, the following values are used: 0 is RAX, 1 is RBX, 2 is RCX, 3 is RDX, 4 is RSI, 5 is RDI, 6 is R8, 7 is R9, 8 is R10, 9 is R11, 10 is R12, 11 is R13, 12 is R114, 13 is R115, 14 is RBP, 15 is RSP, and 16 is RFLAGS.

With these operand handlers for opcodes finished, we can go back to the instruction handlers.

Opcode 2:

This opcode calls the unary operand handler, and only accepts either a qword register or memory operand, or a byte or dword operand. When taking a register or memory operand, it directly sets the vm rip to it. In the other case, it adds it after sign extending. Also note that other cases lead to a ud2, and that stack offset is decremented by 8, with the current vm rip opcode (this will be pointing to the next vm instruction from the current one as rsi is adjusted from lodsb) stored in it. This is a vm call instruction. Opcode 7 is a vm jmp, and does the same thing (except for the push of the next instruction address.

Opcodes 8 to 13 are conditional jumps, and only take relative immediate byte or dword offsets from the next vm instruction. It also stores the real program's flags, before using the vm's flags register to make the determination of the jump before restoring the original flags.

  Opcode 4 is another important one:

If you remember opcode 2, this is popping from the vm stack and restoring the vm rip to it, making this a ret instruction.

Opcode 5 and 6 are quite important. Here is opcode 5:

Notice the stack adjustment, and how it stores the result from the unary handler onto the vm stack. If the operating size is a qword, then immediates aren't allowed. Otherwise, only immediates are allowed and they are sign extended. This is the vm push instruction

Opcode 6 you can probably guess:

Only the qword operand size is allowed, the stack is deremented with the value stored in the specified unary operand. This is a vm pop. Technically, popping into the immediates is allowed by this logic, but I just never did that in my vm code :^)

Opcode 30 is another key handler.

This is the vm mov instruction, which the vmcode itself ended using a lot.

Lastly, I will go over 2 more instruction handlers, one unary instruction and one binary instruction.

This is a vm dec instruction. Notice how the vm flags are set accordingly by switching the program's rflags register to the vm's rflags register, and then restored back accordingly. Also take note of the extra branching for when the case the operand size is dword and the destination operand is a register, as it behaves the same as x86_64 and zeroes out the upper 32 bits.

Opcode 22 is an example of a binary operation.

This is a vm xor, with the source being in r9 and the destination being in r8. Notice how it sets the vm flags and handles register destination of size dwords similarly to the previous unary operation.

The rest of the vm handlers are pretty similar, as knowing what they do is quite obvious (since the instruction is in the handler). A few noteworthy exceptions is that vm_shl and vm_shr use the shift operand handlers as mentioned earlier, and that mul, imul, div, and idiv handle the vm behavior just like how it would handle it in x86_64 (with the exception that now imul and idiv only have the one operand form). There are probably a few bugs throughout the vm handlers (such as in idiv) because writing a sizable vm reversing challenge in manual assembly probably isn't the best thing to do a few days before a CTF starts.

At this point, the handlers should be marked accordingly:

Now, we can write a disassembler to help us figure out what the vmcode is doing. For this section, I decided to experiment with Binary Ninja's API and wrote a crude plugin (source linked at the end of this post); my teammate hgarrereyn on DiceGang always makes Binary Ninja plugins for vm reversing challenges, so this was where I took some my inspiration from. To make my plugin work seamlessly on the dump of vmdata and vmcode data I made, I just broke it into a file format starting with the string vmquack, followed by addresses and respective lengths of each vm section. Disassembly, control flow graphs, binary view, and section permissions are all detailed with this wonderful post from Binary Ninja. As for the rest, you pretty much have to rely on other Binary Ninja plugins (such as their Z80 plugin and a community riscv plugin) with their python api doc. For most of the LLIL lift, you can rely on this page, and just append LLIL instructions to the given IL array for the get_instruction_low_level_il method in the custom architecture class. When implementing the lift, I found it helpful to break each operand into its own LLIL expression, before applying the overall LLIL expression on top. One thing to note is about flags lifting. I didn't do this part very accurately (just enough to generally work), but Binary Ninja has some very basic documentation for it in their dev branch of release. You can create your own (which I didn't) by setting a SpecialFlagRole and overriding behavior in get_flag_write_low_level_il, but they also do handling for some common flags for very few instructions (add, sub, neg). Applying calling convention is another important task for HLIL, but the docs for those are quite self-explanatory; I just set a default calling convention and one for syscalls.

Now with the plugin, I'll provide a brief overview of the VM (graph view and the ability to look between LLIL, MLIL, and decomp can make reversing much faster, especially with all the extra normal features of Binary Ninja such as xrefs and compiler optimization, despite this already being a relatively simple vm program).

The vm starts off at the entry point calling main. The next few following functions are all syscall handlers. 0x31337008 is exit, 0x31337012 is kill, 0x31337026 is write, 0x31337051 is read, and so forth. There is one special case at 0x31337030, which just writes a newline.

Then a series of simple string/memory functions are defined.

HLIL is turning out to be decently accurate! Just to show case graph view as well, here is what strlen looks like:

Binary Ninja's graph view also works in all other modes including LLIL, MLIL, and decomp.

malloc and free are the two functions following strlen (as evidenced by their extern call). Technically, the behavior of this vm's malloc is really calloc. Following those two functions is a nanosleep function and a prctl function to ignore ptrace_scope rules that the vm utilizes throughout with the virtualized anti debug i mentioned earlier.

Before continuing, there are 2 more important functions to address: 0x31337644 and 0x31337724.
The first is just a simple xor decryption routine, given the previously mentioned encrypted message struct in the program itself:

The second function is just a wrapper around this, that also zeroes it out post decryption and frees it.

At this point, there are also 3 anti-debug functions that get utilized throughout.

This calls the real program's verification function to ensure that patching didn't occur.

This above (the entire thing isn't shown) is just a virtualized version of the basic PTRACE_ATTACH pair with PTRACE_TRACEME pair anti debug in the constructors. The other anti-debug (at 0x3133718e) is more special. While it does a similar thing, do take a careful look at the very end

Notice how it's retrieving the regs of the parent process, modifying the R15 register, and then setting the registers of the parent process. Back in the vm syscall handler, you might have noticed how rsi was stored into r15 and then restored back after the syscall. In this case, this anti debug measure is modifying the control flow of the parent process once it's waiting for the child process to exit via wait4 by incrementing the vm rip to skip right to the return rather than decrypting another failure message and exiting.

Before we discuss main now, I will discuss some of the functions used for flag checking. At 0x31337930, the vm initializes some values in the vmdata. I'm calling these state1, state2, and rounds for reasons that will become clear soon.

The following function then uses those global variables:

This is basically an xorshift128p prng, that has an added factor of a rounds variable of byte size to keep it looping based on the results from the pseudo random number generated.

If you xor decrypt with the given constants on the variables in the function stack frame, a quick search should lead you to the fnv1a64 hash function, which is a non cryptographic hash function.

The following is the primary function that is used to transform given input.

It iniitalizes the seed, and then increments a counter by 8 64 times, with our input being the first argument to the function. Each time in the loop, the following happens:

It xors an 8 byte value from an array at 0x13370520 with 0xf33dbeef1337beef, and divides it by 0x7baaadd15ea5e5. It does this for two consecutive 8 byte values and then builds a 2 byte character array by treating the results of the previous arithmetic operations as indices to send to the fnv1a64 hash function, storing the 8 byte result in a uint64_t array at 0x13370420, and then xors that result with the return value from the prng. If one were to apply the “decryption” algorithm on the “index” array, one may realize that the values only range from 0 to 63 but are shuffled.

Anyways, all that is left is now is the main function.

This first part of main is quite self explanatory. Notice how it ensures that your strlen is 0x40 (which matches up with the observation of the “decrypted” indices from earlier on).

Then it takes the transformed input array and compares it to a resultant array (after xoring each 8 bytes with 0x2d969360de674bbb). If all the comparison is good, it calls the flag giver function from the real program and ends.

By extracting our knowledge from this vmcode and data from the vmdata section, we can come up with the following function to generate the correct input:


Entering this passes the checks and finally gets us the flag!

Overall, the solvers had fun in this challenge, and I definitely enjoyed seeing just how powerful the Binary Ninja API can be (I have utilized it a few times before, most recently on the 2k vm challenge from redpwn ctf 2021). The link to the rough plugin (that is not completely accurate) is here for you to play with. A huge shoutout must also go to my teammate st4ckh0und for giving me the inspiration for a CISC VM reversing challenge and helping me become a better reverser. If you haven't already, you should check out some other reversing challenges from corCTF by other authors. vmstack was a windows VM reversing challenge that had a CISC abstracted to RISC abstracted to subleq vm interpreter running a vm which ran another subleq interpreter, AliceInSleeperio showcased some public tooling utilizing WinLicense Unpacker and Oreans Unvirtualizer, AliceInCeptionland was a C# video game crackme, generic-obfuscator was a classic algorithms reversing challenge, and Circus was a rust task (our team name is Crusaders of Rust after all).

No comments:

Post a Comment