
Wednesday, August 2, 2023

corCTF 2023 vmquack's vectorized vices: VM Obfuscation through AVX-512

For the past two iterations of corCTF, I’ve written two VM-based reversing challenges: vmquack and vmquack’s revenge. The first was a handwritten CISC VM inspired by x86, and the latter was a massive RISC-V-inspired VM generated through my own homerolled C compiler.

In this year’s edition, my challenge co-author anematode and I took inspiration from a chapter on AVX-512 programming in Daniel Kusswurm's book Modern Parallel Programming in C++ and Assembly. An important concept here is mask registers - in AVX-512, one can perform SIMD instructions on only selected subsegments of each 512-bit ZMM register through the use of mask registers. For example, if you direct certain AVX-512 instructions to use one of the registers k1-k7 as a bitmask, then something of the form op zmmX, zmmY, zmmZ applied at a 64-bit granularity will only apply the operation at the indices set in the bitmask and store to those same indices in the destination (many of the AVX-512 instructions are RISC style, taking a destination and two sources). Any of the mask registers can also store data, including k0 despite its encoding representing the default no-masking mode, and one can also specify zero-masking to zero out the indices that are unset in the bitmask. There’s a lot more to this in regards to performance programming that you can find in the above book or in the Intel SDM, but the above is the primary concept necessary for understanding this VM obfuscation scheme.
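
To make the masking semantics concrete, here is a tiny lane-level model in Python - not real intrinsics, just a sketch of merge- versus zero-masking at 64-bit granularity:

# toy model: a "zmm" is a list of eight 64-bit lanes, k is an 8-bit mask
def masked_add_q(dst, a, b, k, zero_mask=False):
    out = []
    for i in range(8):
        if (k >> i) & 1:
            out.append((a[i] + b[i]) & ((1 << 64) - 1))  # selected lane: perform the op
        else:
            out.append(0 if zero_mask else dst[i])       # unselected: zero ({z}) or keep dst (merge)
    return out

a, b, dst = list(range(8)), [10] * 8, [99] * 8
print(masked_add_q(dst, a, b, 0b00000100))        # [99, 99, 12, 99, 99, 99, 99, 99] (merge-masking)
print(masked_add_q(dst, a, b, 0b00000100, True))  # [0, 0, 12, 0, 0, 0, 0, 0]        (zero-masking)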

Given that each ZMM register can be divided into eight 64-bit registers, I decided to build a VM obfuscation scheme based around AVX-512 and its mask registers, where each 64-bit VM register is represented through a ZMM register and index pair, effectively creating a VM with 256 registers. In contrast to other common VM obfuscation schemes where code is transformed into non-native bytecode that needs to be interpreted, I can inline obfuscated code sections "natively" within x86 functions (the CPU must support AVX-512 F, CD, DQ, and BW though, but players can still run it through the SDE). If this scheme hasn’t been used before, I would like to name it “avxfuscation,” inspired by the famous movfuscator.

As this is a reversing challenge, it wouldn’t make much sense for me to describe how I solved it as the author. I think it would still be interesting to discuss some of the emitted obfuscated code sequences, my strategy for deobfuscation, a brief tldr of solving the challenge crackme, and the development of the obfuscator.

We begin by discussing the most common operations that appear in the VM.

Firstly, the kmovb instruction will appear quite frequently, usually in the form of kmovb kX, byte ptr [bitmask_constant]. This is the setup for mask registers before VM obfuscated opcodes. As a VM register is a zmmX register and index pair, the bitmask will usually only enable one 64 bit element in any given 512 bit register.

The following sequence is used when storing results from AVX-512 instructions back to globals from zmmX and spilled stack variables (the VM register pressure should have never gotten high enough to trigger the latter scenario):

vpcompressq zmm31 {kX}{z}, zmmX
vmovq [mem], xmm31

The first instruction compresses the qword elements selected by the mask in the source register (which in our VM is generally just a single element) down into the low lanes of zmm31 - note that the zero-masking will zero out the remaining seven higher qwords in the destination.
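
If it helps, here is the compress-with-zero-masking behavior modeled in Python over a toy eight-lane list (just a sketch; the helper name is mine):

def vpcompressq_z(src, k):
    # pack the lanes selected by k into the low lanes of the result, zero the rest ({z} form)
    packed = [src[i] for i in range(8) if (k >> i) & 1]
    return packed + [0] * (8 - len(packed))

zmmX = [11, 22, 33, 44, 55, 66, 77, 88]
# with a single-bit VM mask, the selected qword always lands in lane 0,
# which is exactly what the following vmovq stores to memory
print(vpcompressq_z(zmmX, 0b00010000))  # [55, 0, 0, 0, 0, 0, 0, 0]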

The same sequence as above is used to handle VM register spills, with the destination of the vmovq replaced by the stack spill address. Restoration back into a VM register can be done with vpbroadcastq (which broadcasts, as the mnemonic implies) with the specified zmmX and index pair as the destination and the spilled address as the memory operand source. Similarly, loading a global, a constant (in this challenge, I encoded all constants as globals too), or a spilled stack variable for an AVX-512 instruction can be done with the same vpbroadcastq sequence into a temporary VM register.

For the purposes of this writeup, I will now refer to the zmmX and mask index pair as the VM register pair. zmm30 and zmm31 were treated as scratch registers in my obfuscator.

To handle VM register to VM register moves, a vpcompressq transfers the source VM register pair into a VM scratch register (zmm31 or zmm30), and then broadcasts the scratch register into the destination VM register pair.
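
In that same toy lane model, a register-to-register move boils down to the following (an illustrative sketch, not the emitted code):

def vm_move(dst_zmm, dst_k, src_zmm, src_k):
    # vpcompressq {z}: the lane selected by src_k ends up in lane 0 of the scratch register
    scratch0 = next(src_zmm[i] for i in range(8) if (src_k >> i) & 1)
    # vpbroadcastq zmmDst {kDst}, xmm31: broadcast lane 0 and merge into the selected dest lane
    return [scratch0 if (dst_k >> i) & 1 else dst_zmm[i] for i in range(8)]

# move VM register (zmmX, lane 4) into VM register (zmmY, lane 1)
print(vm_move([0] * 8, 0b00000010, [11, 22, 33, 44, 55, 66, 77, 88], 0b00010000))
# -> [0, 55, 0, 0, 0, 0, 0, 0]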

For most general arithmetic instructions of the CISC form op Y, X, the following sequence is seen:

vpcompressq zmm31 {kX}, zmmX
vpbroadcastq zmm31, xmm31
op zmmY {kY}, zmmY, zmm31

The op in the VM was one of vpmullq, vpaddq, vpsubq, vpandq, vporq, vpxorq, vpsllvq, vpsrlvq, vpsravq (the last three are variable shifts). This sequence fills all qword segments of the scratch VM register with the contents of the source VM register before using it as the second source operand in the main AVX-512 arithmetic instruction, effectively emulating the above CISC instruction form.
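
Modeled in Python lanes again, the whole three-instruction sequence behaves roughly like this (vm_arith and the operand layout are illustrative):

def vm_arith(dst_zmm, dst_k, src_zmm, src_k, op):
    # compress + broadcast: every lane of the scratch now holds the source VM register's value
    scratch = [next(src_zmm[i] for i in range(8) if (src_k >> i) & 1)] * 8
    # op zmmY {kY}, zmmY, zmm31: apply the op only on the destination lane, merge the rest
    return [op(dst_zmm[i], scratch[i]) & ((1 << 64) - 1) if (dst_k >> i) & 1 else dst_zmm[i]
            for i in range(8)]

# VM "add rY, rX" with rY = (zmmY, lane 1) and rX = (zmmX, lane 4)
zmmX = [11, 22, 33, 44, 55, 66, 77, 88]
zmmY = [0, 7, 0, 0, 0, 0, 0, 0]
print(vm_arith(zmmY, 0b00000010, zmmX, 0b00010000, lambda x, y: x + y))  # lane 1 becomes 62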

Constant shifts are much simpler, as they can be done directly with the VM register pair and a shift constant using vpsllq, vpsrlq, or vpsraq.

In this VM, comparisons were of the RISC-style cmp dest, src1, src2 - this made it easier for me to write the obfuscator, as a CISC-style compare would have necessitated additional work with the RFLAGS register. The same sequence as the generic arithmetic one is used, with op being one of vpcmpeqq, vpcmpneqq, vpcmpltq, vpcmpleq, vpcmpgtq, and vpcmpnltq. These instructions are of the form op kX {kY}, zmmZ, zmmW, where the two zmm registers are the two comparison operands, and the first mask register stores the result of the SIMD compare under the second (write) mask register. The second mask and the first zmm register represent the first source’s VM register pair. Then, a vpbroadcastmb2q zmm31, kX instruction broadcasts the result mask register into the qword elements of the scratch register. If the compared lane’s index is not 0, the result is shifted down to bit 0 with a vpsrlq from the scratch register into the destination under its mask; otherwise, vmovdqu64 is used. This sequence provides a nice property in which the results of comparisons are always guaranteed to be either 0 or 1.
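
Here is a rough lane-model sketch of why the stored result is always 0 or 1, assuming the vpsrlq count matches the compared lane’s index (all names here are made up for illustration):

def vm_cmp(src1_zmm, src1_k, src2_val, cmp):
    # vpcmpXq kX {kY}, zmmZ, zmmW: compare only the masked lane, set its bit in kX on a hit
    kx = 0
    for i in range(8):
        if (src1_k >> i) & 1 and cmp(src1_zmm[i], src2_val):
            kx |= 1 << i
    # vpbroadcastmb2q: every qword lane of the scratch register now holds kX as an integer,
    # so a right shift by the compared lane's index leaves exactly 0 or 1 in every lane
    lane = src1_k.bit_length() - 1   # assumes a single-bit VM mask
    return [kx >> lane] * 8

print(vm_cmp([11, 22, 33, 44, 55, 66, 77, 88], 0b00010000, 50, lambda x, y: x > y))  # all 1s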

A similar idea applies to the VM’s jumps. Instead of vpbroadcastmb2q followed by a store, ktestb (the equivalent of a normal test in x86) is used with either jz or jnz along with a potential jmp (depending on which basic block followed, which my obfuscator laid out using a min cut max flow algorithm where loop nesting depth equates to cost).

Like the VM comparisons, there were also RISC-style neg and not instruction sequences (e.g. dest = not src, dest = neg src). neg simply involved zeroing out a scratch register with vpxorq zmm31, zmm31, zmm31 followed by a vpsubq into the VM destination register pair. not is a bit more interesting, especially as I aimed to preserve the property of the result being only 0 or 1, as in comparisons - my obfuscation procedure ensured that arguments to the not instruction were always of “boolean” type, meaning either 0 or 1 as well. After the initial vpcompressq and vpbroadcastq to set up the scratch zmm31 register with the contents of the VM source register, the following sequence arose before broadcasting zmm31 into the destination VM register:

vpternlogq zmm31, zmm31, zmm31, 0xf
vpxorq zmm30, zmm30, zmm30
vpternlogq zmm30, zmm30, zmm30, 0xf
vpabsq zmm30, zmm30
vpandq zmm31, zmm31, zmm30

vpternlogq is a fascinating instruction. To quote the SDM:

VPTERNLOGD/Q takes three bit vectors of 512-bit length (in the first, second, and third operand) as input data to form a set of 512 indices, each index is comprised of one bit from each input vector. The imm8 byte specifies a boolean logic table producing a binary value for each 3-bit index value. The final 512-bit boolean result is written to the destination operand (the first operand) using the writemask k1 with the granularity of doubleword element or quadword element into the destination.

Basically, the bits of the byte-sized immediate act as a logic table that is indexed by each bit triplet formed from the three operands, with the most significant bit coming from the first operand. Given the bit representation of 0xf and that all three operands are the same, zmm31 will be complemented. If I’m not mistaken, there are a few other constants that would also work. The sequence of vpxorq and vpternlogq on zmm30 will then set zmm30 to -1 in two’s complement, before vpabsq converts it back to 1. The vpandq thus completes the RISC-style not.
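
A quick way to convince yourself of the truth table behavior is to evaluate it directly in Python (vpternlog_bit is just an illustrative helper):

def vpternlog_bit(a, b, c, imm8):
    # index the 8-bit truth table imm8 with the bit triplet (a is the most significant bit)
    return (imm8 >> ((a << 2) | (b << 1) | c)) & 1

# with all three operands identical, only table entries 0b000 and 0b111 are ever used;
# imm8 = 0x0f has bit 0 set and bit 7 clear, so every bit gets complemented
for bit in (0, 1):
    print(bit, '->', vpternlog_bit(bit, bit, bit, 0x0f))  # 0 -> 1, 1 -> 0

Any imm8 with bit 0 set and bit 7 clear behaves the same way when all three operands are identical, which is presumably where the other working constants come from.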

Division and modulus are quite complex. The general idea is to use vcvtqq2pd to convert qword integers into double-precision floating point, apply vdivpd, and convert back to a qword integer with vcvttpd2qq. Finding the modulus from this result is then trivial with a multiplication and subtraction. I first heard of a trick similar to this from anematode when attempting to optimize variable division on x86 - while the latency won’t necessarily be much better, relying on the AVX unit greatly improves throughput on Intel CPUs compared to how idiv ties up the ALU. However, this trick only works if the divisor is not zero and both the dividend and divisor fit losslessly within the integer limit for double-precision floating point (so between -2^53 and 2^53). Otherwise, there is no choice but to fall back to idiv. The following instruction sequence was used after loading zmm30 with the dividend and zmm31 with the divisor (note that division and modulus are also in a RISC format in my obfuscator):

vpabsq zmm30 {kX}, zmm30
vpabsq zmm31 {kX}, zmm31
vfpclasspd kY {kX}, zmm30, 0x22
vfpclasspd kZ {kX}, zmm31, 0x22
ktestb kY, kZ
jz .edge_div
vcvtqq2pd zmm30, zmm30
vcvtqq2pd zmm31, zmm31
vdivpd zmm30, zmm30, zmm31
vpbroadcastq zmm30, xmm30
vcvttpd2qq zmmDest {kDest}, zmm30
jmp .end_div
.edge_div:
vmovq rax, xmm30
cqo
vmovq gpr, xmm31
idiv gpr
vpbroadcastq zmmDest {kDest}, rax
.end_div:

kX is set up to hold a special bitmask with the value 0b10101010 - this way, the absolute value only applies to the odd qword elements of the scratch registers storing the dividend and the divisor. vfpclasspd then tests for the properties selected by 0x22 using the same bitmask (storing the result into another mask register) - based on the Intel SDM, 0x22 checks for either positive zero or a denormal value. In both cases the exponent bits are all zero (for denormals, the normally implicit leading 1 of the mantissa also becomes 0). Note that this instruction is meant to run on double-precision floating point numbers, but we are running it on integer bit patterns to check whether they can be losslessly stored in range: if the exponent bits of the absolute-valued integer are all zero, its magnitude is below 2^52. The check isn’t exact though, as some valid cases can still be sent to idiv - integers between 2^52 and 2^53 would still fit losslessly in a double but fail this check. Only when writing this writeup did I realize that the zero check should not have been included for the divisor, as a divide by zero then fails to trigger an exception, though I guess some additional work with the MXCSR register could have properly virtualized this behavior. Lastly, in the obfuscated path, index 0 of zmm30 (which was unaffected by vpabsq) holds the signed quotient, which is broadcast and then converted by vcvttpd2qq to store the result back in integer form.
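
For intuition, here is a rough Python sketch of the whole divide/modulus strategy, with the vfpclasspd check approximated as “exponent bits all zero, i.e. magnitude below 2^52” (function names are mine, and this glosses over rounding corner cases of the floating point path):

def exponent_bits_zero(x):
    # what the vfpclasspd 0x22 check boils down to after vpabsq: the bit pattern of |x|,
    # read as a double, is +0 or denormal, i.e. its exponent bits are all zero (|x| < 2**52)
    return abs(x) < (1 << 52)

def vm_divmod(a, b):
    if b != 0 and exponent_bits_zero(a) and exponent_bits_zero(b):
        q = int(a / b)                 # vcvtqq2pd, vdivpd, vcvttpd2qq (truncation toward zero)
    else:
        q = abs(a) // abs(b)           # the cqo/idiv edge path
        if (a < 0) != (b < 0):
            q = -q
    return q, a - q * b                # modulus falls out with one multiply and subtract

print(vm_divmod(-7, 2))                # (-3, -1), matching idiv semantics
print(vm_divmod(1 << 60, 3))           # out of range, so it takes the integer fallback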

Anematode and I also managed to obfuscate array indexing, memory addrof, load, and store instructions in this format.

For qword array indexing, the address is first fetched with an lea into a gpr (general purpose register). Then the index VM-register pair is compressed and broadcasted into zmm31. To store the value into VM register pair Y, we used vpgatherqq zmmY {kY}, [gpr + zmm31 * 8]. This loads the contents pointed to by the gpr using the array of qword indices with a scale factor of 8 into zmmY given the mask in kY.
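
A sketch of the gather semantics in the toy lane model (memory is modeled as a dict from byte address to qword; the names are illustrative):

def vpgatherqq(dst_zmm, k, memory, base, index_zmm, scale=8):
    # for each selected lane, load the qword at base + index*scale; unselected lanes are untouched
    return [memory[base + index_zmm[i] * scale] if (k >> i) & 1 else dst_zmm[i]
            for i in range(8)]

# qword array at 0x1000; the VM index register was already broadcast into every lane of zmm31
memory = {0x1000 + 8 * i: v for i, v in enumerate([5, 6, 7, 8])}
zmm31 = [2] * 8
print(vpgatherqq([0] * 8, 0b00000010, memory, 0x1000, zmm31))  # lane 1 <- memory[0x1010] = 7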

For memory addrof given an index, the array address is first loaded into a gpr and broadcast into zmm31. The index is fetched and broadcast into zmm30, before being multiplied by 8 with a constant shift. A vpaddq instruction then produces the correct memory address for that index, which is stored into the destination VM register pair. This allows us to then perform loads and stores using our VM registers too.

Stores and loads are accomplished with vpscatterqq and vpgatherqq respectively. The gpr representing the base address is set to zero and the indexing zmm register is filled with the address that was fetched from the VM register holding the pointer.

We also had support for zeroing and initializing local arrays. Zeroing can be done with vmovdqu64 stores to the array address, starting at zmmword-sized chunks and working down to xmmword, before switching to vmovq for qword moves. Initialization can be done by using the same two instructions in a memcpy-like manner - my obfuscator pre-stored all the necessary data for local array initialization in the .data section.

The last group of instructions to discuss are calls and returns. Unfortunately, these could not be obfuscated, but care had to be taken for calls as I wanted to abide by the SysV ABI to preserve compatibility with external, non-obfuscated function calls. At the start of each function, all arguments have to be loaded into the correct VM register pairs, and the reverse has to happen before function calls. Before moving any VM registers into argument registers, I had to allocate additional stack space to save all ZMM registers holding VM registers live across the call and then restore them afterwards (as I believe these are all considered caller-saved in the SysV ABI). Additional care had to be taken when the VM register receiving the return value resides in one of the spilled ZMM registers.

Now that I have explained the VM, it is time to devirtualize it. My intended devirtualization path last year would have just been “write a decompiler plugin in SLEIGH or Binary Ninja LLIL” - while such a task is educational, doing that again would just be tedious. Another weakness of that strategy is that someone unfamiliar with the API’s intricacies could fail to capture a VM’s architectural complexities (and sometimes the API might not even be capable of doing so).

One observation I made is that most AVX-512 instructions are quite long - perhaps I can just pattern match the obfuscated disassembly and patch it back into a deobfuscated binary! This is actually quite a common (but naive) approach people take when deobfuscating production virtualization engines like VMProtect or older versions of Themida. An issue that arises immediately is that x86 simply doesn’t have 256 general purpose registers, but we can resolve that through stack loads and stores.

For pattern matching, I went through each function, and at each instruction address I attempt to exactly match the longest pattern - if none are found, I start matching from the next instruction (as the current instruction could have been an unobfuscated one). Otherwise, I note down the pattern match and start matching again from the instruction that immediately follows the matched pattern. If a pattern is detected to be a variadic sequence, as in local array zeroing or initialization, the next round of pattern matching starts from where the detected sequence ends.
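
Conceptually, the matching loop is just a greedy longest-match scan; a stripped-down sketch (ignoring the variadic patterns and the per-slot alternatives that the full script below handles) looks something like this:

def deobfuscate(insts, patterns):
    # insts: list of (mnemonic, address); patterns: list of (mnemonic tuple, handler)
    i = 0
    while i < len(insts):
        window = [m for m, _ in insts[i:]]
        hits = [(p, h) for p, h in patterns if window[:len(p)] == list(p)]
        if not hits:
            i += 1                                                # likely unobfuscated, skip it
            continue
        pattern, handler = max(hits, key=lambda ph: len(ph[0]))   # prefer the longest match
        handler(insts[i:i + len(pattern)])                        # patch this address range
        i += len(pattern)                                         # resume right after the match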

For mapping in the usage of stack loads and stores for variables, there were three cases I had to deal with. The first case is a prologue of mov rbp, rsp followed by sub rsp, X, paired with a leave in the epilogue. The second case is simply a mov rbp, rsp followed by a pop rbp in the epilogue, for cases where a non-leaf function needs to preserve 16-byte stack alignment but doesn’t need any local variable allocations. Both of these can be replaced with an enter instruction that allocates the required stack space for all the VM registers plus any previous stack allocations, with leave patched in for both epilogue sequences. Now, VM registers can just be referenced through rbp-relative addressing. The third case is leaf functions that require no stack allocations, where VM registers have to be referenced through rsp-relative addressing.
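
The VM register to stack slot mapping then becomes a simple affine formula; roughly (offsets here are illustrative and slightly simplified compared to the get_stack_offset helper in the script below):

def vm_reg_slot(zmm, lane, original_frame_size, use_rbp=True):
    # every (zmm, lane) pair outside the scratch registers gets its own qword slot,
    # laid out right after the function's original locals
    assert zmm < 30
    offset = original_frame_size + (zmm * 8 + lane) * 8
    return f'[rbp - {offset}]' if use_rbp else f'[rsp - {offset}]'

print(vm_reg_slot(3, 5, 0x20))  # -> '[rbp - 264]'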

Care has to be taken about instruction space when we match all the patterns in a given function and begin to patch it. It is true that most AVX-512 instruction sequences are long, but the 64-bit prefixes and memory operands needed to replicate VM registers can in some cases make the deobfuscated sequence longer. To resolve this, I pad shorter deobfuscated x86 sequences with nops that I track, and I try to shift every subsequent deobfuscated instruction upwards during deobfuscation. An instruction can be shifted upwards only if it is preceded by a nop and does not affect control flow, which I verify by checking for xrefs to the current address of the instruction. Of course, this is only a simplistic model due to the existence of indirect jumps and calls, but it suffices for the obfuscated binaries I generated.

Lastly, it is important to keep track of the state of certain registers, like the mask registers, when deobfuscating. Otherwise, the deobfuscator loses context such as the currently selected qword of a zmm register. I opted out of doing any optimizations for mask registers whose values cross obfuscated instruction sequence boundaries, as that would drastically increase complexity and might introduce multiple possible values for mask registers in some control flow blocks (I guess peepholing at the basic block level would have been fine though).

I ended up prototyping the following Binary Ninja script with comments for pattern matching based deobfuscation. In many cases, it can create an equivalent deobfuscated binary that runs properly!

from binaryninja import * from enum import Enum from collections import namedtuple import traceback, sys, math FUNCTIONS = list(bv.functions) br = BinaryReader(bv) bw = BinaryWriter(bv) arch = Architecture['x86_64'] symtab = SymbolMapping(bv) # figure out how to deobfuscate by matching first instructions Inst = Enum('Inst', [ 'KMOVB', 'KNOTB', 'KTESTB', 'VPBROADCASTQ', 'VPCOMPRESSQ', 'VPGATHERQQ', 'VPSCATTERQQ', 'VMOVDQU64', 'VMOVQ', 'VPCMPQ', 'VPCMPGTQ', 'VPCMPEQQ', 'VPSRLQ', 'VPSRAQ', 'VPSLLQ', 'VPSRLVQ', 'VPSRAVQ', 'VPSLLVQ', 'VPXORQ', 'VPORQ', 'VPANDQ', 'VPSUBQ', 'VPADDQ', 'VPMULLQ', 'VPBROADCASTMB2Q', 'VDIVPD', 'VCVTQQ2PD', 'VCVTTPD2QQ', 'VFPCLASSPD', 'VPABSQ', 'VPTERNLOGQ', 'JZ', 'JE', 'JMP', 'LEA', 'XOR', 'MOV', 'SUB', 'ADD', 'CQO', 'IDIV', 'OTHER', ]) # define series of instructions to match # anything else mapping = { "kmovb" : Inst.KMOVB, "knotb" : Inst.KNOTB, "ktestb": Inst.KTESTB, "vpbroadcastq" : Inst.VPBROADCASTQ, "vpcompressq" : Inst.VPCOMPRESSQ, "vpgatherqq" : Inst.VPGATHERQQ, "vpscatterqq" : Inst.VPSCATTERQQ, "vmovdqu64" : Inst.VMOVDQU64, "vmovq" : Inst.VMOVQ, # binja doesn't seem to be able to decode all neatly "vpcmpq" : Inst.VPCMPQ, "vpcmpgtq" : Inst.VPCMPGTQ, "vpcmpeqq" : Inst.VPCMPEQQ, "vpsrlq" : Inst.VPSRLQ, "vpsraq" : Inst.VPSRAQ, "vpsllq" : Inst.VPSLLQ, "vpsrlvq" : Inst.VPSRLVQ, "vpsravq" : Inst.VPSRAVQ, "vpsllvq" : Inst.VPSLLVQ, "vpxorq" : Inst.VPXORQ, "vporq" : Inst.VPORQ, "vpandq" : Inst.VPANDQ, "vpsubq" : Inst.VPSUBQ, "vpaddq" : Inst.VPADDQ, "vpmullq" : Inst.VPMULLQ, "vpbroadcastmb2q" : Inst.VPBROADCASTMB2Q, "vdivpd" : Inst.VDIVPD, "vcvtqq2pd" : Inst.VCVTQQ2PD, "vcvttpd2qq" : Inst.VCVTTPD2QQ, "vfpclasspd" : Inst.VFPCLASSPD, "vpabsq" : Inst.VPABSQ, "vpternlogq" : Inst.VPTERNLOGQ, "jz" : Inst.JZ, "je" : Inst.JE, "jmp" : Inst.JMP, "lea" : Inst.LEA, "xor" : Inst.XOR, "mov" : Inst.MOV, "sub" : Inst.SUB, "add" : Inst.ADD, "cqo" : Inst.CQO, "idiv" : Inst.IDIV, } def k_reg_idx(k_str): val = int(k_str[1:]) assert(val >= 0 and val <= 7) return val def zmm_reg_idx(zmm_str): val = int(zmm_str[3:]) assert(val >= 0 and val <= 31) return val equiv_ops = { 'vpaddq' : 'add', 'vpsubq' : 'sub', 'vpmullq' : 'imul', 'vpandq' : 'and', 'vporq' : 'or', 'vpxorq' : 'xor', 'vpcmpeqq' : 'sete', 'vpcmpgtq' : 'setnle', 'vpsllq' : 'shl', 'vpsllvq' : 'shl', 'vpsraq' : 'sar', 'vpsravq' : 'sar', 'vpsrlq' : 'shr', 'vpsrlvq' : 'shl', } cmp_suffix = { 0x4 : 'setne', # vpcmpneqq 0x1 : 'setl', # vpcmpltq 0x2 : 'setle', # vpcmpleq 0x5 : 'setge', # vpcmpnltq } size_map = { 'zmmword' : 64, 'ymmword' : 32, 'xmmword' : 16, } BinjaInst = namedtuple('BinjaInst', ['disas', 'addr']) # [start, end) class AddrRange(namedtuple('AddrRange', ['start', 'end'])): def __repr__(self): return f'AddrRange(start={hex(self.start)}, end={hex(self.end)})' VmReg = namedtuple('VmReg', ['zmm', 'idx']) def ex_broadcastq_vmreg(disas): return VmReg(zmm_reg_idx(disas[1]), k_reg_idx(disas[2])) def ex_compressq_vmreg(disas): if 'zmm' not in disas[3]: return VmReg(zmm_reg_idx(disas[4]), k_reg_idx(disas[2])) return VmReg(zmm_reg_idx(disas[3]), k_reg_idx(disas[2])) def ex_general_vmreg(disas): return VmReg(zmm_reg_idx(disas[1]), k_reg_idx(disas[2])) def ex_vpcmpq_vmreg(disas, flip=False): return VmReg(zmm_reg_idx(disas[4 if flip else 3]), k_reg_idx(disas[2])) def get_equiv_cmp_op(disas): if disas[0] == 'vpcmpq': return cmp_suffix[int(disas[5], 16)] else: return equiv_ops[disas[0]] def tokenize(inst): if inst not in mapping: return Inst.OTHER return mapping[inst] # 
https://github.com/intelxed/xed/blob/b53f33e44dd2e5fb2e63f9c0e35c65b72c933dae/src/enc/xed-encode.c#L460 nop_variants = [ bytes([0x90 ]), bytes([0x66, 0x90]), bytes([0x0F, 0x1F, 0x00]), bytes([0x0F, 0x1F, 0x40, 0x00]), bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]), bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]), bytes([0x0F, 0x1F, 0x80, 0x00, 0x00,0x00, 0x00]), bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00,0x00, 0x00]), bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00,0x00, 0x00])] def multi_nop_opt(orig): if len(orig) == 0: return b'' else: idx = max([(len(n),i) for i,n in enumerate(nop_variants) if len(n) <= len(orig)])[1] return nop_variants[idx] + multi_nop_opt(orig[len(nop_variants[idx]):]) class Context: ''' patching done through context - some guidelines if rbp is initialized, then stack is also allocated from our compiler we can replace the whole sequence starting from push rbp with enter new stack alloc else, rbp remains uninitialized we have two patching sequences - nop_patch and patch the patterns of the obfuscator should almost always? emit something that's nop_patchable first so in each nop patch, we check if the range would allow for an enter sequence, and then patch there and replace pop rbp in the end with a leave additionally, if there is never a push rbp, then it is most likely a leaf function with no vars we should be able to just use rsp relative addressing since we also do back patching later on if there are enough space, we have to keep track of k-values whenever they are set or invalidated at their addresses - we use the original address here rather than the adjusted address, and the stack offset querying should follow the same principle ''' def __init__(self, initial_stack_alloc, push_rbp_exists, rbp_initialized, initial_insts, func_addr): self.zmm = {z : {i : 0 for i in range(0, 8)} for z in range(0, 32)} # can map to value none if result of non kmovb self.k = {i : {} for i in range(0, 8)} self.initial_stack_alloc = initial_stack_alloc # the final 16 byte allocation is for the sake of scratch space self.stack_size = self.initial_stack_alloc + 32 * 8 * 8 + 16 self.push_rbp_exists = push_rbp_exists self.rbp_initialized = rbp_initialized self.func_start = initial_insts[0].addr self._scratch30 = None self._scratch31 = None self.push_rbp_addr = initial_insts[0].addr if push_rbp_exists else None if push_rbp_exists and (initial_insts[-2].disas != ['pop', 'rbp'] and initial_insts[-2].disas != ['leave']): raise Exception(f'Function epilogue not seen for function starting at {hex(func_addr)} - adjust Binary Ninja analysis accordingly') self.pop_rbp_addr = initial_insts[-2].addr if push_rbp_exists else None self.rbp_init_range = AddrRange(initial_insts[1].addr, initial_insts[3].addr) if rbp_initialized else None self.pending_writes = [] self.comments = {} # address of known nops, to avoid constantly re-updating analysis self.nops = {i.addr for i in initial_insts} if self.push_rbp_exists and self.rbp_initialized: assert(self.stack_size <= 2 ** 15 - 1) opcode = arch.assemble(f"enter {self.stack_size}, 0") self.__write(opcode, self.push_rbp_addr) self.__nop(self.push_rbp_addr + len(opcode), self.rbp_init_range.end) def commit_patches(self, comment=True): for i,p in enumerate(self.pending_writes): data,addr = p if set(data) != {0x90}: bw.write(data, addr) else: # multi nop optimization to make disas look better curr_real_end = min([a for _,a in self.pending_writes[i+1:] if a > addr], default=len(data) + addr) data = data[:curr_real_end - addr] opt_data = multi_nop_opt(data) assert(len(opt_data) 
== len(data)) bw.write(opt_data, addr) if comment: for addr, c in self.comments.items(): bv.set_comment_at(addr, c) self.pending_writes = [] self.comments = {} def __write(self, data, addr): self.pending_writes.append((data, addr)) def __nop(self, start, end): self.__write(b'\x90' * (end - start), start) self.nops |= {i for i in range(start, end)} def set_k(self, k_str, val, addr): self.k[k_reg_idx(k_str)][addr] = val def invalidate_k(self, k_str, addr): self.k[k_reg_idx(k_str)][addr] = None def __fetch_k_val(self, k_idx, addr): assert(len(self.k[k_idx]) > 0) closest_val = self.k[k_idx][max([a for a in self.k[k_idx].keys() if a < addr])] assert(closest_val is not None) assert(closest_val < 8) return closest_val @property def scratch30(self): assert(self._scratch30 is not None) return self._scratch30 @scratch30.setter def scratch30(self, val): self._scratch30 = val @property def scratch31(self): assert(self._scratch31 is not None) return self._scratch31 @scratch31.setter def scratch31(self, val): self._scratch31 = val def nop_patch(self, addr_range): length = addr_range.end - addr_range.start if self.push_rbp_exists and not self.rbp_initialized: opcode = arch.assemble(f"enter {self.stack_size}, 0") if len(opcode) <= length: self.__nop(self.push_rbp_addr, self.push_rbp_addr + 1) self.__write(opcode, addr_range.start) self.__nop(addr_range.start + len(opcode), addr_range.end) self.__write(arch.assemble("leave"), self.pop_rbp_addr) self.rbp_initialized = True else: self.__nop(addr_range.start, addr_range.end) return True def patch(self, addr_range, asm): # try to shift it up as much as possible start = addr_range.start end = addr_range.end opcodes = arch.assemble(asm) while start > self.func_start: if (start - 1) in self.nops: if len([*bv.get_code_refs(start - 1)]) == 0 or start - 1 == self.func_start: start -= 1 continue break # overflow patch if (len(opcodes) > addr_range.end - start): rem = len(opcodes) - (addr_range.end - start) for i in range(0, rem): if (end + i) not in self.nops and len([*bv.get_code_refs(end + i)]) == 0: return False end += rem self.__write(opcodes, start) # remove that region as nops self.nops -= {start + i for i in range(len(opcodes))} self.__nop(start + len(opcodes), end) self.comments[start] = f'patched from {hex(addr_range.start)} - {hex(addr_range.end)}' return True def get_stack_offset(self, zmm_reg_tuple, addr, stringify=True, debug=False): zmmX, kreg = zmm_reg_tuple if debug: print(hex(addr)) print(zmm_reg_tuple) print(self.__fetch_k_val(kreg, addr)) assert(zmmX < 30) offset = self.initial_stack_alloc + (zmmX * 8 + self.__fetch_k_val(kreg, addr)) * 8 + (0 if self.push_rbp_exists else 16) if not stringify: return offset else: return f'rbp - {offset}' if self.push_rbp_exists else f'rsp - {offset}' nop_handler = lambda range, disas, ctx : True def resolve(x): try: return int(x, 16) except ValueError: sym = symtab[x] assert(len(sym) == 1) return sym.address def get_lea_base(disas): if len(disas) == 5: return resolve(disas[3]) else: return ''.join(disas[3:-1]) read_qword = lambda x : br.read64le(resolve(x)) read_dword = lambda x : br.read32le(resolve(x)) read_word = lambda x : br.read16le(resolve(x)) read_byte = lambda x : br.read8(resolve(x)) read_dqword = lambda x : read_qword(resolve(x)) + (read_qword(resolve(x) + 8) << 64) read_oqword = lambda x : read_dqword(resolve(x)) + (read_dqword(resolve(x) + 16) << 128) read_oqword = lambda x : read_oqword(resolve(x)) + (read_oqword(resolve(x) + 32) << 256) get_bit_idx = lambda x : int(math.log2(x)) def 
handle_k_reg(addr_range, disas, ctx): assert(len(disas) == 1) ctx.set_k(disas[0][1], get_bit_idx(read_qword(disas[0][4])), addr_range.start) return ctx.nop_patch(addr_range) def handle_general_arith(addr_range, disas, ctx): src = ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start) dest = ctx.get_stack_offset(ex_general_vmreg(disas[2]), addr_range.start) asm = '\n'.join([f'mov rax, [{dest}]', f'{equiv_ops[disas[-1][0]]} rax, [{src}]', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_not(addr_range, disas, ctx): src = ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start) dest = ctx.get_stack_offset(ex_broadcastq_vmreg(disas[-1]), addr_range.start) asm = '\n'.join([f'mov rax, [{src}]', 'xor eax, 1', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_neg(addr_range, disas, ctx): src = ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start) dest = ctx.get_stack_offset(ex_general_vmreg(disas[2]), addr_range.start) asm = '\n'.join([f'mov rax, [{src}]', 'neg rax', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_assign(addr_range, disas, ctx): assert(len(disas) == 2) src = ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start) dest = ctx.get_stack_offset(ex_broadcastq_vmreg(disas[1]), addr_range.start) asm = '\n'.join([f'mov rax, [{src}]', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_broadcast(addr_range, disas, ctx): assert(len(disas) == 1) dest = ctx.get_stack_offset(ex_broadcastq_vmreg(disas[0]), addr_range.start) # two variants, memory or register asm = '' if len(disas[0]) > 4: mem = ''.join(disas[0][4:]) asm = '\n'.join([f'mov rax, {mem}', f'mov [{dest}], rax']) else: asm = f'mov [{dest}], {disas[0][3]}' return ctx.patch(addr_range, asm) def handle_arr_idx(addr_range, disas, ctx): arr_base = get_lea_base(disas[0]) arr_idx = ctx.get_stack_offset(ex_compressq_vmreg(disas[1]), addr_range.start) dest = ctx.get_stack_offset(ex_general_vmreg(disas[3]), addr_range.start) asm = '\n'.join([f'mov rax, [{arr_idx}]', f'lea rax, [{arr_base} + 8 * rax]', f'mov rax, [rax]', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_arr_addr(addr_range, disas, ctx): arr_base = get_lea_base(disas[0]) arr_idx = ctx.get_stack_offset(ex_compressq_vmreg(disas[2]), addr_range.start) dest = ctx.get_stack_offset(ex_general_vmreg(disas[-1]), addr_range.start) asm = '\n'.join([f'mov rax, [{arr_idx}]', f'lea rax, [{arr_base} + 8 * rax]', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_load(addr_range, disas, ctx): dest = ctx.get_stack_offset(ex_general_vmreg(disas[-1]), addr_range.start) val = ctx.get_stack_offset(ex_compressq_vmreg(disas[1]), addr_range.start) asm = '\n'.join([f'mov rax, [{val}]', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_store(addr_range, disas, ctx): dest = ctx.get_stack_offset(ex_compressq_vmreg(disas[1]), addr_range.start) val = ctx.get_stack_offset(ex_compressq_vmreg(disas[2]), addr_range.start) asm = '\n'.join([f'mov rcx, [{val}]', f'mov rax, [{dest}]', 'mov qword [rax], rcx']) return ctx.patch(addr_range, asm) def handle_avx_spill_reloads(addr_range, disas, ctx): if disas[0][0] == 'knotb': ctx.invalidate_k(disas[0][1], addr_range.start) return ctx.nop_patch(addr_range) def handle_spill(addr_range, disas, ctx): spill_var = ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start) asm = '' # memory case if disas[1][2] == '[': mem = ''.join(disas[1][2:-1]) asm = '\n'.join([f'mov rax, 
[{spill_var}]', f'mov {mem}, rax']) else: asm = f'mov {disas[1][1]}, [{spill_var}]' return ctx.patch(addr_range, asm) def handle_div_mod(addr_range, disas, ctx, is_mod = False): ctx.invalidate_k(disas[0][1], addr_range.start) ctx.invalidate_k(disas[9][1], addr_range.start) src1 = ctx.get_stack_offset(ex_compressq_vmreg(disas[1]), addr_range.start) src2 = ctx.get_stack_offset(ex_compressq_vmreg(disas[3]), addr_range.start) dest = ctx.get_stack_offset(ex_general_vmreg(disas[15]), addr_range.start) asm = '\n'.join([f'mov rax, [{src1}]', f'mov rdi, [{src2}]', 'cqo', f'idiv rdi', f'mov [{dest}], {"rax" if not is_mod else "rdx"}']) return ctx.patch(addr_range, asm) def handle_risc_cmp(addr_range, disas, ctx): ctx.invalidate_k(disas[2][1], addr_range.start) src1 = ctx.get_stack_offset(ex_vpcmpq_vmreg(disas[2]), addr_range.start) src2 = ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start) dest = ctx.get_stack_offset(ex_general_vmreg(disas[4]), addr_range.start) op = get_equiv_cmp_op(disas[2]) asm = '\n'.join([f'mov rax, [{src1}]', f'mov rdx, [{src2}]', f'cmp rax, rdx', f'{op} al', 'movzx eax, al', f'mov [{dest}], rax']) return ctx.patch(addr_range, asm) def handle_shift(addr_range, disas, ctx): dest = ctx.get_stack_offset(ex_general_vmreg(disas[0]), addr_range.start) asm = f'{equiv_ops[disas[0][0]]} qword [{dest}], {disas[0][4]}' return ctx.patch(addr_range, asm) def handle_shift_var(addr_range, disas, ctx): dest = ctx.get_stack_offset(ex_general_vmreg(disas[2]), addr_range.start) shift = ctx.get_stack_offset(ex_compressq_vmreg(disas[0], addr_range.start)) asm = '\n'.join([f'mov rcx, [{shift}]' f'{equiv_ops[disas[0][0]]} qword [{dest}], {disas[0][4]}']) return ctx.patch(addr_range, asm) def handle_jcc(addr_range, disas, ctx): ctx.invalidate_k(disas[3][1], addr_range.start) special_cond = lambda : disas[0][0] == 'mov' src1 = f'[{ctx.get_stack_offset(ex_compressq_vmreg(disas[0]), addr_range.start)}]' if not special_cond() else '1' src2 = ctx.get_stack_offset(ex_vpcmpq_vmreg(disas[2], flip=False if special_cond() else True), addr_range.start) op = get_equiv_cmp_op(disas[2]) asm = '\n'.join([f'mov rax, {src1}', f'mov rdx, [{src2}]', f'cmp rax, rdx', f'{op} al', 'cmp al, 0',]) return ctx.patch(addr_range, asm) def handle_arr_init(addr_range, disas, ctx): dest = ''.join(disas[0][2:]) src = ''.join(disas[1][2:]) size = 0 for d in disas[2:][::2]: if d[0] == 'vmovq': size += 8 else: size += size_map[d[3].replace(' ', '')] size //= 8 asm = '\n'.join([f'lea rsi, {src}', f'lea rdi, {dest}', f'mov ecx, {size}', 'rep movsq']) return ctx.patch(addr_range, asm) def handle_zarr_init(addr_range, disas, ctx): dest = ''.join(disas[0][2:]) size = 0 for d in disas[2:]: if d[0] == 'vmovq': size += 8 else: size += size_map[d[1].replace(' ', '')] size //= 8 asm = '\n'.join([f'lea rdi, {dest}', 'xor rax, rax', f'mov ecx, {size}', 'rep stosq']) return ctx.patch(addr_range, asm) def match_arr_init_var(disas): if len(disas) >= 2 and ((disas[0] == Inst.VMOVDQU64 and disas[1] == Inst.VMOVDQU64) or (disas[0] == Inst.VMOVQ and disas[1] == Inst.VMOVQ)): return disas[:2] + match_arr_init_var(disas[2:]) return [] def match_zarr_init_var(disas): if len(disas) >= 1 and disas[0] == Inst.VMOVDQU64 or disas[0] == Inst.VMOVQ: return disas[:1] + match_zarr_init_var(disas[1:]) return [] # pattern, deobfuscation handler, name, variadic, variadic matcher (which returns the resulting match list for the variadic portion) # each element that's an array represents variants at that spot match_k_reg_setup = ([Inst.KMOVB], 
handle_k_reg, "k_reg_setup", False, None) match_arith_general = ([Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, [Inst.VPADDQ, Inst.VPMULLQ, Inst.VPANDQ, Inst.VPORQ, Inst.VPXORQ, Inst.VPSUBQ]], handle_general_arith, "arith_general", False, None) match_risc_cmp = ([Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, [Inst.VPCMPQ, Inst.VPCMPGTQ, Inst.VPCMPEQQ], Inst.VPBROADCASTMB2Q, [Inst.VPSRLQ, Inst.VMOVDQU64]], handle_risc_cmp, "risc_cmp", False, None) match_not = ([Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPTERNLOGQ, Inst.VPXORQ, Inst.VPTERNLOGQ, Inst.VPABSQ, Inst.VPANDQ, Inst.VPBROADCASTQ], handle_not, "not", False, None) match_neg = ([Inst.VPXORQ, Inst.VPSUBQ], handle_neg, "neg", False, None) match_div = ([Inst.KMOVB, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPABSQ, Inst.VPABSQ, Inst.VFPCLASSPD, Inst.VFPCLASSPD, Inst.KTESTB, [Inst.JZ, Inst.JE], Inst.VCVTQQ2PD, Inst.VCVTQQ2PD, Inst.VDIVPD, Inst.VPBROADCASTQ, Inst.VCVTTPD2QQ, Inst.JMP, Inst.VMOVQ, Inst.CQO, Inst.VMOVQ, Inst.IDIV, Inst.VPBROADCASTQ, ], handle_div_mod, "div", False, None) match_mod = ([Inst.KMOVB, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPABSQ, Inst.VPABSQ, Inst.VFPCLASSPD, Inst.VFPCLASSPD, Inst.KTESTB, [Inst.JZ, Inst.JE], Inst.VCVTQQ2PD, Inst.VCVTQQ2PD, Inst.VDIVPD, Inst.VPBROADCASTQ, Inst.VCVTTPD2QQ, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPMULLQ, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPSUBQ, Inst.JMP, Inst.VMOVQ, Inst.CQO, Inst.VMOVQ, Inst.IDIV, Inst.VPBROADCASTQ, ], lambda x,y,z : handle_div_mod(x,y,z, is_mod=True), "mod", False, None) match_assign = ([Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ], handle_assign, "assign", False, None) match_jcc = ([[Inst.MOV, Inst.VPCOMPRESSQ], Inst.VPBROADCASTQ, [Inst.VPCMPQ, Inst.VPCMPGTQ, Inst.VPCMPEQQ], Inst.KTESTB], handle_jcc, "jcc", False, None) match_shifts = ([[Inst.VPSLLQ, Inst.VPSRAQ, Inst.VPSRLQ]], handle_shift, "shifts", False, None) match_var_shifts = ([Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, [Inst.VPSLLVQ, Inst.VPSRAVQ, Inst.VPSRLVQ]], handle_shift_var, "var_shifts", False, None) match_broadcast = ([Inst.VPBROADCASTQ], handle_broadcast, "broadcast", False, None) match_spill = ([Inst.VPCOMPRESSQ, Inst.VMOVQ], handle_spill, "spill", False, None) match_stack_spill = ([Inst.SUB, Inst.VMOVDQU64], handle_avx_spill_reloads, "stack_spill", False, None) match_stack_reload = ([Inst.VMOVDQU64, Inst.ADD], handle_avx_spill_reloads, "stack_reload", False, None) match_stack_reload_variant = ([Inst.KNOTB, Inst.VMOVDQU64, Inst.ADD], handle_avx_spill_reloads, "stack_reload_variant", False, None) # array initilization don't need pattern matching, that shouldn't be too bad to recognize match_arr_idx = ([Inst.LEA, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPGATHERQQ], handle_arr_idx, "arr_idx", False, None) match_arr_addr = ([Inst.LEA, Inst.VPBROADCASTQ, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPSLLQ, Inst.VPADDQ], handle_arr_addr, "arr_addr", False, None) match_ld_ptr = ([Inst.XOR, Inst.VPCOMPRESSQ, Inst.VPBROADCASTQ, Inst.VPGATHERQQ], handle_load, "ld_ptr", False, None) match_st_ptr = ([Inst.XOR, Inst.VPCOMPRESSQ, Inst.VPCOMPRESSQ, Inst.VPSCATTERQQ], handle_store, "st_ptr", False, None) match_arr_init_1 = ([Inst.LEA, Inst.LEA, Inst.VMOVDQU64, Inst.VMOVDQU64], handle_arr_init, "handle_arr_init_1", True, match_arr_init_var) match_arr_init_2 = ([Inst.LEA, Inst.LEA, Inst.VMOVQ, Inst.VMOVQ], handle_arr_init, "handle_arr_init_2", True, match_arr_init_var) match_zarr_init1 = ([Inst.LEA, Inst.VPXORQ, Inst.VMOVDQU64], handle_zarr_init, 
"handle_arr_init_1", True, match_zarr_init_var) match_zarr_init2 = ([Inst.LEA, Inst.VPXORQ, Inst.VMOVQ], handle_zarr_init, "handle_arr_init_2", True, match_zarr_init_var) patterns = [ match_k_reg_setup, match_arith_general, match_risc_cmp, match_not, match_neg, match_div, match_mod, match_assign, match_jcc, match_broadcast, match_spill, match_shifts, match_var_shifts, match_stack_reload, match_stack_reload_variant, match_stack_spill, match_arr_idx, match_arr_addr, match_ld_ptr, match_st_ptr, match_arr_init_1, match_arr_init_2, match_zarr_init1, match_zarr_init2, ] def match_helper(disas, pattern, variadic, variadic_handler): if variadic: assert(variadic_handler is not None) assert(len(pattern) > 0 and len(disas) >= len(pattern)) if not variadic: assert(len(disas) == len(pattern)) ind_match = lambda x, y : x == y or (hasattr(y, '__iter__') and any([ind_match(x, _) for _ in y])) disas = [tokenize(d) for d in disas] disas_padded = disas + [None] * (max(len(disas), len(pattern)) - len(disas)) def _match_helper(): for x, y in zip(disas_padded, pattern): if not ind_match(x, y): return yield x result = [*_match_helper()] if not variadic: return (result, len(result) == len(pattern)) else: # initial match if len(result) == len(pattern): return (result + variadic_handler(disas[len(result):]), True) else: return (result, False) # return lists of tuple (address start, address end), pattern lambda, name, and list of disas def identify_matches(insts): results = [] def check_for_matches(rem): # compare to all remaining disas starting from start valid_patterns = [(rem[:len(p[0])] if not p[3] else rem, *p) for p in patterns if len(p[0]) <= len(rem)] return [*filter(lambda r : r[1], [(*match_helper(d, v, var, var_handler), l, n) for d,v,l,n,var,var_handler in valid_patterns])] start = 0 while start < len(insts): stub = [i.disas for i in insts[start:]] # match to all possible patterns left from this start point matches = check_for_matches([s[0] for s in stub]) if len(matches) > 0: maximum = max([len(m[0]) for m in matches]) best_match = [*filter(lambda m : len(m[0]) == maximum, [m for m in matches])] assert(len(best_match) == 1) best_match = best_match[0] end = len(best_match[0]) disas = stub[:end] results.append((AddrRange(insts[start].addr, insts[start + end].addr), best_match[2], best_match[3], disas)) start = start + end else: start += 1 return results # string disas matching partial_match = lambda x, y : all([a == b for a,b in zip(y, x)]) full_match = lambda x, y : len(x) == len(y) and partial_match(x, y) clean_token = lambda t : [str(i) for i in t if any([c for c in str(i) if c != ',' and c != ' '])] def pre_handler_hook(r, handler, n, d, verbose=False): if verbose: print(f'patching {r} to {n} for {d}') assert(r.start < r.end) result = handler(r, d, ctx) if verbose: if not result: print('patch failed') return result bv.begin_undo_actions() for f in FUNCTIONS: try: insts = [BinjaInst(clean_token(inst), addr) for inst,addr in f.instructions] insts = sorted(insts, key=lambda x : x[1]) push_rbp_exists, rbp_initialized, initial_stack_alloc = False, False, 0 try: push_rbp_exists = full_match(insts[0].disas, ['push', 'rbp']) rbp_initialized = full_match(insts[1].disas, ['mov', 'rbp', 'rsp']) if partial_match(insts[2].disas, ['sub', 'rsp']): initial_stack_alloc = int(insts[2].disas[2], 16) except IndexError: pass matches = identify_matches(insts) # only apply on obfuscated functions if len(matches) > 0: ctx = Context(initial_stack_alloc, push_rbp_exists, rbp_initialized, insts, f.start) redos = [*filter(lambda 
m : not pre_handler_hook(*m), [m for m in matches])] # fixed point attempt while len(redos) > 0: old_len = len(redos) new_redos = [*filter(lambda m : not pre_handler_hook(*m), [m for m in redos])] assert(old_len != len(new_redos)) redos = new_redos ctx.commit_patches() except Exception as e: traceback.print_exception(*sys.exc_info()) bv.update_analysis() bv.commit_undo_actions()

Onto a brief tldr of solving the crackme now (there really isn’t anything interesting here), after all the above discussion of VM internals and deobfuscation. It’s basically a standard flag checker: it asks for input, runs checks on it, and uses that input to build a key to decrypt the flag embedded in the binary. An initial check is also done to verify that your CPU has the necessary AVX-512 features along with OS support for saving and restoring the zmm and mask registers.

Here are a few obfuscated functions in Binary Ninja (other mainstream decompilers produce equally bad results, if not just giving up and outputting the asm inlined):




Quite an unreadable mess… definitely very troubling for the type of reversers who spam F5.

I ran my deobfuscation script, with the following results:




Much more readable now! If the initial compatibility check is patched out, this binary can now be run and debugged on x86 systems without AVX-512 support. The overall logic of the crackme was not that hard, as it just involved taking 100 bytes of input, shuffling them into a 10 by 10 matrix with the Blum Blum Shub PRNG, followed by a series of 10 transformations on each of the diagonals before comparing to a xor-encrypted target answer. For the sake of completeness, the following is my solve script and the result of inputting the correct serial:

import ctypes enc_answer = [-0x2152411045521104, -0x2152411048bda34b, -0x2152411045533962, -0x215241104552497b, -0x21524110478627e4, -0x2152411045f40ff3, -0x215241104556c3b3, -0x2150511045520ff3, -0x215241104552026c, -0x21524110455209f7, -0x2152411048bda34a, -0x2152411045504a80, -0x2152411045522252, -0x2152411046c11ff3, -0x2152411045080ff3, -0x2152411045558a1e, -0x2153d91045520ff3, -0x215241104552151c, -0x215241104552026d, -0x2152411045520b6b, -0x215241104553c464, -0x215241104552455d, -0x2152411041276024, -0x2152411045d60ff3, -0x215241104551df4e, -0x2153291045520ff3, -0x2152411045522fac, -0x2152411045520752, -0x2152411045520bfa, -0x21524110455201e3, -0x2152411045524f4c, -0x2152411045030ff3, -0x2152411045e20ff3, -0x21524110455b5120, -0x2153291045520ff3, -0x21524110455203b3, -0x2152411045520b0a, -0x2152411045520b0a, -0x2152411045521bb3, -0x2150f11045520ff3, -0x2152411045d3b0e3, -0x2152411045da0ff3, -0x2152411045570cd0, -0x2150191045520ff3, -0x2152411045521794, -0x2152411045520add, -0x21524110455204d1, -0x2152411045521463, -0x2153c91045520ff3, -0x2152411045563f34, -0x2152411045e00ff3, -0x2152411045502ab3, -0x2150791045520ff3, -0x2152411045521264, -0x215241104552069e, -0x2152411045520a66, -0x21524110455211b3, -0x2150e11045520ff3, -0x215241104557342b, -0x2152411045d60ff3, -0x215241104556992a, -0x2150e91045520ff3, -0x2152411045521264, -0x2152411045520add, -0x2152411045520988, -0x2152411045521a97, -0x2153e91045520ff3, -0x2152411045558a1e, -0x2152411045f80ff3, -0x2152411047e3ede3, -0x2150511045520ff3, -0x21524110455215b7, -0x21524110455204d1, -0x2152411045520a40, -0x21524110455200f7, -0x2153c91045520ff3, -0x215241104557342b, -0x2152411045de0ff3, -0x2152411046ef5114, -0x21524110455248af, -0x215241104552050c, -0x2152411045520170, -0x215241104552049d, -0x2152411045520637, -0x2153811045520ff3, -0x21524110455b5120, -0x2152411045e60ff3, -0x2152411045d9af54, -0x2152411045523296, -0x2152411045546002, -0x2152411045520bfa, -0x2152411045520a40, -0x21524110455201e3, -0x2153e11045520ff3, -0x21524110455009da, -0x2152411045080ff3, -0x2152411044144ef3, -0x215241104552275c, -0x21524110455b5120, -0x2152411048bda353] answer = [ctypes.c_int64(ctypes.c_uint64(_).value ^ 0xdeadbeefbaadf00d).value for _ in enc_answer] charset = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-' bbs_val = 0x31337 def bbs(): m = 11134415137 global bbs_val bbs_val = ctypes.c_int64(bbs_val * bbs_val).value % m # print(bbs_val) return bbs_val rol = lambda val, r_bits, max_bits=64: \ (val << r_bits%max_bits) & (2**max_bits-1) | \ ((val & (2**max_bits-1)) >> (max_bits-(r_bits%max_bits))) ror = lambda val, r_bits, max_bits=64: \ ((val & (2**max_bits-1)) >> r_bits%max_bits) | \ (val << (max_bits-(r_bits%max_bits)) & (2**max_bits-1)) sigma = lambda x : sum([i for i in range(x + 1)]) transforms = [ lambda x : (x + 1) ** 2, lambda x : x ^ 0x0DEFACED, lambda x : (x - 2) ** 3, lambda x : x * 0x0000000FF1CE // 0x1337, lambda x : (x + 3) ** 4, lambda x : ctypes.c_int64(rol(x, 17)).value, lambda x : (x - 4) ** 3, lambda x : ctypes.c_int64(ror(x, 21)).value, lambda x : (x + 5) ** 2, lambda x : sigma(x), ] def reverse_transform (func, target): mapping = {} for c in charset: mapping[func(ord(c))] = c assert(len(mapping) == len(charset)) return mapping[target] array = [[0 for _ in range(10)] for __ in range(10)] for s in range(19): for i in range(10): for j in range(10): if i + j == s: array[i][j] = reverse_transform(transforms[s if s < 10 else 19 - s], answer[i * 10 + j]) shuffle_idx = {} seen = set() while len(seen) < 100: temp = bbs() % 100 while temp in seen: temp 
= bbs() % 100 x = temp // 10 y = temp % 10 shuffle_idx[(x, y)] = len(seen) seen.add(temp) answer = ['_' for _ in range(100)] for i in range(10): for j in range(10): idx = (i, j) answer[shuffle_idx[idx]] = array[i][j] print(''.join(answer))



Solvers of this challenge did not opt for the deobfuscation approach to my knowledge, as the obfuscated code was small enough that it could be statically analyzed with manual willpower or monkeyed around with in a debugger (my least favorite approach to reversing).

In regards to the obfuscator’s development, it did not take very long (only about 2-3 days total between anematode and me, including debugging some of the codegen unit tests). This is in part thanks to 6.035, MIT’s optimizing compiler class for x86 - I had to write yet another compiler for a subset of C in the past semester. Compared to the homerolled compiler I made for last year’s vmquack’s revenge, this one was much more modular, with a clearly defined frontend, a middle-end IR that applied a variety of SSA and non-SSA based optimization passes, and a backend for code generation, effectively giving me my own miniature LLVM.

All I had to do was add a new type of register for my register allocator and emit obfuscated code for all my IR in a new code generation module. This tooling is honestly quite useful, especially now that I can create arbitrarily complex C programs for any imaginary or real architecture and add passes along with new IR instructions in the middle-end for other obfuscation strategies. Spitting out stupidly toxic reversing challenges will be so easy from now on 😎

I hope you have learned something new from this writeup (at least something about AVX-512), and please do let me know if there are any errors or further questions! Thanks once again to anematode for co-authoring this challenge with me (he is the most knowledgeable person I know in regards to x86 and SSE/AVX instructions and performance). A special shoutout must go to TheBadGod on team organizers for first-blooding it, and to TeamItaly and IrisSec for solving it during the CTF.
