lpcvoid's homepage - Dejunking and reconstructing bytecode

Dejunking and reconstructing bytecode

Published on: 2020-11-08

In my last blog post, here, I figured that the Python bytecode could be obfuscated somewhat. Turns out it is, and this post will be a run down of my findings and the problems I encountered. It is a work in progress, so there could be updates at some point. I'll change the title in that case.

This post concerns the obsolete 2.7 version of Python, since that's what the target application embeds.

Python bytecode is pretty simple. You have one byte of opcode, followed by two optional bytes of argument (that's what cPython, the reference implementation, calls it). A list of opcodes is contained in opcode.h. Opcodes that are smaller than 90 do not expect any argument, and are thus only one byte in size.

There exist disassemblers and even decompilers for this bytecode, and they work well for standard bytecode that was not messed with. The one I am looking at has two obfusications (that I am aware of at this point) though. Let's look at them.

Insertion of junk instructions to confuse disassemblers

Normally, a disassembler would look at a set of instructions and decode them into some sort of text representation sequentially. For example, take the following bytecode:

6401006400006c00007d0000

This decodes to:

LOAD_CONST      1 (-1)
LOAD_CONST      0 (None)
IMPORT_NAME     0 (sys)
STORE_FAST      0 (sys)

Which is simple enough. The disassembler relies on the bytecode being correct though for a number of factors. For one, it decides how many bytes of argument to read by looking at the integer value of the opcode. If it's under 90, it treats the bytes after the opcode as a new opcode and not as argument data. This means that it's possible to write a program that can stop disassemblers by taking bytecode, inserting junk instructions or random bytes into it at some place, and then recalculate jumps and branches to account for this. The VM does not care much when executing this manipulated bytecode, as long as the obfuscator took care to direct it around the junk using jumps or other means of control flow. But a disassembler that reads code top down without dynamic analysis will run into these junk areas and produce completely wrong output, or even crash.

So, what can we do against this? I decided to write a small program which takes bytecode and recursively traces it, running along all branches and marking code paths which can actually be executed. This is easy because there aren't that many branching opcodes. In fact, these are the ones I identified:

JUMP_FORWARD            110
JUMP_IF_FALSE_OR_POP    111
JUMP_IF_TRUE_OR_POP     112
JUMP_ABSOLUTE           113
POP_JUMP_IF_FALSE       114
POP_JUMP_IF_TRUE        115
FOR_ITER                93

All non executed areas of code then get a nop (0x9), so the disassembler does not stumble over junk bytes anymore. This is simple, but actually does not solve another problem: What happens when the junk is located in an actually executable path? Most of the time, it is located behind a JUMP_ABSOLUTE instruction, and that's easy enough to detect with the static algorithm I wrote. Consider this though:

LOAD_CONST         3
POP_JUMP_IF_TRUE   42 
...junk bytes here...
IMPORT_NAME        1  #address 42

This code pushes the const at index 3 onto the stack. POP_JUMP_IF_TRUE checks the top of the stack, and jumps to the absolute bytecode offset if it is (after popping the top). In this case, as long as the const array has a true value at offset 3 (which is easy to guarantee), it will always jump, thus being equivalent to JUMP_ABSOLUTE. This is something my small static analyzer does not know though currently. I need to teach it to read const values like that and evaluate those jumps, then it can skip those junk instructions too.

Insertion of junk constants and local names

Another thing that this obfuscator does is insert random strings into both the constant and local names. It then goes through the bytecode and increments all LOAD_CONST and LOAD_NAME opcode arguments by the number of junk strings added to the respective arrays. This is a simple but effective obfuscation. Fortunately, it is also simple to revert if you can identify junk locals and global names by some shared attribute. In this case, the junk names include at least one space character.

Open questions

I wonder why they did not obfuscate opcodes, or even change the opcode mapping. That would have been a royal pain, since then I would have needed to figure out what function maps to what integer value. Or, why did they not obfuscate arguments or length of argument bytes? This too would have taken some time to reverse. Writing a junk code insertion routine is a lot more challenging than changing a bit of cPython code. So why did they not go for these low hanging fruit? Maybe I do not know the Python VM enough to understand this. If you know or want to wager a guess, I would be interested to hear.