CCS'21: research paper

General topic

reverse engineering, deobfuscation, artificial intelligence, program synthesis

Motivation

Lets consider that you are reverse engineering a program (e.g. a malware). If you are unlucky (and you will probably be) this code snippet will be obfuscated i.e. translated to an equivalent complex program to make analysis harder. So how can you understand it ? You could try to simplify the code by hand but this is error prone and very time consuming. Thus, you may leverage automated symbolic-based methods. However, some tricky local code snippets can be so complex that analysis will fail or return incomprehensible information. This is where blackbox deobfuscation (introduced by Blazytko et al.) comes into play: it simplifies local code fragments that have been highly obfuscated and returns understandable version of them.

Example. Consider the following code snippet (written in C instead of assembly for better readability):

int a = 2*(x | 2*y) - (x ^ 2*y) - x - y;
int cond = 2*(x | y) - (~x & a) - (x & ~y);
if ( cond == 5 ) {
    ...
} else {
    ...
}

A natural question is: when is the condition verified or not ? Using usual symbolic methods, you will retrieve that the condition is verified if and only if (2*(x | y) - (~x & (2 * (x | 2*y) - (x ^ (2 * y)) - x - y)) - (x & ~y) == 5). This is not really helpful. However, if we apply blackbox deobfuscation we will easly found out that the condition is verified if and only if: x + y == 5. This is a lot more understandable.

Blackbox deobfuscation

But how does it work ? Blackbox deobfuscation relies on observed inputs/outputs (I/O) relations to synthesize an expression which mimics observed behaviors. For example, to synthesize the previous condition, we compute cond for distinct value of x and y. Thus, we get the following I/O relations:

x	y	cond
0	0	0
1	4	5
10	40	50
-5	20	15
0	10	10
...

From these I/O samples, a blackbox deobfuscator synthesizes a coherent expression. In our example, it would infer that cond = x + y and we can conclude that the condition is x + y == 5.

Goal. Performing efficient synthesis is crucial for blackbox deobfuscation. It must return correct and simple results while being fast and able to handle a wide variety of distinct expressions. Thus, the goal of the paper is to study how to perform efficient synthesis for blackbox deobfuscation. This leads us to discuss limits of the approach and propose the first anti-blackbox deobfuscation protections.

Major Contributions

In summary, this paper makes the following major contributions:

We extend the evaluation of the state-of-the-art blackbox deobfuscator Syntia, relying on Monte Carlo Tree Search (MCTS) for the synthesis, and highlight new strenghts and weaknesses of Syntia;
We promote the use of S-metaheuristics in place of MCTS for efficient blackbox deobfuscation. We implement our approach in a tool dubbed Xyntia and show that it highly outperforms Syntia, being faster and handling more complex expressions.
Finaly, we propose the first protections against blackbox deobfuscation.

Further information

Read the paper
To appear at the ACM Conference on Computer and Communications Security (CCS) 2021
Try (Xyntia) and get our (benchmarks)