Advanced Exploitation of Xen Hypervisor Sysret VM Escape Vulnerability
Hi everyone,
In this new blog post, we share our technical analysis and advanced exploitation of a critical memory corruption vulnerability affecting the Xen hypervisor (CVE-2012-0217), recently discovered by Rafal Wojtczuk and Jan Beulich.
This flaw affects systems with Intel CPU hardware and allows a guest-to-host escape. A local attacker within a guest virtual machine can escape the restricted virtual environment and execute arbitrary code on the host system with the permissions of the most privileged domain ("dom0"), which has direct access to hardware and can manage unprivileged domains ("domU").
If you run a virtualization or cloud infrastructure based on Xen, it is thus highly recommended to upgrade to version 4.1.3 which fixes this critical flaw.
1. Technical Analysis of the Vulnerability
The SYSCALL / SYSRET instructions allow fast transitions between user and kernel land. As specified in the Intel manuals, the SYSCALL instruction jumps to the address stored in the IA32_LSTAR MSR. Under Xen, syscall handling depends on whether the guest is a fully virtualized HVM (Hardware Virtual Machine) guest or a PV (ParaVirtualized) guest:

In the case of an HVM guest, the virtualized OS owns its own IDT and MSRs, and thus does not need the help of Xen to dispatch its interrupts or syscalls.
However, this is different for a PV guest, since the underlying OS knows it is running under Xen. For a PV guest to work, its kernel has to be modified: it can no longer run in ring0, which is reserved for Xen (a 32-bit PV kernel runs in ring1, while a 64-bit PV kernel runs in ring3 alongside user land).
The kernel registers its key structures (GDT, IDT, etc.) with Xen via hypercalls, which are the equivalent of syscalls but from kernel land to hypervisor land.
When a "SYSCALL" is executed in a PV guest, it directly reaches a dispatcher in Xen running in ring0. Xen then determines whether the syscall was issued from user or kernel land in "xen/arch/x86/x86_64/entry.S":
/*
* When entering SYSCALL from kernel mode:
* %rax = hypercall vector
* %rdi, %rsi, %rdx, %r10, %r8, %r9 = hypercall arguments
* %rcx = SYSCALL-saved %rip
* NB. We must move %r10 to %rcx for C function-calling ABI.
*
* When entering SYSCALL from user mode:
* Vector directly to the registered arch.syscall_addr.
*
* Initial work is done by per-CPU stack trampolines. At this point %rsp
* has been initialised to point at the correct Xen stack, and %rsp, %rflags
* and %cs have been saved. All other registers are still to be saved onto
* the stack, starting with %rip, and an appropriate %ss must be saved into
* the space left by the trampoline.
*/
ALIGN
ENTRY(syscall_enter)
sti
movl $FLAT_KERNEL_SS,24(%rsp)
pushq %rcx
pushq $0
movl $TRAP_syscall,4(%rsp)
movq 24(%rsp),%r11 /* Re-load user RFLAGS into %r11 before SAVE_ALL */
SAVE_ALL
GET_CURRENT(%rbx)
movq VCPU_domain(%rbx),%rcx
testb $1,DOMAIN_is_32bit_pv(%rcx)
jnz compat_syscall
testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
jz switch_to_kernel
For a syscall issued from user land, Xen forwards it to the guest kernel through the entry point that the kernel registered with Xen (arch.syscall_addr in the comment above). For a syscall issued from kernel land (a hypercall), Xen dispatches the call itself through its hypercall table in "xen/arch/x86/x86_64/entry.S":
ENTRY(syscall_enter)
sti
movl $FLAT_KERNEL_SS,24(%rsp)
pushq %rcx
pushq $0
movl $TRAP_syscall,4(%rsp)
movq 24(%rsp),%r11 /* Re-load user RFLAGS into %r11 before SAVE_ALL */
SAVE_ALL
GET_CURRENT(%rbx)
movq VCPU_domain(%rbx),%rcx
testb $1,DOMAIN_is_32bit_pv(%rcx)
jnz compat_syscall
testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
jz switch_to_kernel
...
/*hypercall:*/
movq %r10,%rcx
cmpq $NR_hypercalls,%rax
jae bad_hypercall /* exit if the hypercall number is out of range */
...
leaq hypercall_table(%rip),%r10
PERFC_INCR(hypercalls, %rax, %rbx)
callq *(%r10,%rax,8) /* jump to the hypercall handler */
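The bounds check followed by an indirect call through the table can be pictured in C. This is only a minimal sketch and not Xen code: the names dispatch_hypercall, hypercall_fn and sample_add_one, as well as the BAD_HYPERCALL placeholder value, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: one argument in, one result out. */
typedef long (*hypercall_fn)(long arg);

/* Placeholder error value returned on the "bad hypercall" path. */
#define BAD_HYPERCALL (-1L)

/* Hypothetical handler used to populate a table in the usage note. */
long sample_add_one(long arg)
{
    return arg + 1;
}

/*
 * Mirror of the assembly logic: reject numbers that are out of range
 * (cmpq/jae), otherwise call the handler stored at table[nr].
 */
long dispatch_hypercall(const hypercall_fn *table, size_t nr_hypercalls,
                        size_t nr, long arg)
{
    assert(table != NULL);
    if (nr >= nr_hypercalls || table[nr] == NULL)
        return BAD_HYPERCALL;
    return table[nr](arg);
}
```

With a two-entry table { sample_add_one, NULL }, calling dispatch_hypercall(table, 2, 0, 41) goes through the handler, while any number of 2 or above takes the error path.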
When returning from a hypercall, the following code, which returns to kernel land, is eventually reached in "xen/arch/x86/x86_64/entry.S":
/* %rbx: struct vcpu, interrupts disabled */
restore_all_guest:
ASSERT_INTERRUPTS_DISABLED
RESTORE_ALL
testw $TRAP_syscall,4(%rsp)
jz iret_exit_to_guest
addq $8,%rsp
popq %rcx /* RIP */
popq %r11 /* CS */
cmpw $FLAT_USER_CS32,%r11
popq %r11 /* RFLAGS */
popq %rsp /* RSP */
je 1f
sysretq /* 64 bits kernel */
1: sysretl /* 32 bits kernel */
However, as specified in the Intel manuals, there is a specific case where the SYSRET instruction can fail: when the RCX register, which holds the user land RIP return address, does not contain a canonical address, the processor raises a #GP (General Protection fault).
A canonical address is an address located in one of the following ranges:
- 0x0000000000000000 - 0x00007FFFFFFFFFFF
- 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF
In other words, bits 63 through 47 must all be equal. This restriction limits the number of page-table levels to 4, instead of the 6 that would be required to map the entire 64-bit address space.
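The canonical check itself is easy to express in C. The following is a minimal sketch assuming 48 implemented virtual-address bits, as in the ranges above; the function name is_canonical and its vaddr_bits parameter are ours, for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when bit (vaddr_bits - 1) and every bit above it
 * are identical, i.e. the upper bits are a sign extension of the lower ones.
 */
bool is_canonical(uint64_t addr, unsigned vaddr_bits)
{
    assert(vaddr_bits > 1 && vaddr_bits < 64);
    uint64_t top      = addr >> (vaddr_bits - 1);       /* bits 63..(n-1) */
    uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1); /* same width, all set */
    return top == 0 || top == all_ones;
}
```

With vaddr_bits set to 48, 0x00007FFFFFFFFFFF and 0xFFFF800000000000 are accepted, while 0x0000800000000000 is rejected.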
However, on Intel CPUs, the #GP is raised while the CPU is still in ring0, but after the general purpose registers, including RSP, have already been restored to their guest-controlled values.
By controlling these registers, it is possible to influence the hypervisor's behavior and gain code execution in the hypervisor context, escaping the guest context.
2. Triggering the Vulnerability on Xen and Citrix XenServer
The fact that the SYSRET instruction is only executed in the context of a hypercall implies that the SYSCALL instruction must be executed from kernel land. The obvious way to achieve this is to use a kernel module.
In order to trigger the bug, one has to map memory just below the non-canonical region and execute a SYSCALL instruction placed so that the address of the following instruction is non-canonical.
Linux does not normally allow mmap()ing addresses this close to the canonical boundary, such as 0x7FFFFFFFF000. A first trick is thus needed in order to easily map this area (we did not want to mess with the page tables).
Under Linux, the "mmap_region()" function in "mm/mmap.c" is eventually called by the "mmap()" function once all parameters have been validated. Calling this function directly with the right parameters allows us to map the page at 0x7FFFFFFFF000:
void *mapping;
unsigned int vm_flags;
vm_flags = VM_READ | VM_WRITE | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
mapping = p_mmap_region(NULL, 0x7ffffffff000, 0x1000,
MAP_ANON|MAP_PRIVATE|MAP_FIXED|MAP_POPULATE, vm_flags, 0);
As stated in the mmap manual, MAP_POPULATE populates the page tables for the mapping up front, instead of waiting for a page fault to occur. In the kernel module context, accessing the mapped page without this option leads to a failure.
By mmap()ing the address 0x7FFFFFFFF000, the kernel allocates one page (0x1000 bytes), so the allocated address range extends from 0x7FFFFFFFF000 to 0x7FFFFFFFFFFF.
Since the SYSCALL opcode (0F 05) is 2 bytes long, placing this instruction at 0x7FFFFFFFFFFE triggers the bug:
Address
7FFFFFFFF000   ...
7FFFFFFFFFFE   SYSCALL
800000000000   Non-canonical address
After executing the SYSCALL instruction, the return address will be 0x800000000000, which is non-canonical. Executing the SYSRET instruction then triggers the #GP with ring0 privileges and attacker-controlled registers.
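The address arithmetic can be checked with a few lines of C operating on an ordinary buffer. This is a sketch of the layout only, not of the kernel module; the helper names place_syscall_at_page_end and syscall_return_address are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   0x1000u
#define SYSCALL_LEN 2u      /* SYSCALL encodes as the two bytes 0F 05 */

/*
 * Write the SYSCALL opcode into the last two bytes of a page-sized buffer
 * and return the offset at which it was placed (0xFFE).
 */
size_t place_syscall_at_page_end(uint8_t *page)
{
    assert(page != NULL);
    size_t off = PAGE_SIZE - SYSCALL_LEN;
    page[off]     = 0x0F;
    page[off + 1] = 0x05;
    return off;
}

/* Address saved in RCX by SYSCALL: the instruction that follows it. */
uint64_t syscall_return_address(uint64_t page_base, size_t off)
{
    return page_base + off + SYSCALL_LEN;
}
```

For a page based at 0x7FFFFFFFF000, the opcode lands at offset 0xFFE and the computed return address is 0x800000000000, the first non-canonical address.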
3. Advanced Exploitation on Xen and Citrix XenServer
Exploitation has been achieved under a 64-bit Linux PV guest running on Citrix XenServer 6.0.0 with Xen version 4.1.1. The method will work on other versions as well.
3.1 - Rewriting the trap frame return address
Since the guest is not fully virtualized, using the "sidt" instruction actually returns the address of the Xen IDT, so smashing the Xen IDT allows arbitrary code execution. While the exploitation is local, it is comparable to a remote exploitation in that one does not know how to repair the IDT after achieving code execution. Even when using preserved IDT entries such as Divide Error and scanning forward in order to reconstruct the IDT, in many cases more than the IDT was mangled, and reconstruction proved close to impossible. Thus, this technique works perfectly on some Xen versions but may fail on others.
A more reliable method is thus needed in order to exploit this vulnerability without knowing the Xen address space.
I had the opportunity to talk to Rafal Wojtczuk (who discovered this vulnerability) after his BlackHat presentation about this flaw. He actually came up with a clever idea which does not rely on any hardcoded address but requires digging a little into the internals of Xen exception handling.
The exploitation method relies on the way Xen retrieves the "current" variable, which is a pointer to a vCPU (Virtual CPU) structure holding information about the virtual machine state: general purpose registers, virtual IDT, page tables, etc.
Xen retrieves the current_vcpu pointer from the address of the bottom of the stack, thanks to the "get_current()" macro:
struct cpu_info {
struct cpu_user_regs guest_cpu_user_regs;
unsigned int processor_id;
struct vcpu *current_vcpu;
unsigned long per_cpu_offset;
#ifdef __x86_64__ /* get_stack_bottom() must be 16-byte aligned */
unsigned long __pad_for_stack_bottom;
#endif
};
static inline struct cpu_info *get_cpu_info(void)
{
struct cpu_info *cpu_info;
__asm__ ( "and %%"__OP"sp,%0; or %2,%0"
: "=r" (cpu_info)
: "0" (~(STACK_SIZE-1)), "i" (STACK_SIZE-sizeof(struct cpu_info))
);
return cpu_info;
}
#define get_current() (get_cpu_info()->current_vcpu)
#define set_current(vcpu) (get_cpu_info()->current_vcpu = (vcpu))
#define current (get_current())
When dispatching the #GP exception, the following code is reached in "xen/arch/x86/x86_64/entry.S":
/* No special register assumptions. */
ENTRY(handle_exception)
SAVE_ALL
handle_exception_saved:
testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
jz exception_with_ints_disabled
sti
1: movq %rsp,%rdi
movl UREGS_entry_vector(%rsp),%eax
leaq exception_table(%rip),%rdx
GET_CURRENT(%rbx) /* load current_vcpu from the cpu_info structure */
PERFC_INCR(exceptions, %rax, %rbx)
callq *(%rdx,%rax,8) /* call exception handler */
When entering the "do_general_protection()" function in "xen/arch/x86/traps.c", several checks are performed:
asmlinkage void
do_general_protection(struct cpu_user_regs *regs)
{
struct vcpu *v = current;
unsigned long fixup;
DEBUGGER_trap_entry(TRAP_gp_fault, regs);
if ( regs->error_code & 1 )
goto hardware_gp;
if ( !guest_mode(regs) )
goto gp_in_kernel;
...
gp_in_kernel:
...
DEBUGGER_trap_fatal(TRAP_gp_fault, regs);
hardware_gp:
show_execution_state(regs);
panic("GENERAL PROTECTION FAULT\n[error_code=%04x]\n", regs->error_code);
Under normal circumstances, the #GP handler ends in a panic, which is not the desired behavior. However, the guest_mode macro can be tricked into thinking the #GP was actually triggered from ring3:
#define guest_mode(r)                                                         \
({                                                                            \
    unsigned long diff = (char *)guest_cpu_user_regs() - (char *)(r);         \
    /* Frame pointer must point into current CPU stack. */                    \
    ASSERT(diff < STACK_SIZE);                                                \
    /* If not a guest frame, it must be a hypervisor frame. */                \
    ASSERT((diff == 0) || (!vm86_mode(r) && (r->cs == __HYPERVISOR_CS)));     \
    /* Return TRUE if it's a guest frame. */                                  \
    (diff == 0);                                                              \
})
The "guest_cpu_user_regs()" macro is defined as follows in "xen/include/asm-x86/current.h":
#define guest_cpu_user_regs() (&get_cpu_info()->guest_cpu_user_regs)
If the diff variable can be forced to zero, the check passes and the exception is treated as coming from ring3, which causes a different code path to be followed.
GET_CURRENT and "guest_cpu_user_regs()" respectively compile to the following assembly:
mov RBX, 0xFFFFFFFFFFFF8000
and RBX, RSP
or RBX, 0x7FE8
mov RBX, [RBX]
and:
mov RAX, 0xFFFFFFFFFFFF8000
and RAX, RSP
or RAX, 0x7F18
Both addresses are derived from RSP alone. Making them point to attacker-controlled data is therefore easily achievable, since RSP can be set to a user-controlled address whose surrounding memory is filled with controlled values.
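The two computations can be replayed in C to see which addresses Xen will dereference for a given RSP. This is a sketch under stated assumptions: the constants below are read off the disassembly shown in this post and are specific to that build, and the function names are ours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build-specific values inferred from the disassembly (assumptions). */
#define STACK_SIZE      0x8000ULL /* from the 0xFFFFFFFFFFFF8000 mask */
#define CPU_INFO_OFFSET 0x7F18ULL /* STACK_SIZE - sizeof(struct cpu_info) */
#define CURRENT_OFFSET  0x7FE8ULL /* slot holding the current_vcpu pointer */

/* Address that guest_cpu_user_regs() evaluates to for a given RSP. */
uint64_t guest_regs_addr(uint64_t rsp)
{
    return (rsp & ~(STACK_SIZE - 1)) | CPU_INFO_OFFSET;
}

/* Address from which GET_CURRENT loads the vcpu pointer. */
uint64_t current_slot_addr(uint64_t rsp)
{
    return (rsp & ~(STACK_SIZE - 1)) | CURRENT_OFFSET;
}

/*
 * Same test as guest_mode(r): the frame is treated as a guest frame only
 * when it sits exactly at guest_cpu_user_regs(). The assert mirrors the
 * first ASSERT of the macro, so callers must pass a frame located at or
 * below that address within the same 0x8000-byte stack.
 */
bool frame_passes_guest_mode(uint64_t regs)
{
    uint64_t diff = guest_regs_addr(regs) - regs;
    assert(diff < STACK_SIZE);
    return diff == 0;
}
```

With these values, a frame at 0x7FFFFFFF7F18 is accepted as a guest frame, and the vcpu pointer is then fetched from 0x7FFFFFFF7FE8, inside the same attacker-mapped region.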
The exception is considered to be coming from ring3 and thus the following code path is executed:
if ( (regs->error_code & 3) == 2 )
{
...
}
else if ( is_pv_32on64_vcpu(v) && regs->error_code )
{
...
}
/* Emulate some simple privileged and I/O instructions. */
if ( (regs->error_code == 0) &&
emulate_privileged_op(regs) )
{
trace_trap_one_addr(TRC_PV_EMULATE_PRIVOP, regs->eip);
return;
}
Eventually "emulate_privileged_op()" is called in "xen/arch/x86/traps.c", but it generates a #PF in the "read_descriptor()" helper function on the following piece of code:
static int read_descriptor(unsigned int sel,
const struct vcpu *v,
const struct cpu_user_regs * regs,
unsigned long *base,
unsigned long *limit,
unsigned int *ar,
unsigned int vm86attr)
{
struct desc_struct desc;
if ( !vm86_mode(regs) )
{
if ( sel < 4)
desc.b = desc.a = 0;
else if ( __get_user(desc,
(const struct desc_struct *)(!(sel & 4)
? GDT_VIRT_START(v)
: LDT_VIRT_START(v))
+ (sel >> 3)) )
return 0;
The "__get_user()" macro actually uses the segment selector as an index into the kernel virtual GDT / LDT, which causes a #PF on an invalid read.
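The way "read_descriptor()" turns a selector into a memory address can be sketched in C as follows. The helper names are hypothetical, and the table bases are plain parameters here rather than the GDT_VIRT_START / LDT_VIRT_START macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_SIZE 8u /* a descriptor entry (two 32-bit words) is 8 bytes */

/* Requested privilege level: bits 1..0 of the selector. */
unsigned sel_rpl(uint16_t sel)
{
    return sel & 3u;
}

/* Table indicator: bit 2 set selects the LDT, clear selects the GDT. */
bool sel_is_ldt(uint16_t sel)
{
    return (sel & 4u) != 0;
}

/* Descriptor index: bits 15..3 of the selector. */
unsigned sel_index(uint16_t sel)
{
    return sel >> 3;
}

/* Address of the descriptor that would be read for this selector. */
uint64_t descriptor_addr(uint64_t gdt_base, uint64_t ldt_base, uint16_t sel)
{
    assert(sel >= 4); /* selectors below 4 are handled without any read */
    uint64_t base = sel_is_ldt(sel) ? ldt_base : gdt_base;
    return base + (uint64_t)sel_index(sel) * DESC_SIZE;
}
```

For example, selector 0x33 decodes to GDT entry 6 with RPL 3, so the read happens 0x30 bytes past the GDT base; if that address is not mapped, the access faults.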
The "do_page_fault()" exception handler is then reached in "xen/arch/x86/traps.c":
asmlinkage void
do_page_fault(struct cpu_user_regs *regs)
{
unsigned long addr, fixup;
unsigned int error_code;
addr = read_cr2();
/* fixup_page_fault() might change regs->error_code, so cache it here. */
error_code = regs->error_code;
DEBUGGER_trap_entry(TRAP_page_fault, regs);
perfc_incr(page_faults);
if ( unlikely(fixup_page_fault(addr, regs) != 0) )
return;
if ( unlikely(!guest_mode(regs)) )
{
if ( spurious_page_fault(addr, error_code) )
return;
Since RSP has moved because of the frame pushed on the stack by the nested exception, regs no longer coincides with "guest_cpu_user_regs()": the guest_mode check now fails, and the "spurious_page_fault()" function is called:
static int
__spurious_page_fault(unsigned long addr, unsigned int error_code)
{
unsigned long mfn, cr3 = read_cr3();
#if CONFIG_PAGING_LEVELS >= 4
l4_pgentry_t l4e, *l4t;
#endif
#if CONFIG_PAGING_LEVELS >= 3
l3_pgentry_t l3e, *l3t;
#endif
l2_pgentry_t l2e, *l2t;
l1_pgentry_t l1e, *l1t;
unsigned int required_flags, disallowed_flags;
/*
* We do not take spurious page faults in IRQ handlers as we do not
* modify page tables in IRQ context. We therefore bail here because
* map_domain_page() is not IRQ-safe.
*/
if ( in_irq() )
return 0;
...
return 1;
The "in_irq()" check must evaluate to false, because if the "spurious_page_fault()" function returns 0, a panic is triggered. However, the data read by "in_irq()" is user-controlled, as the following assembly shows:
0xffff82c4801fe2bc <+4>:  mov rax,0xffffffffffff8000
0xffff82c4801fe2c3 <+11>: lea rcx,[rip+0x921b6] # 0xffff82c480290480 <irq_stat>
0xffff82c4801fe2ca <+18>: and rax,rsp
0xffff82c4801fe2cd <+21>: or rax,0x7f18
0xffff82c4801fe2d3 <+27>: mov edx,DWORD PTR [rax+0xc8]
0xffff82c4801fe2d9 <+33>: xor eax,eax
0xffff82c4801fe2db <+35>: shl rdx,0x7
0xffff82c4801fe2df <+39>: cmp DWORD PTR [rcx+rdx*1+0x8],0x0
Here RAX holds the address of the cpu_info structure, which is derived from RSP and is therefore under our control. The dword at offset 0xc8 (the processor_id field, located right before current_vcpu) is used as an index into the irq_stat array. By providing an index of 0, the comparison succeeds and the check is bypassed; the function eventually returns 1 and leaves the "do_page_fault()" handler.
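The lookup performed by this assembly can be modelled in C over a plain byte array. This is a sketch only: the offsets are copied from the disassembly of this particular build, and the names index_field_addr and in_irq_sim are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets taken from the disassembly; assumptions valid for that build. */
#define STACK_MASK       0xFFFFFFFFFFFF8000ULL
#define CPU_INFO_OFFSET  0x7F18ULL
#define INDEX_FIELD_OFF  0xC8ULL /* mov edx,[rax+0xc8] */
#define IRQ_STAT_SHIFT   7       /* shl rdx,0x7: 128-byte entries */
#define IRQ_COUNTER_OFF  0x8u    /* field compared against zero */

/* Address from which the array index is read, for a given RSP. */
uint64_t index_field_addr(uint64_t rsp)
{
    return ((rsp & STACK_MASK) | CPU_INFO_OFFSET) + INDEX_FIELD_OFF;
}

/*
 * Model of the final cmp: returns 1 when the counter of entry `index`
 * in the (simulated) irq_stat table is non-zero, 0 otherwise.
 */
int in_irq_sim(const uint8_t *irq_stat, size_t table_len, uint32_t index)
{
    size_t off = ((size_t)index << IRQ_STAT_SHIFT) + IRQ_COUNTER_OFF;
    uint32_t counter;

    assert(off + sizeof(counter) <= table_len);
    memcpy(&counter, irq_stat + off, sizeof(counter));
    return counter != 0;
}
```

An index of 0 selects the first entry of the table; as long as its counter is zero, the simulated check returns 0 and the early "return 0" in the handler is not taken.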
The code flow thus returns to the instruction right after the faulting instruction in "read_descriptor()" which triggered the #PF. The following instructions lead to a return 0.
The return value of "read_descriptor()" is then checked in "emulate_privileged_op()":
if ( !read_descriptor(regs->cs, v, regs,
&code_base, &code_limit, &ar,
_SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
goto fail;
...
fail:
return 0;
The "emulate_privileged_op()" function thus returns 0 and the if condition is not satisfied, so "do_guest_trap()" is reached, which is where things get interesting:
/* Emulate some simple privileged and I/O instructions. */
if ( (regs->error_code == 0) &&
emulate_privileged_op(regs) )
{
trace_trap_one_addr(TRC_PV_EMULATE_PRIVOP, regs->eip);
return;
}
/* Pass on GPF as is. */
do_guest_trap(TRAP_gp_fault, regs, 1);
The "do_guest_trap()" function is responsible for propagating the #GP fault to the exception handler of the guest kernel. Since the hypervisor has been tricked into thinking that the exception was issued from ring3, it seems legitimate to transfer its handling to the kernel:
static void do_guest_trap(int trapnr, const struct cpu_user_regs *regs, int use_error_code)
{
struct vcpu *v = current;
struct trap_bounce *tb;
const struct trap_info *ti;
trace_pv_trap(trapnr, regs->eip, use_error_code, regs->error_code);
tb = &v->arch.trap_bounce;
ti = &v->arch.guest_context.trap_ctxt[trapnr];
tb->flags = TBF_EXCEPTION;
tb->cs = ti->cs;
tb->eip = ti->address;
if ( use_error_code )
{
tb->flags |= TBF_EXCEPTION_ERRCODE;
tb->error_code = regs->error_code;
}
if ( TI_GET_IF(ti) )
tb->flags |= TBF_INTERRUPT;
if ( unlikely(null_trap_bounce(v, tb)) )
gdprintk(XENLOG_WARNING, "Unhandled %s fault/trap [#%d] "
"on VCPU %d [ec=%04x]\n",
trapstr(trapnr), trapnr, v->vcpu_id, regs->error_code);
}
Here is where the magic happens. The current pointer is again used, this time to locate the trap_bounce structure. Both tb and ti are located at fixed offsets inside the arch field of the structure that current points to. Since current is attacker-controlled, one can easily supply a pointer such that tb->cs and tb->eip overlap respectively with the saved CS and RIP of the instruction after the SYSRET instruction which generated the #GP fault.
The "do_general_protection()" function will then return and eventually the following assembly code will be executed:
/* %rbx: struct vcpu, interrupts disabled */
restore_all_guest:
ASSERT_INTERRUPTS_DISABLED
RESTORE_ALL
testw $TRAP_syscall,4(%rsp)
jz iret_exit_to_guest
...
/* No special register assumptions. */
iret_exit_to_guest:
addq $8,%rsp
.Lft0: iretq
When the IRETQ instruction is reached, it will pop the user-controlled segment selector and perform a far jump to the user-controlled return address.
As stated earlier, the structures are located at user-controlled locations, so no hardcoded address is required; only offsets within structures are used.
Thus reliable code execution is achieved with ring0 privileges.
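For reference, the frame that IRETQ consumes in 64-bit mode consists of five 8-byte slots. The sketch below only models this architectural layout in C; the struct and function names are ours, and the values used in the usage note are arbitrary placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack frame popped by IRETQ in 64-bit mode, lowest address first. */
struct iretq_frame {
    uint64_t rip;    /* return address */
    uint64_t cs;     /* code segment selector (low 16 bits used) */
    uint64_t rflags;
    uint64_t rsp;    /* stack pointer restored on return */
    uint64_t ss;     /* stack segment selector (low 16 bits used) */
};

/* Fill a frame with the given return target, selectors, flags and stack. */
void build_iretq_frame(struct iretq_frame *f, uint64_t rip, uint16_t cs,
                       uint64_t rflags, uint64_t rsp, uint16_t ss)
{
    assert(f != NULL);
    f->rip    = rip;
    f->cs     = cs;
    f->rflags = rflags;
    f->rsp    = rsp;
    f->ss     = ss;
}
```

Whoever controls the first two slots of this 40-byte frame decides both the selector and the address at which execution resumes, which is why overlapping tb->cs and tb->eip with the saved CS and RIP is sufficient.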
3.2 - Code Execution in Host Context "dom0"
Since the exploit is performed from kernel land, the attacker is assumed to already be root on the guest virtual machine (legitimately or through a privilege escalation exploit). This sysret vulnerability can then be exploited to target the dom0 virtual machine, which is the most privileged domain and the only virtual machine which by default has direct access to hardware. From dom0, the hypervisor can be managed and unprivileged domains (domU) can be launched.
Under Citrix XenServer, dom0 is a 32-bit virtual machine; this configuration is used for performance reasons.
The strategy here is to inject a bindshell (or reverse shell) payload into a dom0 root process in order to get a root shell on dom0. Trying to map dom0 pages directly by playing with page tables is not an easy task.
The same idea as in remote kernel exploitation is used: hijack the interrupt 0x80 syscall handler and wait for an interrupt from dom0 to occur. When an interrupt is triggered from dom0, one is assured that dom0 virtual pages are mapped into memory.
Xen is mapped at a static location in all the virtualized kernels, in the same way that address 0xc0000000 is mapped in all 32-bit user land processes.
Since all Xen memory is mapped as RWX from the hypervisor point of view, one just has to write a new int 0x80 handler to an unused Xen location and overwrite the Xen 0x80 entry in the IDT.
The 1st stage handler performs the following actions:

If a user land syscall is performed from dom0, the process context is saved and replaced by arguments performing an "mmap()" syscall with RWX permissions. The process EIP is set to return on an "int 0x80" instruction. The original "int 0x80" handler is then called.
After the syscall has successfully been executed, the process returns to the "int 0x80" instruction and performs a syscall with EAX holding the address of the mmap()ed memory. The 2nd stage "int 0x80" handler is then reached:

The stage0 shellcode will "fork()" the original process and set the parent process EIP again to the "int 0x80" instruction, this time with the original parameters, so that the parent process smoothly continues its execution without any fault.
The child eventually executes a classic ring3 shellcode and gives us the key to the kingdom0!
© Copyright VUPEN Security