VUPEN Vulnerability Research Team (VRT) Blog

Advanced Exploitation of Xen Hypervisor Sysret VM Escape Vulnerability
Published on 2012-09-04 17:11:48 UTC by Jordan Gruskovnjak, Security Researcher @ VUPEN


Hi everyone,

In this new blog post, we share our technical analysis and advanced exploitation of a critical memory corruption vulnerability affecting the Xen hypervisor (CVE-2012-0217), recently discovered by Rafal Wojtczuk and Jan Beulich.

This flaw affects systems with Intel CPU hardware and allows a Guest-to-Host escape. A local attacker within a guest virtual machine can escape the restricted virtual environment and execute arbitrary code on the host system with the permissions of the most privileged domain ("dom0"), which has direct access to hardware and can manage unprivileged domains ("domU").

If you run a virtualization or cloud infrastructure based on Xen, it is thus highly recommended to upgrade to version 4.1.3 which fixes this critical flaw.

1. Technical Analysis of the Vulnerability

The SYSCALL / SYSRET instructions allow fast transitions between user land and kernel land. As specified in the Intel manuals, the SYSCALL instruction jumps to the address stored in the IA32_LSTAR MSR. Under Xen, syscall handling depends on whether the guest is a fully virtualized HVM (Hardware Virtual Machine) guest or a PV (ParaVirtualized) guest:

In the case of an HVM guest, the virtualized OS owns its own IDT and MSRs, and thus does not need the help of Xen to dispatch its interrupts or syscalls.

This is different with a PV guest, where the underlying OS knows it is being run under Xen. In order for a PV guest to work, its kernel has to be modified: it no longer runs in ring0, which is reserved for Xen, but in a less privileged ring (ring1 on 32-bit x86, ring3 on x86-64).

The kernel registers its key structures such as the GDT, IDT, etc. with Xen via hypercalls, which are the equivalent of syscalls but from kernel land to hypervisor land.

When a "SYSCALL" is performed in the guest, it directly reaches a dispatcher in Xen, in ring0. Xen then determines whether the syscall was issued from user land or kernel land in "xen/arch/x86/x86_64/entry.S":

 /*
  * When entering SYSCALL from kernel mode:
  *  %rax                             = hypercall vector
  *  %rdi, %rsi, %rdx, %r10, %r8, %r9 = hypercall arguments
  *  %rcx                             = SYSCALL-saved %rip
  *  NB. We must move %r10 to %rcx for C function-calling ABI.
  * When entering SYSCALL from user mode:
  *  Vector directly to the registered arch.syscall_addr.
  * Initial work is done by per-CPU stack trampolines. At this point %rsp
  * has been initialised to point at the correct Xen stack, and %rsp, %rflags
  * and %cs have been saved. All other registers are still to be saved onto
  * the stack, starting with %rip, and an appropriate %ss must be saved into
  * the space left by the trampoline.
  */
        movl  $FLAT_KERNEL_SS,24(%rsp)
        pushq %rcx
        pushq $0
        movl  $TRAP_syscall,4(%rsp)
        movq  24(%rsp),%r11      /* Re-load user RFLAGS into %r11 before SAVE_ALL */
        movq  VCPU_domain(%rbx),%rcx
        testb $1,DOMAIN_is_32bit_pv(%rcx)
        jnz   compat_syscall
        testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
        jz    switch_to_kernel


In the case of a syscall issued from user land, Xen forwards it to the guest kernel through the entry point that the kernel registered with Xen (arch.syscall_addr). In the case of a syscall issued from kernel land (a hypercall), Xen dispatches the call itself through its hypercall table:

        movl  $FLAT_KERNEL_SS,24(%rsp)
        pushq %rcx
        pushq $0
        movl  $TRAP_syscall,4(%rsp)
        movq  24(%rsp),%r11 /* Re-load user RFLAGS into %r11 before SAVE_ALL */
        movq  VCPU_domain(%rbx),%rcx
        testb $1,DOMAIN_is_32bit_pv(%rcx)
        jnz   compat_syscall
        testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
        jz    switch_to_kernel

        movq  %r10,%rcx
        cmpq  $NR_hypercalls,%rax
        jae   bad_hypercall              /* exit in case of wrong hypercall number */
        leaq  hypercall_table(%rip),%r10
        PERFC_INCR(hypercalls, %rax, %rbx)
        callq *(%r10,%rax,8)             /* call the hypercall handler */

When returning from a hypercall, the following code, which returns to kernel land, is eventually reached in "xen/arch/x86/x86_64/entry.S":

 /* %rbx: struct vcpu, interrupts disabled */
        testw $TRAP_syscall,4(%rsp)
        jz    iret_exit_to_guest

        addq  $8,%rsp
        popq  %rcx                    /* RIP */
        popq  %r11                    /* CS */
        cmpw  $FLAT_USER_CS32,%r11
        popq  %r11                    /* RFLAGS */
        popq  %rsp                    /* RSP */
        je    1f
        sysretq                       /* 64 bits kernel */
 1:     sysretl                       /* 32 bits kernel */

However, as specified in the Intel manuals, there is a specific case where the SYSRET instruction can fault: when the RCX register, which holds the RIP to return to, does not contain a canonical address, the processor raises a #GP (General Protection fault).

A canonical address is an address located in the following ranges:

- 0x0000000000000000 - 0x00007FFFFFFFFFFF
- 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF

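This restriction exists because only 48 bits of virtual address are implemented, which limits the page-table hierarchy to 4 levels (Page-Map Level 4) instead of the 6 levels that would be required to map the entire 64-bit address space.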

However, on Intel processors, the #GP is raised while the CPU is still in ring0, after the general purpose registers, including RSP, have already been restored to the values supplied by the guest kernel.

By controlling the general purpose registers, it is possible to influence the hypervisor behavior and gain code execution in the hypervisor context, escaping the guest context.

2. Triggering the Vulnerability on Xen and Citrix XenServer

The fact that the SYSRET instruction is only executed under the context of a hypercall implies that the SYSCALL instruction must be executed from kernel land. The obvious way to achieve this is to use a kernel module.

In order to trigger the bug, one has to map memory right below the non-canonical address range and execute a SYSCALL instruction placed so that the address of the instruction following it is non-canonical.

Linux does not normally allow mmap()ing addresses this close to the canonical boundary, such as 0x7FFFFFFFF000. A first trick is thus needed in order to easily map this area (we did not want to mess with the page tables).

Under Linux, the "mmap_region()" function in "mm/mmap.c" is eventually called by "mmap()" once all parameters have been validated. Calling this function directly with the right parameters allows us to map the page at 0x7FFFFFFFF000:

  void    *mapping;
  unsigned int vm_flags;

  mapping = p_mmap_region(NULL, 0x7ffffffff000, 0x1000,
              MAP_ANON|MAP_PRIVATE|MAP_FIXED|MAP_POPULATE, vm_flags, 0);


As stated in the mmap manual, MAP_POPULATE populates the page tables for the mapping immediately, without waiting for a page fault to occur. In the kernel module context, accessing the mapped page without this option leads to a failure.

By mmap()ing the address 0x7FFFFFFFF000, the kernel will allocate a page (0x1000 bytes) starting at address 0x7FFFFFFFF000. Since the page has a size of 0x1000 bytes, the allocated address range will be from 0x7FFFFFFFF000 to 0x7FFFFFFFFFFF.

Since the SYSCALL opcode (0F 05) is 2 bytes, having this instruction located at 0x7FFFFFFFFFFE triggers the bug:

 7FFFFFFFFFFE   0F 05   SYSCALL
 800000000000           Non-canonical address


After calling the SYSCALL instruction, the return address will be 0x800000000000 which is non-canonical, eventually triggering the #GP with ring0 privileges and attacker-controlled registers when executing the SYSRET instruction.

3. Advanced Exploitation on Xen and Citrix XenServer

Exploitation has been achieved under a 64-bit Linux PV guest running on Citrix XenServer 6.0.0 with Xen version 4.1.1. The method will work on other versions as well.

3.1 - Rewriting the trap frame return address

Since the guest is not fully virtualized, the "sidt" instruction actually returns the address of the Xen IDT, so smashing the Xen IDT allows arbitrary code execution. While the exploitation is local, it is comparable to a remote exploitation in that one does not know how to repair the IDT after achieving code execution. Even when using preserved IDT entries such as Divide Error and scanning forward in order to reconstruct the IDT, in many cases more than the IDT was mangled, and the damage proved close to impossible to repair. Thus, this technique works perfectly on some Xen versions but may fail on others.

A more reliable method is thus needed in order to exploit this vulnerability without knowing the Xen address space.

I had the opportunity to talk to Rafal Wojtczuk (who discovered this vulnerability) after his BlackHat presentation about this flaw. He actually came up with a clever idea which does not rely on any hardcoded address but requires digging a little in the internals of Xen exception handling.

The exploitation method relies on the way Xen retrieves the "current" variable, which is a pointer to a vCPU (Virtual CPU) structure which holds information about the virtual machine state: general purpose registers, virtual IDT, page tables, etc.

Xen retrieves the current_vcpu pointer by using the address of the bottom of the stack, thanks to the "get_current()" macro:

 struct cpu_info {
    struct cpu_user_regs guest_cpu_user_regs;
    unsigned int processor_id;
    struct vcpu *current_vcpu;
    unsigned long per_cpu_offset;
 #ifdef __x86_64__ /* get_stack_bottom() must be 16-byte aligned */
    unsigned long __pad_for_stack_bottom;
 #endif
 };

 static inline struct cpu_info *get_cpu_info(void)
 {
    struct cpu_info *cpu_info;
    __asm__ ( "and %%"__OP"sp,%0; or %2,%0"
              : "=r" (cpu_info)
              : "0" (~(STACK_SIZE-1)), "i" (STACK_SIZE-sizeof(struct cpu_info))
        );
    return cpu_info;
 }

 #define get_current()         (get_cpu_info()->current_vcpu)
 #define set_current(vcpu)     (get_cpu_info()->current_vcpu = (vcpu))
 #define current               (get_current())


When dispatching the #GP exception, the following code is reached in "xen/arch/x86/x86_64/entry.S":

 /* No special register assumptions. */
        testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
        jz    exception_with_ints_disabled
 1:     movq  %rsp,%rdi
        movl  UREGS_entry_vector(%rsp),%eax
        leaq  exception_table(%rip),%rdx
        GET_CURRENT(%rbx)                  /* retrieve current_vcpu from the */
                                           /* cpu_info structure             */
        PERFC_INCR(exceptions, %rax, %rbx)
        callq *(%rdx,%rax,8)               /* call exception handler */


When entering the "do_general_protection()" function in "xen/arch/x86/traps.c", several checks are performed:

 asmlinkage void do_general_protection(struct cpu_user_regs *regs)
 {
    struct vcpu *v = current;
    unsigned long fixup;

    DEBUGGER_trap_entry(TRAP_gp_fault, regs);

    if ( regs->error_code & 1 )
        goto hardware_gp;

    if ( !guest_mode(regs) )
        goto gp_in_kernel;

    /* ... */

    DEBUGGER_trap_fatal(TRAP_gp_fault, regs);

    panic("GENERAL PROTECTION FAULT\n[error_code=%04x]\n", regs->error_code);
 }


Under normal circumstances, the #GP handler ends with a panic, which is not the desired behavior. However, the guest_mode macro can be tricked into thinking the #GP was actually triggered from ring3:

 #define guest_mode(r)                                                         \
 ({                                                                            \
      unsigned long diff = (char *)guest_cpu_user_regs() - (char *)(r);        \
      /* Frame pointer must point into current CPU stack. */                   \
      ASSERT(diff < STACK_SIZE);                                               \
      /* If not a guest frame, it must be a hypervisor frame. */               \
      ASSERT((diff == 0) || (!vm86_mode(r) && (r->cs == __HYPERVISOR_CS)));    \
      /* Return TRUE if it's a guest frame. */                                 \
      (diff == 0);                                                             \
 })


The "guest_cpu_user_regs()" macro is defined as follows in "xen/include/asm-x86/current.h":

 #define guest_cpu_user_regs() (&get_cpu_info()->guest_cpu_user_regs)

If the diff variable can be made equal to 0, the check passes and the exception is treated as having been issued from ring3, which causes a different code path to be followed.

GET_CURRENT and "guest_cpu_user_regs()" respectively give the following assembly, where the destination register initially holds the stack mask ~(STACK_SIZE-1):

 and RBX, RSP
 or RBX, 0x7FE8
 mov RBX, [RBX]


 and RAX, RSP
 or RAX, 0x7F18


Controlling these two values is easily achievable, since RSP can be set to a user-controlled address whose surrounding memory is filled with controlled values.

The exception is considered to be coming from ring3 and thus the following code path is executed:

    if ( (regs->error_code & 3) == 2 )
    {
        /* ... */
    }
    else if ( is_pv_32on64_vcpu(v) && regs->error_code )
    {
        /* ... */
    }

    /* Emulate some simple privileged and I/O instructions. */
    if ( (regs->error_code == 0) &&
         emulate_privileged_op(regs) )
    {
        trace_trap_one_addr(TRC_PV_EMULATE_PRIVOP, regs->eip);
        return;
    }


Eventually "emulate_privileged_op()" is called in "xen/arch/x86/traps.c", but it generates a #PF in the "read_descriptor()" sub-function, on the following piece of code:

 static int read_descriptor(unsigned int sel,
                            const struct vcpu *v,
                            const struct cpu_user_regs * regs,
                            unsigned long *base,
                            unsigned long *limit,
                            unsigned int *ar,
                            unsigned int vm86attr)
 {
    struct desc_struct desc;

    if ( !vm86_mode(regs) )
    {
        if ( sel < 4)
            desc.b = desc.a = 0;
        else if ( __get_user(desc,
                        (const struct desc_struct *)(!(sel & 4)
                                                     ? GDT_VIRT_START(v)
                                                     : LDT_VIRT_START(v))
                        + (sel >> 3)) )
            return 0;
        /* ... */
    }
    /* ... */
 }

The "__get_user()" macro uses the segment selector as an index into the kernel virtual GDT / LDT, which causes a #PF on an invalid read.

The "do_page_fault()" exception handler is then reached in "xen/arch/x86/traps.c":

 asmlinkage void do_page_fault(struct cpu_user_regs *regs)
 {
    unsigned long addr, fixup;
    unsigned int error_code;

    addr = read_cr2();

    /* fixup_page_fault() might change regs->error_code, so cache it here. */
    error_code = regs->error_code;

    DEBUGGER_trap_entry(TRAP_page_fault, regs);

    /* ... */

    if ( unlikely(fixup_page_fault(addr, regs) != 0) )
        return;

    if ( unlikely(!guest_mode(regs)) )
    {
        if ( spurious_page_fault(addr, error_code) )
            return;
        /* ... */
    }
    /* ... */
 }


Since RSP has been modified by the values pushed on the stack for the nested exception, the guest_mode check no longer passes, and the "spurious_page_fault()" function is called:

 static int __spurious_page_fault(unsigned long addr, unsigned int error_code)
 {
    unsigned long mfn, cr3 = read_cr3();
    l4_pgentry_t l4e, *l4t;
    l3_pgentry_t l3e, *l3t;
    l2_pgentry_t l2e, *l2t;
    l1_pgentry_t l1e, *l1t;
    unsigned int required_flags, disallowed_flags;

    /*
     * We do not take spurious page faults in IRQ handlers as we do not
     * modify page tables in IRQ context. We therefore bail here because
     * map_domain_page() is not IRQ-safe.
     */
    if ( in_irq() )
        return 0;

    /* ... */

    return 1;
 }

The "in_irq()" macro must evaluate to false, because if "spurious_page_fault()" returns 0, a panic is triggered. However, the data read by "in_irq()", which compiles to the following assembly, is user-controlled:

 0xffff82c4801fe2bc <+4>  : mov rax,0xffffffffffff8000
 0xffff82c4801fe2c3 <+11>: lea rcx,[rip+0x921b6]   # 0xffff82c480290480 <irq_stat>
 0xffff82c4801fe2ca <+18>: and rax,rsp
 0xffff82c4801fe2cd <+21>: or rax,0x7f18
 0xffff82c4801fe2d3 <+27>: mov edx,DWORD PTR [rax+0xc8]
 0xffff82c4801fe2d9 <+33>: xor eax,eax
 0xffff82c4801fe2db <+35>: shl rdx,0x7
 0xffff82c4801fe2df <+39> : cmp DWORD PTR [rcx+rdx*1+0x8],0x0

We control the structure whose address is computed in RAX, since that address is derived from RSP. An index is read from a field of this structure (offset 0xc8, which corresponds to processor_id in the cpu_info layout) and is used to index the irq_stat array. By providing an index of 0, the compared counter is 0 and the check is bypassed; the function eventually returns 1 and leaves the "do_page_fault()" handler.

The code flow thus resumes right after the instruction in "read_descriptor()" which triggered the #PF. The following instructions lead "read_descriptor()" to return 0.

The return value of "read_descriptor()" is then checked in "emulate_privileged_op()":

    if ( !read_descriptor(regs->cs, v, regs,
                          &code_base, &code_limit, &ar,
                          _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
        goto fail;

    /* ... */

 fail:
    return 0;


The "emulate_privileged_op()" function thus returns 0, so the if block is skipped and "do_guest_trap()" is reached, which is where things get interesting:

    /* Emulate some simple privileged and I/O instructions. */
    if ( (regs->error_code == 0) &&
         emulate_privileged_op(regs) )
    {
        trace_trap_one_addr(TRC_PV_EMULATE_PRIVOP, regs->eip);
        return;
    }

    /* Pass on GPF as is. */
    do_guest_trap(TRAP_gp_fault, regs, 1);


The "do_guest_trap()" function is responsible for propagating the #GP fault to the exception handler of the guest kernel. Since the hypervisor has been tricked into thinking that the exception was issued from ring3, it considers it legitimate to hand the fault over to the kernel:

 static void do_guest_trap(int trapnr, const struct cpu_user_regs *regs, int use_error_code)
 {
    struct vcpu *v = current;
    struct trap_bounce *tb;
    const struct trap_info *ti;

    trace_pv_trap(trapnr, regs->eip, use_error_code, regs->error_code);

    tb = &v->arch.trap_bounce;
    ti = &v->arch.guest_context.trap_ctxt[trapnr];

    tb->flags = TBF_EXCEPTION;
    tb->cs    = ti->cs;
    tb->eip   = ti->address;

    if ( use_error_code )
    {
        tb->flags |= TBF_EXCEPTION_ERRCODE;
        tb->error_code = regs->error_code;
    }

    if ( TI_GET_IF(ti) )
        tb->flags |= TBF_INTERRUPT;

    if ( unlikely(null_trap_bounce(v, tb)) )
        gdprintk(XENLOG_WARNING, "Unhandled %s fault/trap [#%d] "
                 "on VCPU %d [ec=%04x]\n",
                 trapstr(trapnr), trapnr, v->vcpu_id, regs->error_code);
 }

Here is where the magic happens. The current pointer is again used to locate the trap_bounce structure. The addresses of the two structures (tb and ti) are fixed offsets from the arch field of the structure that current points to. Since current is controlled, one can supply a pointer such that tb->cs and tb->eip overlap, respectively, the saved CS and RIP of the exception frame created when the SYSRET instruction generated the #GP fault.

The "do_general_protection()" function will then return and eventually the following assembly code will be executed:

 /* %rbx: struct vcpu, interrupts disabled */
        testw $TRAP_syscall,4(%rsp)
        jz    iret_exit_to_guest
        /* ... */
 /* No special register assumptions. */
 iret_exit_to_guest:
        addq  $8,%rsp
 .Lft0: iretq


When the IRETQ instruction is reached, it pops the user-controlled return address and segment selector from the frame and transfers execution to that location.

As stated earlier, the structures are located at user-controlled locations, so no hardcoded address is required, only offsets from structures are used. Thus reliable code execution is achieved with ring0 privileges.

3.2 - Code Execution in Host Context "dom0"

Since the exploit is performed from kernel land, the attacker is assumed to already be root on the guest virtual machine (legitimately or through a privilege escalation exploit). The sysret vulnerability can then be exploited to target the dom0 virtual machine, which is the most privileged domain and the only virtual machine that has direct access to hardware by default. From dom0, the hypervisor can be managed and unprivileged domains (domU) can be launched.

Under Citrix XenServer, dom0 is a 32-bit virtual machine; this configuration is used for performance reasons.

The strategy here will be to inject a bindshell (or reverse shell) payload into a dom0 root process in order to get a root shell on dom0. Trying to map dom0 pages directly by playing with page tables is not an easy task.

The same idea as in remote kernel exploitation will be used: hijack the interrupt 0x80 syscall handler and wait for an interrupt from dom0 to occur. When an interrupt is triggered from dom0, one is assured that dom0 virtual pages are mapped into memory.

Xen is mapped at a static location in all the virtualized kernels in the same way that address 0xc0000000 is mapped in all 32-bit user land processes.

Since all Xen memory is mapped as RWX from the hypervisor point of view, one just has to overwrite an unused Xen location with a new int 0x80 handler and overwrite the Xen 0x80 entry in the IDT.

The 1st stage handler performs the following actions:

If a user land syscall is performed from dom0, the process context is saved and replaced by arguments for an "mmap()" syscall with RWX permissions. The process EIP is set to return on an "int 0x80" instruction. The original "int 0x80" handler is then called.

After the syscall has successfully been executed, the process returns on the "int 0x80" instruction and performs a syscall with EAX holding the address of the mmap()ed memory. The 2nd stage "int 0x80" handler is then reached:

The stage0 shellcode will "fork()" the original process and set the parent process EIP again to the "int 0x80" instruction, this time with the original parameters, so that the parent process continues its execution smoothly without any fault. The child eventually executes a classic ring3 shellcode and gives us the key to the kingdom0!

Copyright VUPEN Security

