Sun Jun 18 2017 #dma #virtualization #x86

How Server Virtualization's Memory Segmentation Works

Virtual Memory

To understand server virtualization, it helps to know a little about virtual memory implementations.

When reading this, please keep in mind that I’ve skipped past some important machinery, such as the TLB, caches, and speculative page-table walks, that makes all of this run quickly.

User-space processes, like your shell or editor, have their own view of memory. Below, we see how a trivial sample program thinks memory is organized:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main (int argc, char *argv[]) {
  char *fmtstr = "main        is at %p\nthis string is at %p\n";
  char *heap = calloc(1, 1);

  /* unbuffer stdout so each dot below appears as it is printed;
     setvbuf() must be called before any other output on the stream */
  setvbuf(stdout, NULL, _IONBF, 0);

  printf(fmtstr, main, fmtstr);
  printf("heap        is at %p\n", heap);
  printf("stack       is at %p\n", &argc);
  printf("\n");
  printf("Sleeping for about 10s before I exit");
  for (int i = 0; i < 10; i++) {
    printf(".");
    sleep(1);
  }
  printf("\n");
}
jsw@hal:/mnt/c/Users/jsw$ ./print_address
main        is at 0x40069d
this string is at 0x400808
heap        is at 0x249a010
stack       is at 0x7ffff4f1c6ec

Sleeping for about 10s before I exit.....

Before it exits, we have a look at its memory maps in a second terminal:

jsw@hal:/mnt/c/Users/jsw$ cat /proc/1042/maps
00400000-00401000 r-x- 00000000 00:00 13880                      /mnt/c/Users/jsw/print_address
00600000-00601000 r--- 00000000 00:00 13880                      /mnt/c/Users/jsw/print_address
00601000-00602000 rw-- 00001000 00:00 13880                      /mnt/c/Users/jsw/print_address
0249a000-024bb000 rw-- 00000000 00:00 0                          [heap]
7fd190830000-7fd1909ee000 r-x- 00000000 00:00 751161             /lib/x86_64-linux-gnu/libc-2.19.so
7fd1909ee000-7fd1909f5000 ---- 001be000 00:00 751161             /lib/x86_64-linux-gnu/libc-2.19.so
7fd1909f5000-7fd190bed000 ---- 00000000 00:00 0
7fd190bed000-7fd190bf1000 r--- 001bd000 00:00 751161             /lib/x86_64-linux-gnu/libc-2.19.so
7fd190bf1000-7fd190bf3000 rw-- 001c1000 00:00 751161             /lib/x86_64-linux-gnu/libc-2.19.so
7fd190bf3000-7fd190bf8000 rw-- 00000000 00:00 0
7fd190c00000-7fd190c23000 r-x- 00000000 00:00 751183             /lib/x86_64-linux-gnu/ld-2.19.so
7fd190e22000-7fd190e23000 r--- 00022000 00:00 751183             /lib/x86_64-linux-gnu/ld-2.19.so
7fd190e23000-7fd190e24000 rw-- 00023000 00:00 751183             /lib/x86_64-linux-gnu/ld-2.19.so
7fd190e24000-7fd190e25000 rw-- 00000000 00:00 0
7fd190f50000-7fd190f52000 rw-- 00000000 00:00 0
7fd190f60000-7fd190f61000 rw-- 00000000 00:00 0
7fd190f70000-7fd190f72000 rw-- 00000000 00:00 0
7ffff471e000-7ffff4f1e000 rw-- 00000000 00:00 0                  [stack]
7ffff54e2000-7ffff54e3000 r-x- 00000000 00:00 0                  [vdso]

All those hexadecimal addresses end in three zero digits; every mapping starts on a multiple of 0x1000. That’s because the x86 CPU’s virtual memory support implements paging with a 4KB page size. When memory is allocated to user-space processes, with protection from each other, it is done in 4KB (or larger) chunks.
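
You can see the page size, and the alignment it forces, for yourself. Here is a minimal sketch (assuming Linux or another POSIX system; the file name page_size.c is made up):

/* page_size.c - print the system page size and show that mmap()
 * hands back page-aligned memory. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  long page = sysconf(_SC_PAGESIZE);          /* usually 4096 on x86 */
  void *p = mmap(NULL, 1, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  printf("page size      : %ld bytes\n", page);
  printf("mmap() returned: %p (low 12 bits are zero)\n", p);
  munmap(p, 1);
  return 0;
}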

A graphical view of the process’s virtual memory space may look like this:

[Figure: Process View of Memory]

The Page Table

The page table, actually a multi-level radix tree (a trie) on modern x86, maps virtual addresses to physical RAM. Through various tricks, the operating system can implement disk swapping, shared memory, over-committing, and other optimizations. It also allows separating processes from each other; they can’t read or write data belonging to other processes.

[Figure: AMD64 Long-Mode Page Table Entry]
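
The hardware walks that tree for you, but as a rough sketch of how a 64-bit virtual address indexes the four levels of 4KB-page long-mode paging (the file name and sample address below are just for illustration):

/* va_split.c - decompose an x86-64 virtual address into the four
 * 9-bit page-table indices plus the 12-bit page offset. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
  uint64_t va = 0x7fd190830000ULL;   /* the libc mapping from the example above */

  unsigned pml4 = (va >> 39) & 0x1ff;  /* level 4 index */
  unsigned pdpt = (va >> 30) & 0x1ff;  /* level 3 index */
  unsigned pd   = (va >> 21) & 0x1ff;  /* level 2 index */
  unsigned pt   = (va >> 12) & 0x1ff;  /* level 1 index */
  unsigned off  =  va        & 0xfff;  /* byte within the 4KB page */

  printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
         pml4, pdpt, pd, pt, off);
  return 0;
}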

File-Backed Pages

File-backed pages are used by every process on a Linux system. You can run ten copies of bash or perl and the immutable portions of the executable and libraries, such as libc, will be shared among all ten. This saves you a lot of RAM.

The writable bit in my process’s page table entries mapping the libc code at 0x7fd190830000 is cleared to 0. This prevents me from trashing library code that is also in use by other processes on the system.

[Figure: Virtual Memory to Physical Memory Mappings]

The permission bits vary across the pages of my print_address program, too. That’s because part of print_address contains executable code, part contains constant data, and part can even contain copy-on-write data!
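
As a rough sketch of how those permission bits look from user space (assuming Linux/POSIX; the file name is invented), mprotect() asks the OS to change the bits for a whole page:

/* protect.c - map an anonymous page, then take away write permission.
 * After the mprotect() call, the commented-out store would segfault. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
  char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  strcpy(p, "hello");                 /* fine: page is read-write      */
  mprotect(p, 4096, PROT_READ);       /* clear the writable bit        */

  printf("still readable: %s\n", p);  /* fine: read permission remains */
  /* p[0] = 'H';                         this write would now segfault */

  munmap(p, 4096);
  return 0;
}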

User-Space Segmentation Faults

If you try to access an unmapped page, your program will experience a segmentation fault. This means your program is buggy and the memory protection system has done its job. For example, below, we try to access memory at address 0x0. This very common bug is called a null-pointer dereference:

jsw@hal:/mnt/c/Users/jsw$ cat buggy_program1.c
#include <stdio.h>

int main (int argc, char *argv[]) {
  int *magic_number;
  magic_number = 0; // set magic_number POINTER to point to address 0x0

  printf("trying to access a value at 0x0 will fail: %i\n", *magic_number);
}
jsw@hal:/mnt/c/Users/jsw$ gcc -std=gnu99 buggy_program1.c -o buggy_program1 && ./buggy_program1
Segmentation fault (core dumped)

The same thing also happens if your program tries to write to memory that is read-only. For example, buggy_program2.c tries to write to a constant string that was loaded from the read-only data section of the buggy_program2 executable. The result is a segmentation fault:

jsw@hal:/mnt/c/Users/jsw$ cat buggy_program2.c
#include <string.h>

int main (int argc, char *argv[]) {
  char *fmtstr = "main        is at %p\nthis string is at %p\n";

  strncpy(fmtstr, "source string is here", 5);
}

jsw@hal:/mnt/c/Users/jsw$ gcc -std=gnu99 buggy_program2.c -o buggy_program2 && ./buggy_program2
Segmentation fault (core dumped)
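
As an aside, a process can even look at the faulting address the kernel reports. A rough sketch (assuming Linux; the file name is invented) using sigaction() with SA_SIGINFO:

/* catch_segv.c - install a SIGSEGV handler that prints the faulting
 * address reported by the kernel, then exit. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  /* printf() isn't async-signal-safe; this is only a demo */
  printf("caught SIGSEGV at address %p\n", info->si_addr);
  exit(1);
}

int main(void) {
  struct sigaction sa = {0};
  sa.sa_sigaction = handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  int *magic_number = 0;
  return *magic_number;               /* null-pointer dereference */
}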

Page Faults

Page faults can be thought of as a superset of segmentation faults: a segmentation fault is just a page fault that the OS decides is a program error, while many other page faults are normal implementation effects of a virtual memory system.

For example, when the operating system wants to swap data out to the disk, it can choose a page and clear the present bit for that page. The OS then copies the data from RAM to disk. When the copy is complete, the OS can re-purpose the memory.

If a user-space process needs to read or write to that memory again, the CPU produces a page fault. This allows the OS to allocate another page of physical RAM, copy the data from disk to RAM, and resume user-space execution.
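
On Linux, a process can get a rough view of which of its pages are currently resident in physical RAM (as opposed to swapped out or never faulted in) with mincore(). A small sketch (the file name is invented):

/* resident.c - check whether the pages of a mapping are resident in
 * RAM using mincore().  Linux-specific. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  size_t len = 4 * 4096;
  char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  unsigned char vec[4];

  memset(p, 0, 4096);                 /* touch only the first page */

  mincore(p, len, vec);
  for (int i = 0; i < 4; i++)
    printf("page %d resident: %s\n", i, (vec[i] & 1) ? "yes" : "no");

  munmap(p, len);
  return 0;
}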

The page table entries contain Accessed and Dirty flags so the CPU and OS can cooperate to, among other things, minimize disk I/O: a page whose contents haven’t changed (it isn’t dirty) never needs to be copied from RAM to disk again, and a page whose Accessed flag is frequently set by the CPU is a poor candidate for swapping out.
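
For reference, here is a sketch of where those flags live in the low bits of an x86-64 long-mode page table entry (constants written out by hand from the architecture manuals; user-space code never sees raw entries like this):

/* pte_flags.c - the low flag bits of an x86-64 page table entry.
 * Illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)   /* page is mapped to physical RAM  */
#define PTE_WRITABLE (1ULL << 1)   /* writes allowed                  */
#define PTE_USER     (1ULL << 2)   /* accessible from user mode       */
#define PTE_ACCESSED (1ULL << 5)   /* set by CPU on any read or write */
#define PTE_DIRTY    (1ULL << 6)   /* set by CPU on a write           */

int main(void) {
  uint64_t pte = PTE_PRESENT | PTE_USER | PTE_ACCESSED;

  printf("present : %d\n", !!(pte & PTE_PRESENT));
  printf("writable: %d\n", !!(pte & PTE_WRITABLE));
  printf("dirty   : %d\n", !!(pte & PTE_DIRTY));
  return 0;
}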

Responsibilities

The hardware is responsible for:

  • translating virtual memory addresses to physical ones by walking (reading) the page table
  • updating dirty & accessed flags in page table entries
  • raising page faults to the OS when the page table doesn’t contain a valid mapping to physical RAM or a process tries to exceed its permission on a page

The OS is responsible for:

  • maintaining page tables for each process
  • switching to the correct page tables on a context switch from one process to another
  • retrieving swapped-out data from disk
  • fulfilling over-commitments by allocating promised RAM when it is first accessed or written
  • sending SIGSEGV to processes that try to access addresses which have never been allocated

Server Virtualization – Virtual Physical Memory

Hardware server virtualization works very similarly to what you’ve just learned about regular virtual memory implementations. Basically, one or more guest OSes have independent memory address spaces, and they can’t access any memory outside their own space.

A hypervisor, or host operating system, maintains a second set of page tables that tell the CPU how to map the guest’s physical addresses to actual physical RAM. AMD originally called their implementation nested page tables until marketing realized it doubled performance and named it something more exciting!

In a virtualized environment, translating a guest’s user-space virtual address to a physical RAM address therefore looks like this:

[Figure: nested page tables]
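
A toy model of the two stages (the flat arrays and names below are invented for illustration; real hardware walks multi-level tables at both stages):

/* nested_xlate.c - toy model of two-stage address translation under
 * nested paging.  The flat arrays stand in for the guest's page table
 * and the hypervisor's nested page table. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     16

/* guest page table: guest-virtual page -> guest-physical page (guest OS owns this) */
static uint64_t guest_pt[NPAGES]  = { [3] = 7 };
/* nested page table: guest-physical page -> host-physical page (hypervisor owns this) */
static uint64_t nested_pt[NPAGES] = { [7] = 12 };

static uint64_t translate(uint64_t guest_va) {
  uint64_t gva_page = guest_va >> PAGE_SHIFT;
  uint64_t offset   = guest_va & 0xfff;

  uint64_t gpa_page = guest_pt[gva_page];    /* stage 1: guest's own table  */
  uint64_t hpa_page = nested_pt[gpa_page];   /* stage 2: hypervisor's table */

  return (hpa_page << PAGE_SHIFT) | offset;
}

int main(void) {
  uint64_t gva = (3 << PAGE_SHIFT) | 0x2a4;
  printf("guest VA 0x%llx -> host PA 0x%llx\n",
         (unsigned long long)gva, (unsigned long long)translate(gva));
  return 0;
}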

Virtual Devices (NIC, Disk, etc.)

A virtual machine wouldn’t be very useful without a network or a disk. Hypervisors implement virtual NICs and block devices by presenting the guest with what appear to be real devices.

In short, paging tricks implement bounce buffers shared by the guest OS and the host OS, which allow a NIC or a block storage device to be emulated. Data going between the VM and the physical NIC or SSD/HDD has to pass through the hypervisor, though, and this has a significant effect on performance. Virtualized I/O is slow whether the guest sees the device as a specialized paravirtual NIC (e.g. KVM’s virtio) or thinks the emulated device is a popular piece of real hardware (emulating Intel NICs is common).

[Figure: virtual I/O]
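
A rough sketch of the shared-buffer idea (a toy descriptor ring loosely inspired by virtio-style queues; the structure and names are invented):

/* toy_ring.c - a toy descriptor ring like the ones a guest and a
 * hypervisor share for emulated I/O.  The guest fills descriptors and
 * bumps the head; the host consumes them and bumps the tail. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8

struct desc {
  uint64_t guest_addr;   /* guest-physical address of the data buffer */
  uint32_t len;          /* number of bytes to transfer               */
};

struct ring {
  struct desc slots[RING_SIZE];
  uint32_t head;         /* written by guest */
  uint32_t tail;         /* written by host  */
};

/* guest side: post a buffer for transmission */
static void guest_post(struct ring *r, uint64_t gpa, uint32_t len) {
  r->slots[r->head % RING_SIZE] = (struct desc){ gpa, len };
  r->head++;             /* a real implementation needs memory barriers */
}

/* host side: drain pending descriptors and "send" them */
static void host_drain(struct ring *r) {
  while (r->tail != r->head) {
    struct desc *d = &r->slots[r->tail % RING_SIZE];
    printf("host copies %u bytes from guest PA 0x%llx to the real NIC\n",
           d->len, (unsigned long long)d->guest_addr);
    r->tail++;
  }
}

int main(void) {
  struct ring r;
  memset(&r, 0, sizeof r);
  guest_post(&r, 0x7000, 1500);
  guest_post(&r, 0x9000, 60);
  host_drain(&r);
  return 0;
}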

PCI Passthru and SR-IOV

In systems with an adequate IO-MMU (for address translation), PCI devices can be used directly by guest operating systems. If the device, hypervisor, and guest’s device driver all support SR-IOV, that device can even be shared by multiple guests at the same time.

This avoids most of the performance penalty associated with emulated virtual devices!

IO-MMU

How does it work? The IO-MMU uses the guest-physical-to-host-physical page table to translate the memory addresses in the PCI device’s DMA reads and writes.

[Figure: IO-MMU]

The PCI device itself needs either to be dedicated entirely to one guest VM or to support being shared, the capability referred to as SR-IOV. That’s because the IO-MMU needs to know which page table to use for each PCI DMA transaction.

If a device is used by just one guest, all its DMA reads and writes can be translated through the page table used for that guest’s memory.

If the device is in use by several guests, though, the PCI card needs to issue each DMA transaction from a guest-specific virtual function so the IO-MMU can tell which guest’s page table applies. This is where the complex requirements arise.
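
A toy sketch of the bookkeeping involved (the table layout and names are invented; a real IO-MMU indexes its device table by the PCI requester ID, i.e. the bus/device/function of the function doing the DMA):

/* iommu_lookup.c - toy model of how an IO-MMU picks a translation
 * table per DMA transaction.  Each SR-IOV virtual function has its
 * own PCI requester ID, which selects the page table of the guest
 * it is assigned to. */
#include <stdio.h>
#include <stdint.h>

struct iommu_entry {
  uint16_t requester_id;   /* PCI bus/device/function of the VF      */
  int      guest_id;       /* which guest's nested page table to use */
};

static struct iommu_entry device_table[] = {
  { 0x0300, 1 },           /* VF 0 of the NIC -> guest 1 */
  { 0x0301, 2 },           /* VF 1 of the NIC -> guest 2 */
};

static int table_for_dma(uint16_t requester_id) {
  for (unsigned i = 0; i < sizeof device_table / sizeof device_table[0]; i++)
    if (device_table[i].requester_id == requester_id)
      return device_table[i].guest_id;
  return -1;               /* unknown device: the DMA is blocked */
}

int main(void) {
  printf("DMA from 0x0301 translated with guest %d's table\n",
         table_for_dma(0x0301));
  return 0;
}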

Common Limitations

Many NICs support SR-IOV but offer limited features when used in that mode. For example, it is very common for the virtual NICs to be limited to one pair (Tx & Rx) of DMA queues each, which can be a serious limitation for network performance.

[Figure: NIC DMA Queues]