What do a matryoshka doll and a fukuruma doll have to do with operating systems?
(Images from www.cse.ucsd.edu/~saul/images/matryoshka.jpg and http://russian-crafts.com/nest/history/fucu.jpg.)
How many of you have used VMware, Parallels or Xen? (VMware, by the way, claims to be the fastest-growing software company in history.) If you have, then you have used a virtual machine (VM) and a virtual machine monitor (VMM). Because an operating system usually runs in supervisor mode, and the VMM supervises the operating systems, VMMs are also referred to as hypervisors.
The basic goal of a hypervisor is to allow multiple operating systems to run on the same hardware at the same time. This is not simply dual boot, but dynamic sharing of the CPU, memory and other resources, just as different processes share the system in a multitasking OS. Moreover, the different instances of the operating systems, known as guest OSes, can be heterogeneous.
In 1974, Popek and Goldberg defined it this way (adapted):
"A virtual machine is taken to be an efficient, isolated duplicate of the real machine...As a piece of software, a VMM [virtual machine monitor] has three essential characteristics.
- First, the VMM provides an environment for programs which is essentially identical with the original machine;
- second, programs run in this environment show at worst only minor decreases in speed; and
- last, the VMM is in complete control of system resources."
Today, we would augment/relax those conditions:
Below is an image of VMware's VMotion, which allows live migration of a server from one physical machine to another.
You may not realize that virtual machine technology actually goes back to the 1960s. The original goal of IBM's VM was to completely virtualize the underlying hardware. IBM had the distinct advantage that it could change both the hardware and the software. Its primary goal was sharing of the hardware for legacy software. The pre-existing APIs gave applications (typically large databases) very direct access to the device controllers, and IBM wanted to preserve the customers' investment in that software while making the hardware shareable, in order to bring down total cost of ownership (TCO).
Controlling the hardware gave IBM a huge advantage: it didn't have to worry about thousands of different types of peripherals, odd trap semantics and page tables, etc. The biggest initial hurdle to a truly useful VMM is that plethora of peripherals. VMware solves this difficult problem by using a Host OS in addition to the Guest OS. Their chosen Host OS is Linux.
The Host OS actually performs the I/O on behalf of the VMM, using the host OS's device drivers. The VMM communicates with a process running inside the VM to execute the I/Os and return the results, regardless of which VM actually requested the data.
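The I/O path just described can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names `HostIOHelper` and `ToyVMM`, the 512-byte sector size, and the idea of backing each virtual disk with an ordinary host file are all illustrative choices, not VMware's actual design.

```python
from collections import deque

SECTOR = 512  # assumed sector size for this sketch


class HostIOHelper:
    """Hypothetical helper that performs I/O through ordinary host OS calls,
    so the host's device drivers do the real work."""

    def __init__(self, disk_images):
        # Maps a VM identifier to the host file backing its virtual disk.
        self.disk_images = disk_images

    def read_sector(self, vm_id, sector):
        with open(self.disk_images[vm_id], "rb") as f:
            f.seek(sector * SECTOR)
            return f.read(SECTOR)


class ToyVMM:
    """Hypothetical VMM that queues guest requests and forwards them."""

    def __init__(self, helper):
        self.helper = helper
        self.pending = deque()

    def guest_read(self, vm_id, sector):
        # A guest's disk read is intercepted and queued by the VMM.
        self.pending.append((vm_id, sector))

    def service(self):
        # The same helper serves every request, whichever VM issued it.
        results = []
        while self.pending:
            vm_id, sector = self.pending.popleft()
            results.append((vm_id, self.helper.read_sector(vm_id, sector)))
        return results
```

The point of the design is that the VMM never needs a driver of its own: every device the host OS supports is automatically available to every guest.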
Hardware support for virtualization certainly makes life a lot simpler. Intel has added support to both IA-32 (x86) and IA-64 (Itanium) in its newest round of chips, and Sun has support in UltraSPARC. With these hardware extensions, binary translation and the source code modification for paravirtualization are both unnecessary.
The single biggest problem presented by the IA-32 architecture is that some operations that a user process might not be privileged to execute are silently ignored by the hardware. The classic example is POPF, which leaves the interrupt-enable flag unchanged, without raising any exception, when executed with insufficient privilege. For a VMM, you would much prefer that the hardware trap. VMware solves this problem by dynamically modifying the binary code of the guest OS so that such operations trap to the VMM. This approach involves a lot of execution overhead when the system first starts, but should run well once the cache of modified code is warm.
Xen's approach, known as paravirtualization, is to modify the guest OS source code so that it cooperates with the VMM. They have successfully used this approach for Linux and Windows.
A major problem on Intel architectures is ring compression. On IA-32, there are four protection rings, 0-3. Traditionally, the OS uses ring 0 ("supervisor mode"), and applications use ring 3, with separate page tables. Rings 1 and 2 were used by OS/2, but no important OS today uses them. They are similar to Multics' protection rings, another important idea from the 1960s, but one that has been out of fashion for much of the last thirty years.
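To make the "warm cache" remark concrete, here is a minimal sketch of dynamic translation with a code cache. It is a toy under stated assumptions: instructions are plain strings, `SILENT_SENSITIVE`, `translate_block` and `TranslationCache` are hypothetical names, and a real translator works on machine code rather than mnemonics.

```python
# Instructions that fail silently in user mode instead of trapping.
# Only POPF is listed here; the set is illustrative, not exhaustive.
SILENT_SENSITIVE = {"popf"}


def translate_block(block):
    """Rewrite one basic block so that sensitive instructions
    explicitly call into the VMM instead of failing silently."""
    return tuple(
        f"call_vmm({insn})" if insn in SILENT_SENSITIVE else insn
        for insn in block
    )


class TranslationCache:
    """Caches translated blocks by guest address, so the translation
    cost is paid only the first time a block is executed."""

    def __init__(self):
        self._cache = {}
        self.misses = 0

    def fetch(self, address, block):
        if address not in self._cache:
            self.misses += 1  # cold: translate and remember
            self._cache[address] = translate_block(block)
        return self._cache[address]


cache = TranslationCache()
first = cache.fetch(0x1000, ["mov", "popf", "ret"])   # translated now
second = cache.fetch(0x1000, ["mov", "popf", "ret"])  # served from cache
```

Every block is a miss at startup, which is where the overhead comes from; once the guest's hot code has all been seen, nearly every fetch is a hit.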
The obvious solution would be to have the hypervisor run in ring 0, the OS in ring 1, and the apps in ring 3. However, there's a problem: the MMU doesn't distinguish among rings 0, 1, and 2. All three of them can change the page tables at will, and all three levels have access to all memory. This problem forces the guest OS to run in ring 3, the same as the applications themselves. This in turn complicates the VMM, which must now protect the guest OS from its own applications, because the hardware rings no longer separate them.
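The MMU's view of the rings can be written down as a one-line rule. The sketch below assumes the standard IA-32 paging model, in which each page carries a single user/supervisor bit; the function name `page_access_allowed` is a hypothetical one chosen for illustration.

```python
def page_access_allowed(ring: int, supervisor_only: bool) -> bool:
    """Model of the IA-32 paging check: rings 0, 1 and 2 all count as
    'supervisor', and only ring 3 counts as 'user'."""
    if ring not in (0, 1, 2, 3):
        raise ValueError("IA-32 has rings 0 through 3 only")
    is_supervisor = ring < 3
    return is_supervisor or not supervisor_only


# A page holding hypervisor data is marked supervisor-only.
hypervisor_in_ring0 = page_access_allowed(0, supervisor_only=True)  # True
guest_os_in_ring1 = page_access_allowed(1, supervisor_only=True)    # True
guest_os_in_ring3 = page_access_allowed(3, supervisor_only=True)    # False
```

A guest OS in ring 1 passes exactly the same check as the hypervisor in ring 0, so paging alone cannot protect the hypervisor from it. Only ring 3 is actually fenced off.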
Some VMMs support fixed partitioning of the hardware resources, especially memory; others do it dynamically. VMware uses a "balloon process" that it "inflates" inside a VM when the VMM wants to recover some memory. As the balloon inflates, it demands memory from the guest OS, which pages out some of its own data to satisfy the request. The balloon process then hands the pages it holds to the VMM, which can give them to another VM.
One important problem for desktop sharing of OSes: how do you communicate between them? "Drag and drop" for files would be incredibly useful, but it is hard to build and maintain, and done improperly it is a possible security hole. The simplest approach, and the one that VMware initially took, is to allow the two VMs to communicate through the network, as if they were running on separate machines.
A VMM is itself a kind of OS, but a very restricted one, focusing only on resource management. It provides no GUI library calls, no shell, no real file system or network stack; it allows the guest OSes to provide all of those.
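The ballooning mechanism can be illustrated with a small accounting model. This is a sketch under stated assumptions: memory is counted in whole pages, the `Guest` class and its fields are hypothetical, and a real balloon pins specific physical pages rather than adjusting counters.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    name: str
    total_pages: int   # memory the guest OS believes it owns
    in_use: int        # pages currently holding guest data
    balloon: int = 0   # pages held by the balloon
    swapped: int = 0   # pages the guest has paged out to its own disk

    def inflate(self, pages: int) -> int:
        """Grow the balloon. If the guest runs short, it pages out its
        own data. Returns the number of pages handed to the VMM."""
        pages = min(pages, self.total_pages - self.balloon)
        self.balloon += pages
        overflow = self.in_use + self.balloon - self.total_pages
        if overflow > 0:
            # The guest OS itself picks the victims; the VMM never has
            # to guess which guest pages are unimportant.
            self.in_use -= overflow
            self.swapped += overflow
        return pages

    def deflate(self, pages: int) -> int:
        """Shrink the balloon, returning memory to the guest."""
        pages = min(pages, self.balloon)
        self.balloon -= pages
        return pages


donor = Guest("vm-a", total_pages=100, in_use=90)
reclaimed = donor.inflate(30)  # VMM asks for 30 pages back
```

After the call, `donor` has paged out 20 pages of its own data to make room, and the VMM holds 30 pages it can assign to another VM.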
The key is to avoid "feature creep" or "software bloat" so that the VMM remains lightweight.
All of these are serious concerns. But because the VMM is small and does not run user processes directly, it is easier to make secure. Moreover, the VMM is generally a net win in security, because it can sometimes recognize attempts to subvert the guest OSes running on top of it.
Lecture 13, July 13: OS Case Studies
Lecture 13, July 13: Real-Time and Embedded Systems
Followup for this week: