Keio University
2008 Spring Semester
System Software / Operating Systems
Lecture 7, June 3: Virtual Memory and Page Replacement Algorithms
Outline
- Simple Swapping
- Introduction to Virtual Memory
- Page Replacement Algorithms
- The Mechanics of Paging: Page Tables and Page Files
- Final Thoughts
Simple Swapping
- Multiprogramming originally involved swapping complete
processes in and out of memory.
The original form of multiprogramming actually
involved swapping complete processes into and out of memory, to
a special reserved area of disk (or drum). This approach allowed each
process to act as if it owned all of the memory in the system, without
worrying about other processes. However, swapping a process out and
in is not fast! We want to be able to share the resources of
the computer among multiple processes, allowing fast
process context switches so that multiple programs can appear
to be using the CPU and other resources at the same time.
Introduction to Virtual Memory
- Each process has its own address space.
- Independent address spaces provide protection and naming.
- Page tables are maintained by the OS and used by the
hardware to map logical addresses to physical
addresses.
- The memory reference trace describes the order in which
a program accesses memory.
- When memory pressure is high, the amount of memory in
active use exceeds the amount available, and the system
must page out some data.
Address Spaces: Protection and Naming
Finally, we come to virtual memory. With virtual
memory, each process has its own address space. This
concept is a very important instance of naming. Virtual memory
(VM) provides several important capabilities:
- VM hides some of the layers of the memory hierarchy.
- VM's most common use is to make memory appear larger than it is.
- VM also provides protection and naming, and those
are independent of the above role.
In most modern microprocessors intended for general-purpose use, a
memory management unit, or MMU, is built into the
hardware. The MMU's job is to translate virtual addresses into
physical addresses.
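To make the MMU's job concrete, here is a minimal sketch in C of a
flat-table translation, assuming 32-bit addresses, 4KB pages, and a
hypothetical page_table[] array holding one frame number per virtual
page. A real MMU does this in hardware, with permission checks that
are omitted here:

#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)            /* 4096 bytes */

extern uint32_t page_table[];  /* hypothetical: one PTE per virtual page */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* offset within page  */
    uint32_t frame  = page_table[vpn];           /* physical frame no.  */
    return (frame << PAGE_SHIFT) | offset;       /* physical address    */
}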
Page Tables
(thanks to Chishiro for spotting those excellent diagrams on Wikipedia.)
Virtual memory is usually implemented by dividing memory up into
pages, which in Unix systems are typically, but not
necessarily, four kilobytes (4KB) each. The page table is the
data structure that holds the mapping from virtual to physical
addresses. The page frame is the actual physical storage in
memory.
The simplest approach would be a large, flat page table with one entry
per page. The entries are known as page table entries, or
PTEs. However, this approach results in a page table that is
too large to fit inside the MMU itself, meaning that it has to be in
memory. In fact, for a 4GB address space, with 32-bit PTEs and 4KB
pages, the page table alone is 4MB! That's big when you consider that
there might be a hundred processes running on your system.
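As a sanity check on that number, here is a tiny C program (mine, not
from any textbook) that redoes the arithmetic:

#include <stdio.h>

int main(void)
{
    unsigned long long address_space = 1ULL << 32;  /* 4GB, 32-bit */
    unsigned long long page_size     = 1ULL << 12;  /* 4KB pages   */
    unsigned long long pte_size      = 4;           /* 32-bit PTEs */

    unsigned long long entries = address_space / page_size;  /* 2^20 */
    unsigned long long bytes   = entries * pte_size;         /* 4MB  */

    printf("%llu PTEs, %llu MB per process\n", entries, bytes >> 20);
    return 0;
}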
The solution is multi-level page tables. As the size of the
process grows, additional pages are allocated, and when they are
allocated the matching part of the page table is filled in.
The translation from virtual to physical address must be fast.
This fact argues for as much of the translation as possible to be done
in hardware, but the tradeoff is more complex hardware, and more
expensive process switches. Since it is not practical to put the
entire page table in the MMU, the MMU includes what is called the
TLB: translation lookaside buffer.
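Here is a rough sketch, in C, of the walk a two-level design requires,
assuming the classic 32-bit x86 split of a 10-bit directory index, a
10-bit table index, and a 12-bit offset. The names and the direct
pointer arithmetic are illustrative only; a real kernel must map
physical addresses before dereferencing them:

#include <stdint.h>

#define PRESENT 0x1u

uint32_t walk(const uint32_t *page_dir, uint32_t vaddr)
{
    uint32_t dir_idx = (vaddr >> 22) & 0x3FFu;   /* top 10 bits    */
    uint32_t tbl_idx = (vaddr >> 12) & 0x3FFu;   /* middle 10 bits */
    uint32_t offset  =  vaddr        & 0xFFFu;   /* low 12 bits    */

    uint32_t pde = page_dir[dir_idx];
    if (!(pde & PRESENT))
        return 0;   /* no second-level table; 0 stands in for "fault" */

    const uint32_t *table = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
    uint32_t pte = table[tbl_idx];
    if (!(pte & PRESENT))
        return 0;   /* page not resident: page fault */

    return (pte & ~0xFFFu) | offset;             /* physical address */
}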
Memory Pressure and the Reference Trace
We will discuss the process of paging, where parts of memory
are stored on disk when memory pressure is high. The memory
pressure is the number of pages that processes and the kernel are
currently trying to access, compared to the number of physical pages
(or page frames) that are available in the system. When the pressure
is low, everything fits comfortably in memory, and only things that
have never been referenced have to be brought in from disk. When the
pressure is high, not everything fits in memory, and we must swap
out entire processes or page out portions of processes (in
some systems, parts of the kernel may also be pageable).
The reference trace is the way we describe which memory has
recently been used. It is the total history of the system's accesses
to memory. The reference trace is an important theoretical concept,
but we can't track it exactly in the real world, so various algorithms
for keeping approximate track have been developed.
Paging
Paging is the process of moving data into and out of
the backing store where the not-in-use data is kept. When the
system decides to reduce the amount of physical memory that a process
is using, it pages out some of the process's memory. The
opposite action, bringing some memory in from the backing store, is
called paging in.
When an application attempts to reference a memory address, and the
address is not part of the process's address space, a page
fault occurs. The fault traps into the kernel, which must decide
what to do about it. If the process is not allowed to access the
page, on a Unix machine a segmentation fault is signalled to
the application. If the kernel finds the memory that the application
was attempting to access elsewhere in memory, it can add that page to
the application's address space. We call this a soft fault.
If the desired page must be retrieved from disk, it is known as a
hard fault.
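In rough C pseudocode, the decision looks like the sketch below. The
predicate names are mine, not a real kernel API; a real kernel answers
these questions by consulting the process's memory map and its page
cache:

#include <stdint.h>
#include <stdbool.h>

/* Illustrative stubs standing in for real kernel lookups. */
static bool address_is_valid(uintptr_t vaddr)      { return vaddr != 0; }
static bool page_already_resident(uintptr_t vaddr) { return false; }

enum fault_kind { SEGFAULT, SOFT_FAULT, HARD_FAULT };

enum fault_kind classify_fault(uintptr_t vaddr)
{
    if (!address_is_valid(vaddr))
        return SEGFAULT;    /* invalid access: signal the application */
    if (page_already_resident(vaddr))
        return SOFT_FAULT;  /* page found in memory: just map it in   */
    return HARD_FAULT;      /* page must be read from disk            */
}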
Page Replacement Algorithms
- OPT is the ideal, provably optimal algorithm. However,
it can't be realized in practice.
- LRU is a good compromise, but is difficult to fully
implement.
- Various approximations to LRU are used.
- Decisions on paging can be made either locally or globally.
When the kernel decides to page something in, and the memory is full,
it must decide what to page out. We are looking for several
features in a page replacement algorithm:
- Simple to implement correctly
- Low run-time cost for maintaining data structures
- Minimizes amount of paging activity (in and out) in normal case
- Robust against pathological reference traces
The optimal algorithm (known as OPT or the
clairvoyant algorithm) is known: throw out the page that will
not be reused for the longest time in the future. Unfortunately, it's
impossible to implement, since we don't know the exact future
reference trace until we get there!
There are many page replacement algorithms, some
of which require extra hardware support. (Most take advantage of the
referenced and modified bits in the PTE.) Here are a
few:
- OPT
- FIFO
- NRU
- LRU
- Clock
- Working set
FIFO is pretty obvious: first-in, first-out. It doesn't work
terribly well. The others we'll look at one by one.
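A complete FIFO simulation fits in a few lines of C. This one (a
made-up example, not from the text) runs a short reference trace
through three frames and counts the faults:

#include <stdio.h>

#define NFRAMES 3

int main(void)
{
    int trace[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = (int)(sizeof trace / sizeof trace[0]);
    int frames[NFRAMES] = {-1, -1, -1};
    int next = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < NFRAMES; f++)
            if (frames[f] == trace[i])
                hit = 1;
        if (!hit) {
            frames[next] = trace[i];         /* evict the oldest page */
            next = (next + 1) % NFRAMES;
            faults++;
        }
    }
    printf("%d faults\n", faults);           /* prints: 9 faults */
    return 0;
}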
NRU
NRU, or Not Recently Used, uses the referenced
and modified bits (Tanenbaum refers to these as the R
and M bits; on x86, they are Accessed and Dirty)
in the PTE to implement a very simple algorithm. The two bits divide
the pages up into four classes. Any page from the lowest occupied
class can be chosen to be replaced. An important feature of this
algorithm is that it prefers to replace pages that have
not been modified, which saves a disk I/O to write them out.
When a clock interrupt is received, all of the R and M
bits in all of the page tables of all of the memory-resident processes
are cleared. (Obviously, this is expensive, but there are a number of
optimizations that can be done.) The R bit is then set
whenever a page is accessed, and the M bit is set whenever a
page is modified. The MMU may do this automatically in hardware, or
it can be emulated by setting the protection bits in the PTE to trap,
then letting the trap handler manipulate R and M
appropriately and clear the protection bits.
Because NRU does not distinguish among the pages in one of the four
classes, it often makes poor decisions about what pages to page
out.
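The class computation itself is trivial; here is a small sketch (the
class numbering follows Tanenbaum, with the victim chosen from the
lowest occupied class):

#include <stdio.h>

/* NRU's four classes: unreferenced, unmodified pages go first. */
static int nru_class(int r, int m)
{
    return (r << 1) | m;
}

int main(void)
{
    printf("R=0 M=0 -> class %d (best victim: no writeback)\n",
           nru_class(0, 0));
    printf("R=0 M=1 -> class %d (needs writeback)\n", nru_class(0, 1));
    printf("R=1 M=0 -> class %d\n", nru_class(1, 0));
    printf("R=1 M=1 -> class %d (worst victim)\n", nru_class(1, 1));
    return 0;
}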
LRU
LRU, or Least Recently Used, is a pretty good
approximation to OPT, but far from perfect. The full implementation
of LRU would require being able to exactly order the PTEs according to
recency of access. A linked list could be used, or a counter stored
in the PTE itself. In either case, every memory access requires
updating an in-memory data structure, which is too expensive.
According to one source, Linux, FreeBSD and Solaris may all use a very
heavily-modified form of LRU. (I suspect this information is out of
date, but have not had time to dig through the Linux kernel yet.)
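For concreteness, here is what the (too expensive) counter version
looks like in C; every access updates a per-page timestamp, and
eviction scans for the smallest one:

#include <stdint.h>

#define NPAGES 8

static uint64_t last_used[NPAGES];   /* per-page timestamp ("counter") */
static uint64_t now;                 /* advances on every access       */

void on_access(int page)             /* must run on EVERY reference -- */
{                                    /* this is why exact LRU is slow  */
    last_used[page] = ++now;
}

int lru_victim(void)
{
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (last_used[p] < last_used[victim])
            victim = p;
    return victim;                   /* least recently used page */
}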
Clock (Second Chance)
With the clock algorithm, all of the page frames are ordered in
a ring. The order never has to change, and few changes are made to
the in-memory structures, so its execution performance is good. The
algorithm uses a clock hand that points to a position in the
ring. When a page fault occurs, the memory manager checks the page
currently pointed to. If the R bit is zero, that page is
replaced. If R is one, then R is set to zero, and the
clock hand is advanced until a page with R = 0 is found. This
algorithm is also called second chance.
I believe early versions of BSD used clock; I'm not sure if they still
do.
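In C, the whole algorithm is just a loop around the ring. A minimal
sketch, assuming the MMU sets r_bit[] whenever a frame is accessed:

#define NFRAMES 8

static int r_bit[NFRAMES];   /* referenced bits, set by the hardware */
static int hand;             /* current position of the clock hand   */

int clock_victim(void)
{
    for (;;) {
        if (r_bit[hand] == 0) {              /* not recently referenced */
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;                   /* replace this frame */
        }
        r_bit[hand] = 0;                     /* second chance */
        hand = (hand + 1) % NFRAMES;
    }
}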
Working Set
Early on in the history of virtual memory, researchers recognized that
not all pages are accessed uniformly. Every process has some pages
that it accesses frequently, and some that are accessed only
occasionally. The set of pages that a process is currently accessing
is known as its working set. Some VM systems attempt to track
this set, and page it in or out as a unit. Wikipedia says that VMS
uses a form of FIFO, but my recollection is that it actually uses a
form of working set.
Global v. Local (per-process) Replacement
So far, we have described paging in terms of a single process, or a
single reference string. However, in a multiprogrammed machine,
several processes are active at effectively the same time. How do you
allocate memory among the various processes? Is all of memory treated
as one global pool, or are there per-process limits?
Most systems include at least a simple way to set an upper
bound on the size of virtual memory and on the size of the
resident memory set. On a Unix or Linux system, the shell
usually provides a builtin function called ulimit, which
will tell you what those limits are. The corresponding system calls
are getrlimit and setrlimit. VMS has many parameters
that control the behavior of the VM system.
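Here is a short C example of querying those two limits. (Note that on
modern Linux the RLIMIT_RSS limit is defined but not actually
enforced):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit as, rss;

    if (getrlimit(RLIMIT_AS, &as) == 0)       /* virtual memory size */
        printf("RLIMIT_AS:  cur=%llu max=%llu\n",
               (unsigned long long)as.rlim_cur,
               (unsigned long long)as.rlim_max);

    if (getrlimit(RLIMIT_RSS, &rss) == 0)     /* resident set size */
        printf("RLIMIT_RSS: cur=%llu max=%llu\n",
               (unsigned long long)rss.rlim_cur,
               (unsigned long long)rss.rlim_max);
    return 0;
}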
Note that while the other algorithms conceivably work well when
treating all processes as a single pool, working set does not.
The Mechanics of Paging: Page Tables and Page Files
- The exact page table format is determined primarily by
hardware, but its utilization and management are done in
software.
- When pages are paged out, they go to a page file on
disk.
- In many systems, you will see a page daemon process, or
virtual process, that is responsible for managing the paging
I/O.
Reviewing:
- The page table holds the mapping from virtual to physical
addresses.
- The page frame is the actual physical storage in
memory.
- A page table entry, or PTE, is the mapping from virtual
address to page frame, and is usually mostly interpreted by
hardware.
- The page table must be kept in memory, but the translation
lookaside buffer (TLB) is a structure in the CPU's memory
controller that keeps part of the table, to make translation
fast.
- Because simple page tables would be huge, multi-level
page tables are used.
Linux Page Tables
PGD is the page global directory. PTE is page table entry, of
course. PMD is page middle directory.
(Images from O'Reilly's book on Linux device drivers, and from
lvsp.org.)
We don't have time to go into the details right now, but you should be
aware that doing the page tables for a 64-bit processor is a
lot more complicated, when performance is taken into
consideration.
Linux historically used a three-level page table; the x86-64 port
added a fourth level, with 512 entries at each level: "With Andi's
patch, the x86-64 architecture implements a 512-entry PML4 directory,
512-entry PGD, 512-entry PMD, and 512-entry PTE. After various
deductions, that is sufficient to implement a 128TB address space,
which should last for a little while," says Linux Weekly News. As
another example, here is an excerpt from the IA-64 page table
definitions in the Linux kernel:
#define IA64_MAX_PHYS_BITS 50 /* max. number of physical address bits (architected) */
...
/*
* Definitions for fourth level:
*/
#define PTRS_PER_PTE (__IA64_UL(1) << (PTRS_PER_PTD_SHIFT))
Page and Swap Files and Partitions
Historically, VM systems often used a dedicated area of the disk,
known as the swap partition, to hold pages of a process that
have been paged out. There were two good reasons for this:
- The file system address translation and locking overheads were
considered to be too high for the high performance requirements of
VM.
- Separating the area out allows the system manager to manage the
space statically, at system configuration time, so that the VM system
doesn't run into trouble when, e.g., the file system fills up.
Now it is generally accepted that file system performance is
acceptable, and that being able to dynamically (or, at least, without
repartitioning the disk drive) allocate space for swapping is
important.
In modern systems, multiple page files are usually supported, and can
often be added dynamically. See the system call swapon() on
Unix/Linux systems.
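On Linux, adding a page file at run time looks roughly like this. The
path is made up for illustration; the file must first have been
prepared with mkswap, and the call requires root privileges:

#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
    /* "/extra/swapfile" is an illustrative path, not a convention. */
    if (swapon("/extra/swapfile", 0) != 0) {
        perror("swapon");
        return 1;
    }
    printf("swap space added\n");
    return 0;
}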
Paging Daemons
Often, the responsibility for managing pages to be swapped in and out,
especially the I/O necessary, is delegated to a special process known
as the page daemon.
Final Thoughts
- File I/O has a big impact on paging performance, especially in
single-disk systems.
- Relational databases and CPU caches follow many of the same
principles we discussed today.
Impact of Streaming I/O
Streaming I/O (video, audio, etc.) data tends to be used only
once. However, the VM system does not necessarily know this. If the
VM system behaves normally, streaming I/O pages always look recently
referenced, so other, possibly more valuable, pages will be paged out
in preference to them.
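One way an application can help, on POSIX systems, is to call
posix_fadvise() to tell the kernel that streamed data will not be
reused. It is only a hint, and the file name here is made up:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("video.dat", O_RDONLY);   /* illustrative file name */
    if (fd < 0)
        return 1;

    char buf[65536];
    while (read(fd, buf, sizeof buf) > 0) {
        /* ... consume one chunk of the stream ... */
    }

    /* Hint that the cached pages for the whole file can be dropped. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}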
Virtualization
When we get to system virtualization, such as VMware, Xen,
etc., we will see that some of the details of page tables and memory
management change. However, the principles remain the same.
Paging in Other Contexts
The exact same set of algorithms and techniques can be used inside
e.g. a relational database to decide which objects to keep in
memory, and which to flush. The memory cache in a CPU uses
exactly the same set of techniques, except that it must all be
implemented in hardware. The same aging and garbage
collection techniques apply to any finite cache, including a DNS
name translation cache.
Homework, Etc.
Homework
This week's homework:
- Report on your progress on your project.
- Estimate how long it would take to swap out an entire
process on your machine.
- How fast is your disk in megabytes/second, roughly?
- Pick a process on your system (say, Word or Firefox). How big
is it, in MB of memory?
- Divide. How long will it take to write out the whole
process, assuming that it can be written linearly at full
disk bandwidth?
- Go back and rerun your memory copy experiments for sizes up to
100MB or so, and produce a graph with error bars and a linear fit.
What is your Y intercept (the fixed, overhead cost) and your slope
(the per-unit cost)? Tell me why you believe the linear fit does or
does not represent the actual cost of the operation.
- Now run up to sizes much larger than your physical memory. What
happens? Graph the output. (Note: this may take a long time to run!)
Readings for Next Week and Followup for This Week
Additional Information