Keio University
2013 Academic Year, Spring Semester
System Software / Operating Systems
Lecture 7, May 28: Virtual Memory and Page Replacement Algorithms
Outline
- Simple Swapping
- Introduction to Virtual Memory
- Page Replacement Algorithms
- The Mechanics of Paging: Page Tables and Page Files
- Final Thoughts
Simple Swapping
- Multiprogramming originally involved swapping complete
processes in and out of memory.
The original form of multiprogramming involved swapping complete
processes out of memory to a special reserved area of disk (or
drum), and back in again. This approach allowed each process to act
as if it owned all of the memory in the system, without worrying
about other processes. However, swapping a process out and in is not
fast! We want to be able to share the resources of the computer
among multiple processes, allowing fast process context switches so
that multiple programs can appear to be using the CPU and other
resources at the same time.
Introduction to Virtual Memory
- Each process has its own address space.
- Independent address spaces provide protection and naming.
- Page tables are maintained by the OS and used by the
hardware to map logical addresses to physical
addresses.
- The memory reference trace describes the order in which
a program accesses memory.
- When memory pressure is high, the amount of memory in
active use exceeds the amount available, and the system
must page out some data.
Address Spaces: Protection and Naming
Finally, we come to virtual memory (仮想記憶). With virtual
memory, each process has its own address space. This
concept is a very important instance of naming. Virtual memory
(VM) provides several important capabilities:
- VM hides some of the layers of the memory hierarchy.
- VM's most common use is to make memory appear larger than it is.
- VM also provides protection and naming, and those
are independent of the above role.
In most modern microprocessors intended for general-purpose use, a
memory management unit, or MMU, is built into the
hardware. The MMU's job is to translate virtual addresses into
physical addresses.
Page Tables
(thanks to Chishiro for spotting those excellent diagrams on Wikipedia.)
Virtual memory is usually implemented by dividing memory up into
pages, which in Unix systems are typically, but not
necessarily, four kilobytes (4KB) each. The page table is the
data structure that holds the mapping from virtual to physical
addresses. The page frame is the actual physical storage in
memory.
The simplest approach would be a large, flat page table with one entry
per page. The entries are known as page table entries, or
PTEs. However, this approach results in a page table that is
too large to fit inside the MMU itself, meaning that it has to be in
memory. In fact, for a 4GB address space, with 32-bit PTEs and 4KB
pages, the page table alone is 4MB! That's big when you consider that
there might be a hundred processes running on your system.
The solution is multi-level page tables. As the size of the
process grows, additional pages are allocated, and when they are
allocated the matching part of the page table is filled in.
The translation from virtual to physical address must be fast.
This fact argues for as much of the translation as possible to be done
in hardware, but the tradeoff is more complex hardware, and more
expensive process switches. Since it is not practical to put the
entire page table in the MMU, the MMU includes what is called the
TLB: translation lookaside buffer.
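To make this concrete, here is a minimal sketch in C of the walk a
two-level page table implies, using the classic 32-bit split of 10
directory bits, 10 table bits, and 12 offset bits. The types and
names are invented for illustration; a real MMU performs this walk in
hardware and caches the results in the TLB.
#include <stdint.h>
#include <stddef.h>

/* Hypothetical two-level page table for a 32-bit address space with
 * 4KB pages: 10 bits of directory index, 10 bits of table index,
 * 12 bits of page offset. */
#define PAGE_SHIFT 12
#define PAGE_MASK  0xFFFu
#define PT_ENTRIES 1024

typedef struct {
    uint32_t frame;          /* physical page frame number */
    unsigned present : 1;
} pte_t;

typedef struct {
    pte_t *tables[PT_ENTRIES];  /* second-level tables, allocated lazily */
} page_directory_t;

/* Translate a virtual address; returns 0 and sets *phys on success,
 * -1 on a page fault (missing table or missing page). */
int translate(page_directory_t *dir, uint32_t vaddr, uint32_t *phys)
{
    uint32_t dir_idx = vaddr >> 22;                    /* top 10 bits */
    uint32_t tab_idx = (vaddr >> PAGE_SHIFT) & 0x3FFu; /* middle 10 bits */

    pte_t *table = dir->tables[dir_idx];
    if (table == NULL)
        return -1;                                     /* page fault */
    pte_t pte = table[tab_idx];
    if (!pte.present)
        return -1;                                     /* page fault */
    *phys = (pte.frame << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    return 0;
}
Note how the second-level tables are only allocated as the process
grows, which is what keeps the multi-level structure small.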
Memory Pressure and the Reference Trace
We will discuss the process of paging, where parts of memory
are stored on disk when memory pressure is high. The memory
pressure is the number of pages that processes and the kernel are
currently trying to access, compared to the number of physical pages
(or page frames) that are available in the system. When the pressure
is low, everything fits comfortably in memory, and only things that
have never been referenced have to be brought in from disk. When the
pressure is high, not everything fits in memory, and we must swap
out entire processes or page out portions of processes (in
some systems, parts of the kernel may also be pageable).
The reference trace is the way we describe which memory has
recently been used. It is the total history of the system's accesses
to memory. The reference trace is an important theoretical concept,
but we can't track it exactly in the real world, so various algorithms
for keeping approximate track have been developed.
Paging
Paging is the process of moving data between memory and
the backing store where not-in-use data is kept. When the
system decides to reduce the amount of physical memory that a process
is using, it pages out some of the process's memory. The
opposite action, bringing some memory in from the backing store, is
called paging in.
When an application attempts to reference a memory address, and the
hardware cannot complete the access (because the page is not resident
or the access is not permitted), a page fault occurs. The fault
traps into the kernel, which must decide
what to do about it. If the process is not allowed to access the
page, on a Unix machine a segmentation fault is signalled to
the application. If the kernel finds the memory that the application
was attempting to access elsewhere in memory, it can add that page to
the application's address space. We call this a soft fault.
If the desired page must be retrieved from disk, it is known as a
hard fault.
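As a toy illustration of this three-way decision (the state names and
function here are invented, not any real kernel's interface):
#include <stdio.h>

/* Toy classification of a memory reference, mirroring the cases in
 * the text above. */
enum page_state {
    NOT_IN_ADDRESS_SPACE,   /* no mapping, access not allowed */
    IN_MEMORY_UNMAPPED,     /* kernel can find it elsewhere in memory */
    ON_DISK,                /* must be fetched from backing store */
    RESIDENT                /* normal case, no fault at all */
};

const char *classify_fault(enum page_state st)
{
    switch (st) {
    case NOT_IN_ADDRESS_SPACE: return "segmentation fault (SIGSEGV)";
    case IN_MEMORY_UNMAPPED:   return "soft fault: just map the page";
    case ON_DISK:              return "hard fault: schedule disk I/O";
    default:                   return "no fault";
    }
}

int main(void)
{
    printf("%s\n", classify_fault(ON_DISK));
    return 0;
}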
Page Replacement Algorithms
- OPT is the ideal, provably optimal algorithm. However,
it can't be realized in practice.
- LRU is a good compromise, but is difficult to fully
implement.
- Various approximations to LRU are used.
- Decisions on paging can be made either locally or globally.
When the kernel decides to page something in, and the memory is full,
it must decide what to page out. We are looking for several
features in a page replacement algorithm:
- Simple to implement correctly
- Low run-time cost for maintaining data structures
- Minimizes amount of paging activity (in and out) in normal case
- Robust against pathological reference traces
We can demonstrate OPT pretty easily:
Time Step | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9
Reference |     |     |     |     |     |     |     |     |
Page In?  | YES | YES | YES | NO  | YES | NO  | NO  | NO  | YES
Page 0    |     |     |     |     |     |     |     |     |
Page 1    |     |     |     |     |     |     |     |     |
Page 2    |     |     |     |     |     |     |     |     |
The optimal algorithm (known as OPT or the
clairvoyant algorithm) is simple to state: throw out the page that will
not be reused for the longest time in the future. Unfortunately, it's
impossible to implement, since we don't know the exact future
reference trace until we get there!
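Although OPT cannot run online, it is easy to compute offline once a
whole trace has been recorded, which makes it useful as a yardstick
for the other algorithms. A minimal sketch of victim selection, with
invented names, assuming the future references are available in an
array:
/* OPT victim selection (offline only): evict the resident page whose
 * next use lies farthest in the future, or that is never used again. */
int opt_pick_victim(const int *resident, int nframes,
                    const int *future, int flen)
{
    int victim = 0, farthest = -1;
    for (int i = 0; i < nframes; i++) {
        int next = flen;                  /* default: never used again */
        for (int t = 0; t < flen; t++)
            if (future[t] == resident[i]) { next = t; break; }
        if (next > farthest) { farthest = next; victim = i; }
    }
    return victim;
}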
There are many page replacement algorithms, some
of which require extra hardware support. (Most take advantage of the
referenced and modified bits in the PTE.) Here are a
few:
- OPT
- FIFO
- NRU
- LRU
- Clock
- Working set
FIFO
Time Step | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9
Reference |     |     |     |     |     |     |     |     |
Page In?  | YES | YES | YES | NO  | YES | YES | YES | NO  | YES
Page 0    |     |     |     |     |     |     |     |     |
Page 1    |     |     |     |     |     |     |     |     |
Page 2    |     |     |     |     |     |     |     |     |
FIFO is pretty obvious: first-in, first-out. If memory is large
enough to hold your working set, it works okay; if you don't have
enough memory, it doesn't work terribly well. It is, however, easy
to implement.
LRU
Time Step | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9
Reference |     |     |     |     |     |     |     |     |
Page In?  | YES | YES | YES | NO  | YES | NO  | YES | NO  | YES
Page 0    |     |     |     |     |     |     |     |     |
Page 1    |     |     |     |     |     |     |     |     |
Page 2    |     |     |     |     |     |     |     |     |
LRU, or Least Recently Used, is a pretty good
approximation to OPT, but far from perfect. The full implementation
of LRU would require being able to exactly order the PTEs according to
recency of access. A linked list could be used, or a counter stored
in the PTE itself. In either case, every memory access requires
updating an in-memory data structure, which is too expensive.
According to one source, Linux, FreeBSD and Solaris may all use a very
heavily-modified form of LRU. (I suspect this information is out of
date, but have not had time to dig through the Linux kernel yet.)
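For comparison, here is what exact LRU looks like with a timestamp
per frame (invented structures). The stamp update on every single
reference, hit or miss, is the part that makes true LRU impractical:
/* Exact LRU with a timestamp per frame.  Note that *every* reference
 * must update a stamp; this per-access bookkeeping is what makes
 * true LRU too expensive in practice. */
#define NFRAMES 3

static int frames[NFRAMES] = { -1, -1, -1 };   /* -1 = empty frame */
static unsigned long stamp[NFRAMES];           /* time of last reference */
static unsigned long now;                      /* virtual clock */

int lru_reference(int page)          /* returns 1 on page-in, 0 on hit */
{
    int victim = 0;
    now++;
    for (int i = 0; i < NFRAMES; i++) {
        if (frames[i] == page) {
            stamp[i] = now;          /* refresh recency on every hit */
            return 0;
        }
        if (stamp[i] < stamp[victim])
            victim = i;              /* least recently used so far */
    }
    frames[victim] = page;           /* evict the LRU frame */
    stamp[victim] = now;
    return 1;
}
It can be driven by the same kind of trace loop as the FIFO sketch
above.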
NRU
NRU, or Not Recently Used, uses the referenced
and modified bits (Tanenbaum refers to these as the R
and M bits; on x86, they are Accessed and Dirty)
in the PTE to implement a very simple algorithm. The two bits divide
the pages up into four classes. Any page from the lowest occupied
class can be chosen to be replaced. An important feature of this
algorithm is that it prefers to replace pages that have
not been modified, which saves a disk I/O to write them out.
When a clock interrupt is received, all of the R and M
bits in all of the page tables of all of the memory-resident processes
are cleared. (Obviously, this is expensive, but there are a number of
optimizations that can be done.) The R bit is then set
whenever a page is accessed, and the M bit is set whenever a
page is modified. The MMU may do this automatically in hardware, or
it can be emulated by setting the protection bits in the PTE to trap,
then letting the trap handler manipulate R and M
appropriately and clear the protection bits.
Because NRU does not distinguish among the pages in one of the four
classes, it often makes poor decisions about what pages to page
out.
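A minimal sketch of the classification step, with invented structures
(in a real system the R and M bits live in the PTEs and are set by
the MMU or by the trap-based emulation described above):
/* NRU: classify each resident page into Tanenbaum's classes 0-3 by
 * its R and M bits, and evict any page from the lowest occupied
 * class.  Class 0 (not referenced, not modified) is the best victim. */
struct nru_page {
    unsigned r : 1;   /* referenced since the last clock tick */
    unsigned m : 1;   /* modified since it was loaded */
};

int nru_pick_victim(const struct nru_page *pages, int n)
{
    int victim = 0, victim_class = 4;
    for (int i = 0; i < n; i++) {
        int c = 2 * pages[i].r + pages[i].m;
        if (c < victim_class) {
            victim_class = c;
            victim = i;       /* any page in the lowest class would do */
        }
    }
    return victim;
}

void nru_clock_tick(struct nru_page *pages, int n)
{
    for (int i = 0; i < n; i++)
        pages[i].r = 0;       /* periodically forget old references */
}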
Clock (Second Chance)
With the clock algorithm, all of the page frames are ordered in
a ring. The order never has to change, and few changes are made to
the in-memory structures, so its execution performance is good. The
algorithm uses a clock hand that points to a position in the
ring. When a page fault occurs, the memory manager checks the page
currently pointed to. If the R bit is zero, that page is
replaced. If R is one, then R is set to zero, and the
clock hand is advanced until a page with R = 0 is found. This
algorithm is also called second chance.
I believe early versions of BSD used clock; I'm not sure if they still
do.
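A minimal sketch with invented structures; note that the hand only
moves during replacement, and every set R bit buys its page one more
trip around the ring:
/* Clock (second chance): frames form a fixed ring; the hand sweeps
 * forward, clearing R bits, and stops at the first frame whose R bit
 * was already clear. */
#define NFRAMES 8

struct clock_frame {
    int page;          /* resident page, -1 if free */
    unsigned r : 1;    /* referenced bit, set on each access */
};

static struct clock_frame ring[NFRAMES];
static int hand;       /* current clock-hand position */

int clock_pick_victim(void)
{
    for (;;) {
        int i = hand;
        hand = (hand + 1) % NFRAMES;
        if (ring[i].r == 0)
            return i;          /* second chance already used up */
        ring[i].r = 0;         /* clear R: this is the second chance */
    }
}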
Comparing the Algorithms
Algorithm | Data Structures | Page In | Page Out | Reference | Comments
OPT | (list of future references) | (nothing) | Pick one w/ farthest future reference | (nothing) | Nice, but impossible!
FIFO | Linked list of pages | Add to tail of list | Take page from head of list | (nothing) | Simple, fast, no HW support needed, but bad in many cases
LRU | e.g., tree of pages indexed by reference time | Add to tree with current time | Take oldest page from tree | Update page's position in tree using current time | Timestamps consume memory; update on each reference very expensive; essentially impossible to implement exactly
NRU | Just two HW bits (R and M, or A and D) in basic form | Mark as clean, accessed | Find any clean, not-accessed page (preferred) | Set accessed bit, set dirty if a memory write | Straightforward, mostly okay, but makes bad decisions frequently; scanning for page to swap out can be expensive
Clock (Second Chance) | Just one HW bit (A) in basic form | Mark as accessed | Check accessed bit, clear bit and increment clock hand if set, otherwise use | Set accessed bit | Big improvement on NRU, easy to implement; scanning for page to swap out can be expensive; needs timer routine to clear all accessed bits
Working Set
Early on in the history of virtual memory, researchers recognized that
not all pages are accessed uniformly. Every process has some pages
that it accesses frequently, and some that are accessed only
occasionally. The set of pages that a process is currently accessing
is known as its working set. Some VM systems attempt to track
this set, and page it in or out as a unit. Wikipedia says that VMS
uses a form of FIFO, but my recollection is that it actually uses a
form of working set.
In its purest form, working set is an all-or-nothing proposition.
If the OS sees that there are enough pages available to hold your
working set, you are allowed to stay in memory. If there is not
enough memory, then the entire process gets swapped out.
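A minimal sketch of the membership test, assuming per-page
last-reference timestamps are available (as with LRU, real hardware
does not provide these exactly; the names and window size here are
invented):
/* Working set: a page belongs to the working set if it was
 * referenced within the last TAU units of virtual time. */
#define TAU 100

struct ws_page {
    unsigned long last_ref;   /* virtual time of last reference */
};

int working_set_size(const struct ws_page *pages, int n,
                     unsigned long now)
{
    int size = 0;
    for (int i = 0; i < n; i++)
        if (now - pages[i].last_ref <= TAU)
            size++;
    return size;
}
In the pure policy, if this size exceeds the frames available, the
whole process is a candidate to be swapped out.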
Global v. Local (per-process) Replacement
So far, we have described paging in terms of a single process, or a
single reference string. However, in a multiprogrammed machine,
several processes are active at effectively the same time. How do you
allocate memory among the various processes? Is all of memory treated
as one global pool, or are there per-process limits?
Most systems include at least a simple way to set a maximum upper
bound on the size of virtual memory and on the size of the
resident memory set. On a Unix or Linux system, the shell
usually provides a builtin function called ulimit, which
will tell you what those limits are. The corresponding system calls
are getrlimit and setrlimit. VMS has many parameters
that control the behavior of the VM system.
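For example, a process can query these limits itself using the real
POSIX calls (note that RLIMIT_RSS is defined but not enforced on some
modern kernels):
#include <stdio.h>
#include <sys/resource.h>

/* Print this process's virtual-memory and resident-set limits,
 * the same values the shell's ulimit builtin reports. */
int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) == 0)     /* virtual memory size */
        printf("virtual memory: cur=%lld max=%lld\n",
               (long long)rl.rlim_cur, (long long)rl.rlim_max);
    if (getrlimit(RLIMIT_RSS, &rl) == 0)    /* resident set size */
        printf("resident set:   cur=%lld max=%lld\n",
               (long long)rl.rlim_cur, (long long)rl.rlim_max);
    return 0;
}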
Note that while the other algorithms conceivably work well when
treating all processes as a single pool, working set does not.
Linux 2.6.11 Page Frame Reclamation Algorithm (PFRA)
The Linux kernel goes through many extraordinarily complex
operations to find a candidate set of pages that might be
discarded. Once it has that list, it applies the following
algorithm (Fig. 17.5 from Understanding the Linux Kernel, showing
the PFRA):
Caveat emptor: Apparently, the
function free_cold_page() was removed from the kernel in August
2009. (Note added 2013/5/28.)
Anonymous page: An anonymous page is one that is not backed by a
file in the file system, e.g., "pure" process virtual memory
(clean or dirty, with or without a swap page allocated, even with or
without a page frame allocated).
The difference between the page cache and the swap cache
Shared pages may be pointed to by many PTEs, but it's hard to
figure out which ones. Updating a shared page while attempting
to swap it out is therefore fraught with the potential for missed
updates.
Shared pages that have a reserved slot in backing storage are
considered to be part of the swap cache. The swap cache is
a purely conceptual specialization of the page cache. "The first
principal difference between pages in the swap cache rather than the
page cache is that pages in the swap cache always use
swapper_space as their address_space in
page->mapping." (Quoted from Gorman, 11.4, Swap Cache)
The Mechanics of Paging: Page Tables and Page Files
- The exact page table format is determined primarily by
hardware, but its utilization and management are done in
software.
- When pages are paged out, they go to a page file on
disk.
- In many systems, you will see a page daemon process, or
virtual process, that is responsible for managing the paging
I/O.
Reviewing:
- The page table holds the mapping from virtual to physical
addresses.
- The page frame is the actual physical storage in
memory.
- A page table entry, or PTE, is the mapping from virtual
address to page frame, and is usually mostly interpreted by
hardware.
- The page table must be kept in memory, but the translation
lookaside buffer (TLB) is a structure in the CPU's memory
controller that keeps part of the table, to make translation
fast.
- Because simple page tables would be huge, multi-level
page tables are used.
Linux Page Tables
PGD is the page global directory. PTE is page table entry, of
course. PMD is page middle directory.
(Images from O'Reilly's book on Linux device drivers, and from
lvsp.org.)
We don't have time to go into the details right now, but you should be
aware that doing the page tables for a 64-bit processor is a
lot more complicated, when performance is taken into
consideration.
Linux originally used a three-level page table; x86-64 prompted the
addition of a fourth level. Each level supports 512
entries: "With Andi's patch, the x86-64 architecture implements a
512-entry PML4 directory, 512-entry PGD, 512-entry PMD, and 512-entry
PTE. After various deductions, that is sufficient to implement a 128TB
address space, which should last for a little while," says Linux
Weekly News.
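For instance, here is an excerpt that I believe comes from the Linux
kernel's IA-64 page-table definitions: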
#define IA64_MAX_PHYS_BITS 50 /* max. number of physical address bits (architected) */
...
/*
* Definitions for fourth level:
*/
#define PTRS_PER_PTE (__IA64_UL(1) << (PTRS_PER_PTD_SHIFT))
Page and Swap Files and Partitions
Historically, VM systems often used a dedicated area of the disk,
known as the swap partition to hold pages of a process that
have been paged out. There were two good reasons for this:
- The file system address translation and locking overheads were
considered to be too high for the high performance requirements of
VM.
- Separating the area out allows the system manager to manage the
space statically, at system configuration time, so that the system
doesn't have a problem with the VM system when e.g. the file system
fills up.
Now it is generally accepted that file system performance is
acceptable, and that being able to dynamically (or, at least, without
repartitioning the disk drive) allocate space for swapping is
important.
In modern systems, multiple page files are usually supported, and can
often be added dynamically. See the system call swapon() on
Unix/Linux systems.
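A minimal sketch of the call (Linux-specific; the path here is just
an example, the file must already have been prepared with mkswap, and
the caller needs root privileges):
#include <stdio.h>
#include <sys/swap.h>

/* Add a swap file to the system at runtime via swapon(2). */
int main(void)
{
    if (swapon("/swapfile", 0) != 0) {   /* example path, no flags */
        perror("swapon");
        return 1;
    }
    return 0;
}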
Paging Daemons
Often, the responsibility for managing pages to be swapped in and out,
especially the I/O necessary, is delegated to a special process known
as the page daemon.
Final Thoughts
- File I/O has a big impact on paging performance, especially in
single-disk systems.
- Relational databases and CPU caches follow many of the same
principles we discussed today.
Impact of Streaming I/O
Streaming I/O (video, audio, etc.) data tends to be used only
once. However, the VM system does not necessarily know this. If the
VM system behaves normally, streaming I/O pages look recently
referenced, and other, possibly more valuable, pages will be paged
out in preference to them.
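On POSIX systems, one way for an application to tell the kernel about
such an access pattern is posix_fadvise; a minimal sketch (the
wrapper name open_stream is invented):
#include <fcntl.h>

/* Open a file for streaming and hint that it will be read
 * sequentially, so the kernel can read ahead and drop pages early. */
int open_stream(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0)
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* After consuming a region, the application can also call
     * posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED). */
    return fd;
}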
Virtualization
When we get to system virtualization, such as VMware, Xen,
etc., we will see that some of the details of page tables and memory
management change. However, the principles remain the same.
Paging in Other Contexts
The exact same set of algorithms and techniques can be used inside
e.g. a relational database to decide which objects to keep in
memory, and which to flush. The memory cache in a CPU uses
exactly the same set of techniques, except that it must all be
implemented in hardware. The same aging and garbage
collection techniques apply to any finite cache, including a DNS
name translation cache.
Homework, Etc.
Homework
This week's homework:
- Report on your progress on your project.
Readings for Next Week and Followup for This Week
Additional Information