慶應義塾大学 Keio University
2008年度 春学期 Spring Semester 2008
システム・ソフトウェア System Software / Operating Systems
第4回 5月13日 プロセス間通信
Lecture 4, May 13: Inter-process Communication
Outline
Our goal today is to get to Mars.
- Administrivia
- I moved!
- No office hours tomorrow
- Okay, now where were we?
- Finding the trap instruction
- Project proposals are past due!
- Review...
- Inter-Process Communication
- Types and purposes of IPC
- Pipes
- Basic Synchronization: Interrupts, Race Conditions, and Critical
Sections
- Dining Philosophers: Resource Allocation and Deadlock
- Readers and Writers: Consistency
- Sleeping Barber: Requesting service
- Message passing v. shared memory
- Granularity of Locking
- Priority Inversion on Mars
Administrivia
I moved!
My desk is now in Delta N211, if you need to find me.
No office hours tomorrow.
Please see me after class or send me email if you want to see me.
Okay, Now Where Were We?
Finding the Trap Instruction
The assembly code exercise produced interesting results. The
different approaches taken by the library code, even among Linux
distributions, reflect various choices in library design: trading
shared code against efficiency and modularity. Some of the differences
are probably also due to different executable file formats,
such as ELF, which we will not go into; tradeoffs there include
efficiency of loading the file from disk, support for shared
libraries, and support for monitoring and debugging. You may want to
compare your results with those of the other members of the class.
But no one actually found the instruction that traps into the kernel.
Everyone stopped at a call instruction, which calls a
subroutine but not the kernel itself. Even the apparent calls to a
function called write are actually library calls.
On my Linux Fedora Core 6 box, using an i686 kernel on an Intel
Celeron M microprocessor, the assembly version of the program looks
something like this:
.file "tmp-write.c"
.section .rodata
.LC0:
.string "123"
.text
.globl main
.type main, @function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $.LC0, -8(%ebp)
movl $3, 8(%esp)
movl -8(%ebp), %eax
movl %eax, 4(%esp)
movl $1, (%esp)
call write
movl $0, %eax
addl $36, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.ident "GCC: (GNU) 4.1.1 20070105 (Red Hat 4.1.1-51)"
.section .note.GNU-stack,"",@progbits
The starting of the actual program proceeds roughly as follows:
(magic number check finds ELF executable)
_start
__libc_start_main@plt
_dl_runtime_resolve
_dl_fixup
(approximately 1400 instructions later...)
_init
call_gmon_start (only for programs using gmon monitoring)
(approximately 100 instructions later...)
main
Once we get to main, it's only thirteen instructions to
write(), right? Not quite. That call write instruction
actually calls a library wrapper routine that does various
things before actually making the system call...
call write
_dl_runtime_resolve
_dl_fixup
_dl_lookup_symbol_x
(calls strcmp, do_lookup_x...)
(approximately 700 instructions later, hit a break point...)
(gdb) stepi
0x0021e018 in write () from /lib/libc.so.6
1: x/i $pc 0x21e018 : jne 0x21e03c
(gdb)
0x0021e01a in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e01a <__write_nocancel>: push %ebx
(gdb)
0x0021e01b in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e01b <__write_nocancel+1>: mov 0x10(%esp),%edx
(gdb)
0x0021e01f in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e01f <__write_nocancel+5>: mov 0xc(%esp),%ecx
(gdb)
0x0021e023 in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e023 <__write_nocancel+9>: mov 0x8(%esp),%ebx
(gdb)
0x0021e027 in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e027 <__write_nocancel+13>: mov $0x4,%eax
(gdb)
0x0021e02c in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e02c <__write_nocancel+18>: call *%gs:0x10
(gdb)
0x0095c400 in __kernel_vsyscall ()
1: x/i $pc 0x95c400 <__kernel_vsyscall>: int $0x80
Breakpoint 4, 0x001bc400 in __kernel_vsyscall ()
(gdb) where
#0 0x001bc400 in __kernel_vsyscall ()
#1 0x0027b033 in __write_nocancel () from /lib/libc.so.6
#2 0x08048387 in main () at tmp-write.c:7
(gdb) x/2i __kernel_vsyscall
0x1bc400 <__kernel_vsyscall>: int $0x80
0x1bc402 <__kernel_vsyscall+2>: ret
(gdb)
This week's homework: anything interesting?
There was one report of a limit of a mere 93 processes in the
fork() exercise, which is clearly too few. I'm not yet sure
where the low number comes from. On my system, the limit was well
over ten thousand.
Project proposals are past due!
We haven't had class in three weeks, but project proposals were
supposed to be due 5/1. Who has done them?
A Little Review...
Inter-Process Communication
Types and Purposes of Inter-Process Communication
(IPC)
Last week we saw Unix signals, which are the most primitive
form of inter-process communication. One process can essentially send
about four bits of information to another process using
kill(). Signals are unsolicited and
asynchronous. This information can be used for basic time
synchronization, but that's about it. What are the uses of IPC?
- Time synchronization (I've finished; your turn)
- Resource management (I'm using the disk)
- Data transfer for further processing
Pipes
The simplest form of IPC invented to date is the Unix
pipe. You probably use them every day without even thinking about
it. A pipe connects the stdout of a process generating some
data to the stdin of a process consuming the data.
[rdv@localhost linux-2.6.19]$ find . -name \*.c -print | xargs grep do_fork | more
./kernel/fork.c:long do_fork(unsigned long clone_flags,
./kernel/fork.c: * functions used by do_fork() cannot be used here directly
./arch/um/kernel/process.c: pid = do_fork(CLONE_VM | CLONE_UNTRACED | flags, 0,
./arch/um/kernel/process.c: panic("do_fork failed in kernel_thread, errno = %d", pid);
./arch/um/kernel/syscall.c: ret = do_fork(SIGCHLD, UPT_SP(&current->thread.regs.regs),
./arch/um/kernel/syscall.c: ret = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD,
./arch/um/sys-i386/syscalls.c: ret = do_fork(clone_flags, newsp, &current->thread.regs, 0, parent_tid,
./arch/um/sys-x86_64/syscalls.c: ret = do_fork(clone_flags, newsp, &current->thread.regs, 0, parent_tid,
./arch/cris/arch-v10/kernel/process.c: return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
./arch/cris/arch-v10/kernel/process.c: return do_fork(SIGCHLD, rdusp(), regs, 0, NULL, NULL);
./arch/cris/arch-v10/kernel/process.c: return do_fork(flags, newusp, regs, 0, parent_tid, child_tid);
./arch/cris/arch-v10/kernel/process.c: return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, rdusp(), regs, 0, NULL, NULL);
./arch/cris/kernel/process.c: * sys_clone and do_fork have a new argument, user_tid
--More--
This command involves three processes connected via two pipes: one
each for find, xargs, and more. Oh, and xargs
actually repeatedly forks off calls to grep, which use
xargs's stdout, so the total number of processes involved
is actually larger.
In a VMS system, this kind of operation was substantially more
tedious; pipes are one of the features that made Unix such a
hacker's paradise. (In VMS, such IPC was usually done either with
temporary files, or using an explicit construct known as a
mailbox, although it was also possible to redirect the input
and output files, known as SYS$INPUT and SYS$OUTPUT.)
Pipes work well partly because of the simplicity of naming due
to the semantics of fork. The pipe system call gives a
simple example, showing how two file descriptors are created for the
pipe and shared through the fork. (There is also a special form of
pipe known as a named pipe which we aren't going to discuss, but you
might want to look up if you are interested.)
Pipes also allow for a simple form of parallel processing and
asynchronous operation; the first process in the pipeline can
be reading data from the disk and buffering it through the pipe to the
second process, without the inherent complexities of asynchronous
read operations. The first process automatically blocks when
the pipe's buffers are full, and the second process automatically
blocks when it tries to read from an empty pipe, and is awakened when
data arrives.
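The mechanics can be sketched in a few lines of C. This is a minimal illustration, not code from the lecture: the parent creates a pipe, forks, and reads whatever the child writes into it. The function name pipe_demo and the message text are made up for this example.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a pipe, fork, and have the child send one message to the
 * parent through the pipe.  Returns the number of bytes received. */
ssize_t pipe_demo(char *buf, size_t buflen)
{
    int fds[2];                  /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {              /* child: the producer */
        close(fds[0]);           /* child only writes */
        const char *msg = "hello from the child";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(0);
    }

    /* parent: the consumer; read() blocks until data arrives */
    close(fds[1]);
    ssize_t n = read(fds[0], buf, buflen);
    close(fds[0]);
    waitpid(pid, NULL, 0);       /* reap the child */
    return n;
}
```

The blocking behavior described above comes for free: if the child has not yet written, the parent's read() simply sleeps until data is available.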
Basic Synchronization
Interrupts, Race Conditions, and Critical Sections
The lecture this week covers enormous amounts of ground very quickly.
After much debate with myself, I decided to concentrate on
examples of this week's topic, working on the assumption
that you have done the readings. If you haven't, this lecture may be
difficult to follow. But if I don't make this assumption, we won't
become alchemists, and we won't get to Mars.
- We must achieve mutual exclusion (相互排除、そうごはいじょ)
during what is known as a critical section.
- The simplest solution is to disable interrupts.
- This solution doesn't work on multiprocessors.
- Devices are a limited form of multiprocessor. The solution here
often involves re-testing the state after reenabling interrupts. At
Network Alchemy, we ran into exactly this problem.
- Caches are a problem.
- The test-and-set lock instruction is one multiprocessor
solution; it requires bus support, as well.
- When the lock attempt fails, you can either busy wait or go
to sleep. Busy waiting is known as a spin lock.
- In general, the hardware must support some sort of atomic
primitive, but if the hardware itself is simple enough, there are
answers such as Peterson's solution for alternating execution, and a
ring buffer for a simple producer/consumer system (common for
e.g. network adapters).
- Some systems, such as the SCSI bus, use a simple priority
mechanism, which can be preempting or non-preempting.
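The test-and-set idea can be sketched directly in C. This is a toy spin lock, not production code; it leans on the GCC __sync atomic builtins as a stand-in for the hardware instruction, which is an assumption about your compiler.

```c
/* A minimal spin lock built on an atomic test-and-set.
 * The GCC __sync builtins stand in for the hardware instruction;
 * any atomic-exchange primitive would do. */
typedef volatile int spinlock_t;

void spin_lock(spinlock_t *lock)
{
    /* Atomically set *lock to 1 and get the old value back.
     * If the old value was already 1, someone else holds the lock,
     * so busy-wait until it is released. */
    while (__sync_lock_test_and_set(lock, 1))
        ;  /* spin */
}

void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);  /* atomically store 0 */
}
```

Spinning only makes sense when the lock is expected to be held briefly; otherwise, sleeping (and letting the scheduler run something useful) is the better choice.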
Dining Philosophers: Resource Allocation and Deadlock
哲学者の食事問題 (The Dining Philosophers Problem)
- Plato
- Confucius 孔子(こうし)
- Socrates
- Voltaire
- Descartes
Perhaps the most famous resource management problem is the dining
philosophers problem. Several philosophers (哲学者、てつがくしゃ)
are having dinner at an Italian restaurant, where each philosopher
plans to use two forks to eat his spaghetti. They are sitting at a
round table, with one fork between each pair of philosophers (there
are the same number of forks as philosophers). Each philosopher
spends a random amount of time thinking, then suddenly decides he is
hungry, and grabs the fork on his left, then the fork on his right.
If all of the philosophers grab their left fork at the same time, then
no one will be able to get his or her right fork!
Deadlock occurs, the system comes to a halt, no philosopher can
eat, eventually everyone starves and dies (unless the manager comes
along and notices that everyone is stuck, and forces one philosopher
to put down a fork and restart).
There is an excellent simulation at Northwestern University.
A full mathematical analysis is beyond our purposes at the moment, but
very roughly, if we treat time as discrete, the probability of all
n philosophers not currently eating at a particular time
and all attempting to get forks at the same time is roughly
((1-f)p)^n, where f is the fraction of time that
a philosopher is eating, and p is the probability that,
when not eating, she will attempt to eat. This should also give you
some idea of how difficult it is to debug such a problem:
reproducing the problem is always the first step, and that will
be difficult.
There are many solutions to the problem; one simple one is to have
everyone put down their first fork when they fail to get the
second fork, pause for a few moments (randomly), then try again.
(What are the problems with this?) Another is priority,
either with or without preemption, ordering the philosophers
and letting them decide in order whether or not to pick up forks at a
particular moment. (What are the problems with this
solution?)
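The ordering idea can be made concrete with a small pthreads sketch (the fork numbering scheme and the meal count here are arbitrary choices for illustration). Each philosopher always picks up the lower-numbered of her two forks first; since every acquisition then follows one global order, no cycle of waiting can form.

```c
#include <pthread.h>

#define NPHIL 5     /* philosophers (and forks) */
#define MEALS 100   /* meals per philosopher, chosen arbitrarily */

static pthread_mutex_t fork_mutex[NPHIL];
static int meals_eaten[NPHIL];

/* Always acquire the lower-numbered fork first: a global lock order
 * means the wait-for graph can never contain a cycle. */
static void *philosopher(void *arg)
{
    long id = (long)arg;
    int left = (int)id, right = (int)((id + 1) % NPHIL);
    int first  = left < right ? left : right;
    int second = left < right ? right : left;

    for (int i = 0; i < MEALS; i++) {
        pthread_mutex_lock(&fork_mutex[first]);
        pthread_mutex_lock(&fork_mutex[second]);
        meals_eaten[id]++;                       /* "eat" */
        pthread_mutex_unlock(&fork_mutex[second]);
        pthread_mutex_unlock(&fork_mutex[first]);
    }
    return NULL;
}

/* Run all philosophers to completion; returns the total meals eaten. */
int dine(void)
{
    pthread_t t[NPHIL];
    for (int i = 0; i < NPHIL; i++) {
        pthread_mutex_init(&fork_mutex[i], NULL);
        meals_eaten[i] = 0;
    }
    for (long i = 0; i < NPHIL; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)i);

    int total = 0;
    for (int i = 0; i < NPHIL; i++) {
        pthread_join(t[i], NULL);
        total += meals_eaten[i];
    }
    return total;
}
```

Because the last philosopher reaches for fork 0 before fork 4, the circular wait is broken and the run always terminates.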
One useful technique for finding deadlocks when they occur, or
for designing locking systems, is the concept of a resource
graph. Entities that currently hold or desire certain locks are
nodes in the graph. A directed link is drawn from each node that
desires a resource currently held by someone else to the node holding
the resource. If there is a cycle in the graph, you have
deadlock.
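Cycle detection in such a graph is a straightforward depth-first search. The sketch below uses a hypothetical adjacency-matrix representation with a small fixed node limit; nodes are colored unvisited, on-the-current-path, or finished, and an edge back to a node on the current path means a cycle, hence deadlock.

```c
#define MAXN 16   /* arbitrary cap on graph size for this sketch */

/* color: 0 = unvisited, 1 = on the current DFS path, 2 = finished. */
static int dfs(int n, int adj[MAXN][MAXN], int color[MAXN], int u)
{
    color[u] = 1;
    for (int v = 0; v < n; v++) {
        if (!adj[u][v])
            continue;
        if (color[v] == 1)        /* edge back onto the current path */
            return 1;             /* => cycle */
        if (color[v] == 0 && dfs(n, adj, color, v))
            return 1;
    }
    color[u] = 2;
    return 0;
}

/* adj[u][v] = 1 means "u is waiting for a resource held by v".
 * Returns 1 if the graph contains a cycle, i.e. a deadlock. */
int has_deadlock(int n, int adj[MAXN][MAXN])
{
    int color[MAXN] = {0};
    for (int u = 0; u < n; u++)
        if (color[u] == 0 && dfs(n, adj, color, u))
            return 1;
    return 0;
}
```

Real kernels rarely run such a detector online; the graph is more often drawn by hand (or by a debugger script) after the system has wedged.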
Readers and Writers: Consistency
The dining philosophers problem treats all resources as the same.
However, in many systems, there may be many types of resources,
or they may be different uses of the same set of resources. In
this section, we will talk about how to support two important uses of
objects, reading and writing.
In file systems and databases, it is often necessary to present one
consistent version of a data structure. It may not matter whether a
reader sees the version before a change to the structure takes
place or after, but during would be a problem. The
simplest solution is to put a simple lock on the structure and only
allow one process to read or write the data structure at a
time. This, unfortunately, is not very efficient.
By recognizing that many people can read the data structure without
interfering with each other, we divide the lock into two roles: a
read lock and a write lock. If a process requests
either role, and the structure is idle, the lock is granted. If a
reader arrives and another reader already has the structure locked,
then both are allowed to read. If a writer arrives and the
structure is locked by either a reader or another writer, then
the writer blocks and must wait until the lock is freed.
So far, so good. We have allowed multiple readers or a
single writer. The tricky part is to prioritize appropriately
those who are waiting for the lock. Causing the writer to wait while
allowing new readers to continue to enter risks starving the writer.
The simplest good solution is to queue new read requests
behind the write request. It's not necessarily efficient,
depending on the behavior of the readers, but it will usually do.
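One way to implement this write preference is sketched below, using a mutex and a condition variable. The struct and function names are my own, and this is only a sketch; POSIX also offers a ready-made pthread_rwlock_t. New readers wait whenever any writer is queued, so writers cannot starve.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int readers;         /* readers currently holding the lock */
    int writer;          /* 1 if a writer holds the lock */
    int writers_waiting; /* queued writers; blocks new readers */
} rwlock_t;

void rw_init(rwlock_t *rw)
{
    pthread_mutex_init(&rw->m, NULL);
    pthread_cond_init(&rw->cv, NULL);
    rw->readers = rw->writer = rw->writers_waiting = 0;
}

void rw_read_lock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    /* queue behind any waiting writer so the writer cannot starve */
    while (rw->writer || rw->writers_waiting)
        pthread_cond_wait(&rw->cv, &rw->m);
    rw->readers++;
    pthread_mutex_unlock(&rw->m);
}

void rw_read_unlock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    if (--rw->readers == 0)
        pthread_cond_broadcast(&rw->cv);  /* last reader out: wake writers */
    pthread_mutex_unlock(&rw->m);
}

void rw_write_lock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    rw->writers_waiting++;                /* from now on, new readers wait */
    while (rw->writer || rw->readers > 0)
        pthread_cond_wait(&rw->cv, &rw->m);
    rw->writers_waiting--;
    rw->writer = 1;
    pthread_mutex_unlock(&rw->m);
}

void rw_write_unlock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    rw->writer = 0;
    pthread_cond_broadcast(&rw->cv);      /* wake both readers and writers */
    pthread_mutex_unlock(&rw->m);
}
```

The single broadcast keeps the sketch short at the cost of some spurious wakeups; a production lock would use separate condition variables for readers and writers.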
Another alternative is to create versions of the data
structure; rather than modifying it in place, copy part of the
structure so that it can be modified, and allow new readers to
continue to come in and access the old version. Once the modification
(writing) is complete, a simple pointer switch can put the new version
in place so that subsequent readers will get the new version.
Sleeping Barber: Requesting service
Unfortunately, we probably don't have time to discuss this one in
class, but I encourage you to finish the reading, if you haven't.
Message passing v. shared memory
We have mostly discussed various forms of synchronization, although
the title of the lecture was Inter-Process Communication. Now let's
return to the exchange of data between two entities. I say "entities"
because they can be either processes or threads. In the case of
threads, or sometimes processes, messages can be passed via shared
memory. One thread or process passes a pointer to the data
to the other. This is very efficient, because the data does not have
to be copied. However, there are still two issues: resource
management and naming. Control of the message buffer is assumed to
pass from the producer (生産者) to the consumer (消費者).
Questions:
- How is the buffer named?
- How is the size of the buffer or message constrained?
(Cooperatively.)
- What does the consumer do with the buffer once it is done using
the buffer?
- What stops the producer from modifying the buffer once it has sent
the message? (Nothing.) What happens?
This approach is common when controlling devices such as network
interface cards, and the command blocks for devices such as SCSI
adapters.
The alternative is message passing. Message passing involves
copying the data from one process to the other, which is less
efficient, but has lots of advantages. Control of buffers is much
clearer, the contents can't be modified, and the messages also serve
as a natural means of synchronizing and ordering interactions.
Granularity of Locking
The first versions of Linux did not support multiprocessors. The
first version that did (2.0) used what was known as the Big
Kernel Lock. This system has long since been replaced by a much
finer-grained locking scheme.
In file systems or databases, there are tradeoffs between
fine-grained locking and large-grained locking.
Fine-grained locking is a lot harder to get right, and there
are a lot more locking operations, so the efficiency of the locking
operation itself must be considered, but it allows more
concurrency.
Priority Inversion on Mars
...which brings us to Mars.
In 1997, the Mars Pathfinder craft landed on Mars. Its CPU ran the
VxWorks embedded operating system. Famously, this mission had
a software bug related to resource locking and scheduling, so it
serves as a nice bridge between this lecture and the next. It is a
classic case of priority inversion (優先順位の逆転、ゆうせんじゅんい
のぎゃくてん), and shows the necessity of
priority inheritance (優先度継承、ゆうせんどけいしょう).
To understand this problem, you need to know the most basic facts
about scheduling: most schedulers support a priority
mechanism, and no lower-priority task gets to run as long as a
higher-priority one wants the CPU. (We will discuss variations on
this next week, but the basic idea holds.)
The system included numerous tasks, each assigned a different
role. There are several that we care about for the purposes of this
discussion:
- The bus scheduler (top priority among the tasks we are considering)
- The bus transfer task (next priority)
- Several medium-priority science tasks
- The low-priority ASI/MET meteorological (weather) task
In normal operation, the bus scheduler and transfer tasks alternate.
Under abnormal conditions, if the scheduler notices that the transfer
task has not run, the scheduler will reset the entire system. The
problem occurred when ASI/MET acquired a lock needed by the bus
transfer task, then the ASI/MET task was blocked from running
by the medium-priority science tasks. Thus, ASI/MET never gave up the
lock, and the transfer task never got to run, despite being higher
priority than any of the science tasks.
This problem was solved by enabling a mechanism known as priority
inheritance. Assume a low-priority task is holding a particular
resource. When a higher-priority task requests the resource, and
blocks waiting for it, the lower-priority task then inherits
the priority of the task waiting for the resources it holds.
Summary
Although we have invoked relativity and naming a few
times during this lecture, both problems are much more interesting
when you talk about doing inter-process communication in a
distributed fashion.
The general problem of synchronization同期 and resource locking, which
Tanenbaum lumps in with IPC, requires a moderate amount of hardware
support, but allows both HW and SW solutions. This area is one of the
places where you can most directly see the impact of theory on
practice.
Homework
This week's homework is mostly theoretical:
- Ignoring the deadlock for a moment, what happens at the
philosophers' table when one philosopher dies while holding a fork?
- One possible solution to the dining philosophers problem is to
number the forks, and require the philosophers to always pick up
an even-numbered fork first.
- Does this scheme work?
- What happens when there are an odd number of forks?
- Can this scheme be extended to work when you need three
forks to eat?
- Another possible solution, since all forks are identical, is to
pile all of the forks in the middle of the table, and have the
philosophers grab any two when they want to eat. Does this work
better?
- Analyze the probability of deadlock, treating time as discrete,
based on the number of philosophers, the probability of a philosopher
wanting to eat, and the number of forks.
- I give you a red disk that you can use as a marker. How would you
create a protocol that guarantees deadlock avoidance? Is it robust
against the death of one of the philosophers?
- Describe how the original Ethernet CSMA/CD scheme is like the
dining philosophers problem.
- Find the synchronization primitive used in an Intel Core Duo
dual-processor or an AMD on-chip multiprocessor.
- Write a program that copies one chunk of memory to another (your
OS certainly provides some memory copy library routine).
Measure its performance. (We will use this information in later
exercises in the course.)
Next Lecture
Next lecture:
第5回 5月20日 プロセススケジューリング
Lecture 5, May 20: Process Scheduling
Readings for next week and followup for this week:
その他 Additional Information