慶應義塾大学 Keio University
2008年度 春学期 Spring Semester 2008
システム・ソフトウェア System Software / Operating Systems
第4回 5月13日 プロセス間通信
Lecture 4, May 13: Inter-process Communication
Outline
Our goal today is to get to Mars.
- Administrivia
- I moved!
- No office hours tomorrow
- Okay, now where were we?
- Finding the trap instruction
- Project proposals are past due!
- Review...
- Inter-Process Communication
- Types and purposes of IPC
- Pipes
- Basic Synchronization: Interrupts, Race Conditions, and Critical
Sections
- Dining Philosophers: Resource Allocation and Deadlock
- Readers and Writers: Consistency
- Sleeping Barber: Requesting service
- Message passing v. shared memory
- Granularity of Locking
- Priority Inversion on Mars
Administrivia
I moved!
My desk is now in Delta N211, if you need to find me.
No office hours tomorrow.
Please see me after class or send me email if you want to see me.
Okay, Now Where Were We?
Finding the Trap Instruction
The assembly code exercise produced interesting results. The
different approaches taken by the library code, even among Linux
distributions, reflect various choices in library design: trading
shared code against efficiency and modularity. Some of the differences
are probably also due to different executable file formats,
such as ELF, which we will not go into; tradeoffs there include
efficiency of loading the file from disk, support for shared
libraries, and support for monitoring and debugging. You may want to
compare your results with those of the other members of the class.
But no one actually found the instruction that traps into the kernel.
Everyone stopped at a call instruction, which calls a
subroutine but not the kernel itself. Even the apparent calls to a
function called write are actually library calls.
On my Linux Fedora Core 6 box, using an i686 kernel on an Intel
Celeron M microprocessor, the assembly version of the program looks
something like this:
.file "tmp-write.c"
.section .rodata
.LC0:
.string "123"
.text
.globl main
.type main, @function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $.LC0, -8(%ebp)
movl $3, 8(%esp)
movl -8(%ebp), %eax
movl %eax, 4(%esp)
movl $1, (%esp)
call write
movl $0, %eax
addl $36, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.ident "GCC: (GNU) 4.1.1 20070105 (Red Hat 4.1.1-51)"
.section .note.GNU-stack,"",@progbits
The starting of the actual program proceeds roughly as follows:
(magic number check finds ELF executable)
_start
__libc_start_main@plt
_dl_runtime_resolve
_dl_fixup
(approximately 1400 instructions later...)
_init
call_gmon_start (only for programs using gmon monitoring)
(approximately 100 instructions later...)
main
Once we get to main, it's only thirteen instructions to
write(), right? Not quite. That call write instruction
actually calls a library wrapper routine that does various
things before actually making the system call...
call write
_dl_runtime_resolve
_dl_fixup
_dl_lookup_symbol_x
(calls strcmp, do_lookup_x...)
(approximately 700 instructions later, hit a break point...)
(gdb) stepi
0x0021e018 in write () from /lib/libc.so.6
1: x/i $pc 0x21e018 : jne 0x21e03c
(gdb)
0x0021e01a in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e01a <__write_nocancel>: push %ebx
(gdb)
0x0021e01b in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e01b <__write_nocancel+1>: mov 0x10(%esp),%edx
(gdb)
0x0021e01f in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e01f <__write_nocancel+5>: mov 0xc(%esp),%ecx
(gdb)
0x0021e023 in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e023 <__write_nocancel+9>: mov 0x8(%esp),%ebx
(gdb)
0x0021e027 in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e027 <__write_nocancel+13>: mov $0x4,%eax
(gdb)
0x0021e02c in __write_nocancel () from /lib/libc.so.6
1: x/i $pc 0x21e02c <__write_nocancel+18>: call *%gs:0x10
(gdb)
0x0095c400 in __kernel_vsyscall ()
1: x/i $pc 0x95c400 <__kernel_vsyscall>: int $0x80
Breakpoint 4, 0x001bc400 in __kernel_vsyscall ()
(gdb) where
#0 0x001bc400 in __kernel_vsyscall ()
#1 0x0027b033 in __write_nocancel () from /lib/libc.so.6
#2 0x08048387 in main () at tmp-write.c:7
(gdb) x/2i __kernel_vsyscall
0x1bc400 <__kernel_vsyscall>: int $0x80
0x1bc402 <__kernel_vsyscall+2>: ret
(gdb)
This week's homework: anything interesting?
There was one report of a limit of a mere 93 processes in the
fork() exercise, which is clearly too few. I'm not yet sure
where the low number comes from. On my system, the limit was well
over ten thousand.
Project proposals are past due!
We haven't had class in three weeks, but project proposals were
supposed to be due 5/1. Who has done them?
A Little Review...
Inter-Process Communication
Types and Purposes of Inter-Process Communication
(IPC)
Last week we saw Unix signals, which are the most primitive
form of inter-process communication. One process can essentially send
about four bits of information to another process using
kill(). Signals are unsolicited and
asynchronous. This information can be used for basic time
synchronization, but that's about it. What are the uses of IPC?
- Time synchronization (I've finished; your turn)
- Resource management (I'm using the disk)
- Data transfer for further processing
Pipes
The simplest form of IPC invented to date is the Unix
pipe. You probably use them every day without even thinking about
it. A pipe connects the stdout of a process generating some
data to the stdin of a process consuming the data.
[rdv@localhost linux-2.6.19]$ find . -name \*.c -print | xargs grep do_fork | more
./kernel/fork.c:long do_fork(unsigned long clone_flags,
./kernel/fork.c: * functions used by do_fork() cannot be used here directly
./arch/um/kernel/process.c: pid = do_fork(CLONE_VM | CLONE_UNTRACED | flags, 0,
./arch/um/kernel/process.c: panic("do_fork failed in kernel_thread, errno = %d", pid);
./arch/um/kernel/syscall.c: ret = do_fork(SIGCHLD, UPT_SP(&current->thread.regs.regs),
./arch/um/kernel/syscall.c: ret = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD,
./arch/um/sys-i386/syscalls.c: ret = do_fork(clone_flags, newsp, &current->thread.regs, 0, parent_tid,
./arch/um/sys-x86_64/syscalls.c: ret = do_fork(clone_flags, newsp, &current->thread.regs, 0, parent_tid,
./arch/cris/arch-v10/kernel/process.c: return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
./arch/cris/arch-v10/kernel/process.c: return do_fork(SIGCHLD, rdusp(), regs, 0, NULL, NULL);
./arch/cris/arch-v10/kernel/process.c: return do_fork(flags, newusp, regs, 0, parent_tid, child_tid);
./arch/cris/arch-v10/kernel/process.c: return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, rdusp(), regs, 0, NULL, NULL);
./arch/cris/kernel/process.c: * sys_clone and do_fork have a new argument, user_tid
--More--
This command involves three processes connected via two pipes: one
each for find, xargs, and more. Oh, and xargs
actually repeatedly forks off calls to grep, which use
xargs's stdout, so the total number of processes involved
is actually larger.
In a VMS system, this kind of operation was substantially more
tedious; pipes are one of the features that made Unix such a
hacker's paradise. (In VMS, such IPC was usually done either with
temporary files, or using an explicit construct known as a
mailbox, although it was also possible to redirect the input
and output files, known as SYS$INPUT and SYS$OUTPUT.)
Pipes work well partly because of the simplicity of naming due
to the semantics of fork. The pipe system call gives a
simple example, showing how two file descriptors are created for the
pipe and shared through the fork. (There is also a special form of
pipe known as a named pipe which we aren't going to discuss, but you
might want to look up if you are interested.)
Pipes also allow for a simple form of parallel processing and
asynchronous operation; the first process in the pipeline can
be reading data from the disk and buffering it through the pipe to the
second process, without the inherent complexities of asynchronous
read operations. The first process automatically blocks when
the pipe's buffers are full, and the second process automatically
blocks when it tries to read from an empty pipe, and is awakened when
data arrives.
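The mechanics can be sketched in a few lines of C. This is a minimal illustration, not code from the lecture: the parent creates a pipe, forks, and reads whatever the child writes into it. The function name pipe_demo and the message text are made up for this example.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a pipe, fork, and have the child send one message to the
 * parent through the pipe.  Returns the number of bytes received. */
ssize_t pipe_demo(char *buf, size_t buflen)
{
    int fds[2];                  /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {              /* child: the producer */
        close(fds[0]);           /* child only writes */
        const char *msg = "hello from the child";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(0);
    }

    /* parent: the consumer; read() blocks until data arrives */
    close(fds[1]);
    ssize_t n = read(fds[0], buf, buflen);
    close(fds[0]);
    waitpid(pid, NULL, 0);       /* reap the child */
    return n;
}
```

The blocking behavior described above comes for free: if the child has not yet written, the parent's read() simply sleeps until data is available.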
Basic Synchronization
Interrupts, Race Conditions, and Critical Sections
The lecture this week covers enormous amounts of ground very quickly.
After much debate with myself, I decided to concentrate on
examples of this week's topic, working on the assumption
that you have done the readings. If you haven't, this lecture may be
difficult to follow. But if I don't make this assumption, we won't
become alchemists, and we won't get to Mars.
- We must achieve mutual exclusion (相互排除、そうごはいじょ)
during what is known as a critical section.
- The simplest solution is to disable interrupts.
- This solution doesn't work on multiprocessors.
- Devices are a limited form of multiprocessor. The solution here
often involves re-testing the state after reenabling interrupts. At
Network Alchemy, we ran into exactly this problem.
- Caches are a problem.
- The test-and-set lock instruction is one multiprocessor
solution; it requires bus support, as well.
- When the lock attempt fails, you can either busy wait or go
to sleep. Busy waiting is known as a spin lock.
- In general, the hardware must support some sort of atomic
primitive, but if the hardware itself is simple enough, there are
answers such as Peterson's solution for alternating execution, and a
ring buffer for a simple producer/consumer system (common for
e.g. network adapters).
- Some systems, such as the SCSI bus, use a simple priority
mechanism, which can be preempting or non-preempting.
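The test-and-set idea can be sketched directly in C. This is a toy spin lock, not production code; it leans on the GCC __sync atomic builtins as a stand-in for the hardware instruction, which is an assumption about your compiler.

```c
/* A minimal spin lock built on an atomic test-and-set.
 * The GCC __sync builtins stand in for the hardware instruction;
 * any atomic-exchange primitive would do. */
typedef volatile int spinlock_t;

void spin_lock(spinlock_t *lock)
{
    /* Atomically set *lock to 1 and get the old value back.
     * If the old value was already 1, someone else holds the lock,
     * so busy-wait until it is released. */
    while (__sync_lock_test_and_set(lock, 1))
        ;  /* spin */
}

void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);  /* atomically store 0 */
}
```

Spinning only makes sense when the lock is expected to be held briefly; otherwise, sleeping (and letting the scheduler run something useful) is the better choice.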
Dining Philosophers: Resource Allocation and Deadlock
哲学者の食事問題 (The Dining Philosophers Problem)
- Plato
- Confucius 孔子(こうし)
- Socrates
- Voltaire
- Descartes
Perhaps the most famous resource management problem is the dining
philosophers problem. Several philosophers (哲学者、てつがくしゃ)
are having dinner at an Italian restaurant, where each philosopher
plans to use two forks to eat his spaghetti. They are sitting at a
round table, with one fork between each pair of philosophers (there
are the same number of forks as philosophers). Each philosopher
spends a random amount of time thinking, then suddenly decides he is
hungry, and grabs the fork on his left, then the fork on his right.
If all of the philosophers grab their left fork at the same time, then
no one will be able to get his or her right fork!
Deadlock occurs, the system comes to a halt, no philosopher can
eat, eventually everyone starves and dies (unless the manager comes
along and notices that everyone is stuck, and forces one philosopher
to put down a fork and restart).
There is an excellent simulation at Northwestern University.
A full mathematical analysis is beyond our purposes at the moment, but
very roughly, if we treat time as discrete, the probability of all
n philosophers not currently eating at a particular time
and all attempting to get forks at the same time is roughly
((1-f)p)^n, where f is the fraction of time that
a philosopher is eating, and p is the probability that,
when not eating, she will attempt to eat. This should also give you
some idea of how difficult it is to debug such a problem:
reproducing the problem is always the first step, and that will
be difficult.
There are many solutions to the problem; one simple one is to have
everyone put down their first fork when they fail to get the
second fork, pause for a few moments (randomly), then try again.
(What are the problems with this?) Another is priority,
either with or without preemption, ordering the philosophers
and letting them decide in order whether or not to pick up forks at a
particular moment. (What are the problems with this
solution?)
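The ordering idea can be made concrete with a small pthreads sketch (the fork numbering scheme and the meal count here are arbitrary choices for illustration). Each philosopher always picks up the lower-numbered of her two forks first; since every acquisition then follows one global order, no cycle of waiting can form.

```c
#include <pthread.h>

#define NPHIL 5     /* philosophers (and forks) */
#define MEALS 100   /* meals per philosopher, chosen arbitrarily */

static pthread_mutex_t fork_mutex[NPHIL];
static int meals_eaten[NPHIL];

/* Always acquire the lower-numbered fork first: a global lock order
 * means the wait-for graph can never contain a cycle. */
static void *philosopher(void *arg)
{
    long id = (long)arg;
    int left = (int)id, right = (int)((id + 1) % NPHIL);
    int first  = left < right ? left : right;
    int second = left < right ? right : left;

    for (int i = 0; i < MEALS; i++) {
        pthread_mutex_lock(&fork_mutex[first]);
        pthread_mutex_lock(&fork_mutex[second]);
        meals_eaten[id]++;                       /* "eat" */
        pthread_mutex_unlock(&fork_mutex[second]);
        pthread_mutex_unlock(&fork_mutex[first]);
    }
    return NULL;
}

/* Run all philosophers to completion; returns the total meals eaten. */
int dine(void)
{
    pthread_t t[NPHIL];
    for (int i = 0; i < NPHIL; i++) {
        pthread_mutex_init(&fork_mutex[i], NULL);
        meals_eaten[i] = 0;
    }
    for (long i = 0; i < NPHIL; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)i);

    int total = 0;
    for (int i = 0; i < NPHIL; i++) {
        pthread_join(t[i], NULL);
        total += meals_eaten[i];
    }
    return total;
}
```

Because the last philosopher reaches for fork 0 before fork 4, the circular wait is broken and the run always terminates.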
One useful technique for finding deadlocks when they occur, or
for designing locking systems, is the concept of a resource
graph. Entities that currently hold or desire certain locks are
nodes in the graph. A directed link is drawn from each node that
desires a resource currently held by someone else to the node holding
the resource. If there is a cycle in the graph, you have
deadlock.
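Cycle detection in such a graph is a straightforward depth-first search. The sketch below uses a hypothetical adjacency-matrix representation with a small fixed node limit; nodes are colored unvisited, on-the-current-path, or finished, and an edge back to a node on the current path means a cycle, hence deadlock.

```c
#define MAXN 16   /* arbitrary cap on graph size for this sketch */

/* color: 0 = unvisited, 1 = on the current DFS path, 2 = finished. */
static int dfs(int n, int adj[MAXN][MAXN], int color[MAXN], int u)
{
    color[u] = 1;
    for (int v = 0; v < n; v++) {
        if (!adj[u][v])
            continue;
        if (color[v] == 1)        /* edge back onto the current path */
            return 1;             /* => cycle */
        if (color[v] == 0 && dfs(n, adj, color, v))
            return 1;
    }
    color[u] = 2;
    return 0;
}

/* adj[u][v] = 1 means "u is waiting for a resource held by v".
 * Returns 1 if the graph contains a cycle, i.e. a deadlock. */
int has_deadlock(int n, int adj[MAXN][MAXN])
{
    int color[MAXN] = {0};
    for (int u = 0; u < n; u++)
        if (color[u] == 0 && dfs(n, adj, color, u))
            return 1;
    return 0;
}
```

Real kernels rarely run such a detector online; the graph is more often drawn by hand (or by a debugger script) after the system has wedged.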
Readers and Writers: Consistency
The dining philosophers problem treats all resources as the same.
However, in many systems, there may be many types of resources,
or they may be different uses of the same set of resources. In
this section, we will talk about how to support two important uses of
objects, reading and writing.
In file systems and databases, it is often necessary to present one
consistent version of a data structure. It may not matter whether a
reader sees the version before a change to the structure takes
place or after, but during would be a problem. The
simplest solution is to put a simple lock on the structure and only
allow one process to read or write the data structure at a
time. This, unfortunately, is not very efficient.
By recognizing that many people can read the data structure without
interfering with each other, we divide the lock into two roles: a
read lock and a write lock. If a process requests
either role, and the structure is idle, the lock is granted. If a
reader arrives and another reader already has the structure locked,
then both are allowed to read. If a writer arrives and the
structure is locked by either a reader or another writer, then
the writer blocks and must wait until the lock is freed.
So far, so good. We have allowed multiple readers or a
single writer. The tricky part is to prioritize appropriately
those who are waiting for the lock. Causing the writer to wait while
allowing new readers to continue to enter risks starving the writer.
The simplest good solution is to queue new read requests
behind the write request. It's not necessarily efficient,
depending on the behavior of the readers, but it will usually do.
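One way to implement this write preference is sketched below, using a mutex and a condition variable. The struct and function names are my own, and this is only a sketch; POSIX also offers a ready-made pthread_rwlock_t. New readers wait whenever any writer is queued, so writers cannot starve.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int readers;         /* readers currently holding the lock */
    int writer;          /* 1 if a writer holds the lock */
    int writers_waiting; /* queued writers; blocks new readers */
} rwlock_t;

void rw_init(rwlock_t *rw)
{
    pthread_mutex_init(&rw->m, NULL);
    pthread_cond_init(&rw->cv, NULL);
    rw->readers = rw->writer = rw->writers_waiting = 0;
}

void rw_read_lock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    /* queue behind any waiting writer so the writer cannot starve */
    while (rw->writer || rw->writers_waiting)
        pthread_cond_wait(&rw->cv, &rw->m);
    rw->readers++;
    pthread_mutex_unlock(&rw->m);
}

void rw_read_unlock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    if (--rw->readers == 0)
        pthread_cond_broadcast(&rw->cv);  /* last reader out: wake writers */
    pthread_mutex_unlock(&rw->m);
}

void rw_write_lock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    rw->writers_waiting++;                /* from now on, new readers wait */
    while (rw->writer || rw->readers > 0)
        pthread_cond_wait(&rw->cv, &rw->m);
    rw->writers_waiting--;
    rw->writer = 1;
    pthread_mutex_unlock(&rw->m);
}

void rw_write_unlock(rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    rw->writer = 0;
    pthread_cond_broadcast(&rw->cv);      /* wake both readers and writers */
    pthread_mutex_unlock(&rw->m);
}
```

The single broadcast keeps the sketch short at the cost of some spurious wakeups; a production lock would use separate condition variables for readers and writers.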
Another alternative is to create versions of the data
structure; rather than modifying it in place, copy part of the
structure so that it can be modified, and allow new readers to
continue to come in and access the old version. Once the modification
(writing) is complete, a simple pointer switch can put the new version
in place so that subsequent readers will get the new version.
Sleeping Barber: Requesting service
Unfortunately, we probably don't have time to discuss this one in
class, but I encourage you to finish the reading, if you haven't.
Message passing v. shared memory
We have mostly discussed various forms of synchronization, although
the title of the lecture was Inter-Process Communication. Now let's
return to the exchange of data between two entities. I say "entities"
because they can be either processes or threads. In the case of
threads, or sometimes processes, messages can be passed via shared
memory. One thread or process passes a pointer to the data
to the other. This is very efficient, because the data does not have
to be copied. However, there are still two issues: resource
management and naming. Control of the message buffer is assumed to
pass from the producer (生産者) to the consumer (消費者).
Questions:
- How is the buffer named?
- How is the size of the buffer or message constrained?
(Cooperatively.)
- What does the consumer do with the buffer once it is done using
the buffer?
- What stops the producer from modifying the buffer once it has sent
the message? (Nothing.) What happens?
This approach is common when controlling devices such as network
interface cards, and the command blocks for devices such as SCSI
adapters.
The alternative is message passing. Message passing involves
copying the data from one process to the other, which is less
efficient, but has lots of advantages. Control of buffers is much
clearer, the contents can't be modified, and the messages also serve
as a natural means of synchronizing and ordering interactions.
Granularity of Locking
The first versions of Linux did not support multiprocessors. The
first version that did (2.0) used what was known as the Big
Kernel Lock. This system has long since been replaced by a much
finer-grained locking scheme.
In file systems or databases, there are tradeoffs between
fine-grained locking and large-grained locking.
Fine-grained locking is a lot harder to get right, and there
are a lot more locking operations, so the efficiency of the locking
operation itself must be considered, but it allows more
concurrency.
Priority Inversion on Mars
...which brings us to Mars.
In 1997, the Mars Pathfinder craft landed on Mars. Its CPU ran the
VxWorks embedded operating system. Famously, this mission had
a software bug related to resource locking and scheduling, so it
serves as a nice bridge between this lecture and the next. It is a
classic case of priority inversion (優先順位の逆転、ゆうせんじゅんい
のぎゃくてん), and shows the necessity of
priority inheritance (優先度継承、ゆうせんどけいしょう).
To understand this problem, you need to know the most basic facts
about scheduling: most schedulers support a priority
mechanism, and no lower-priority task gets to run as long as a
higher-priority one wants the CPU. (We will discuss variations on
this next week, but the basic idea holds.)
The system included numerous tasks, each assigned a different
role. There are several that we care about for the purposes of this
discussion:
- The bus scheduler (top priority among the tasks we are considering)
- The bus transfer task (next priority)
- Several medium-priority science tasks
- The low-priority ASI/MET meteorological (weather) task
In normal operation, the bus scheduler and transfer tasks alternate.
Under abnormal conditions, if the scheduler notices that the transfer
task has not run, the scheduler will reset the entire system. The
problem occurred when ASI/MET acquired a lock needed by the bus
transfer task, then the ASI/MET task was blocked from running
by the medium-priority science tasks. Thus, ASI/MET never gave up the
lock, and the transfer task never got to run, despite being higher
priority than any of the science tasks.
This problem was solved by enabling a mechanism known as priority
inheritance. Assume a low-priority task is holding a particular
resource. When a higher-priority task requests the resource, and
blocks waiting for it, the lower-priority task then inherits
the priority of the task waiting for the resources it holds.
Summary
Although we have invoked relativity and naming a few
times during this lecture, both problems are much more interesting
when you talk about doing inter-process communication in a
distributed fashion.
The general problem of synchronization同期 and resource locking, which
Tanenbaum lumps in with IPC, requires a moderate amount of hardware
support, but allows both HW and SW solutions. This area is one of the
places where you can most directly see the impact of theory on
practice.
Homework
This week's homework is mostly theoretical:
- Ignoring the deadlock for a moment, what happens at the
philosophers' table when one philosopher dies while holding a fork?
- One possible solution to the dining philosophers problem is to
number the forks, and require the philosophers to always pick up
an even-numbered fork first.
- Does this scheme work?
- What happens when there are an odd number of forks?
- Can this scheme be extended to work when you need three
forks to eat?
- Another possible solution, since all forks are identical, is to
pile all of the forks in the middle of the table, and have the
philosophers grab any two when they want to eat. Does this work
better?
- Analyze the probability of deadlock, treating time as discrete,
based on the number of philosophers, the probability of a philosopher
wanting to eat, and the number of forks.
- I give you a red disk that you can use as a marker. How would you
create a protocol that guarantees deadlock avoidance? Is it robust
against the death of one of the philosophers?
- Describe how the original Ethernet CSMA/CD scheme is like the
dining philosophers problem.
- Find the synchronization primitive used in an Intel Core Duo
dual-processor or an AMD on-chip multiprocessor.
- Write a program that copies one chunk of memory to another (your
OS certainly provides some memory copy library routine).
Measure its performance. (We will use this information in later
exercises in the course.)
Next Lecture
Next lecture:
第5回 5月20日 プロセススケジューリング
Lecture 5, May 20: Process Scheduling
Readings for next week and followup for this week:
その他 Additional Information