慶應義塾大学
2017年度春学期

Operating Systems

2017年度春学期
開講場所：SFC
授業形態：講義
担当: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

第11回 5月18日プロセス間通信、コンカランシー、デッドロック
Lecture 11, May 18: Inter-process Communication, Concurrency, Deadlock

Outline

Our goal today is to get to Mars.

Inter-Process Communication
- Types and purposes of IPC
- Pipes
- Message passing v. shared memory
Synchronization
- Basic Synchronization: Interrupts, Race Conditions and Critical Sections
- Simple Software Solutions
- Semaphores
- Mutexes
- Monitors
Problems: Deadlock and More

Dining Philosophers: Resource allocation and deadlock
Readers and Writers: Consistency
Sleeping Barber: Requesting service
Granularity of Locking
Priority Inversion on Mars

Finding the Trap Instruction

Some old info you might find useful:

The assembly code exercise produced interesting results. The different approaches of the library codes, even among Linux distributions, showed various choices in library design: trading shared code versus efficiency and modularity. Some of the differences are probably also due to different executable file formats, such as ELF, which we will not go into; tradeoffs there include efficiency of loading the file from disk, support for shared libraries, monitoring and debugging. You may want to compare your results with the other members of the class.

But no one actually found the instruction that traps into the kernel. Everyone stopped at a call instruction, which calls a subroutine but not the kernel itself. Even the apparent calls to a function called write are actually library calls.

On my Linux Fedora Core 6 box using an i686 kernel with an Intel Celeron M microprocessor, the assembly version of the program looks something like this:

	.file	"tmp-write.c"
	.section	.rodata
.LC0:
	.string	"123"
	.text
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$36, %esp
	movl	$.LC0, -8(%ebp)
	movl	$3, 8(%esp)
	movl	-8(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	$1, (%esp)
	call	write
	movl	$0, %eax
	addl	$36, %esp
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.ident	"GCC: (GNU) 4.1.1 20070105 (Red Hat 4.1.1-51)"
	.section	.note.GNU-stack,"",@progbits

The starting of the actual program proceeds roughly as follows:

(magic number check finds ELF executable)

_start
__libc_start_main@plt
_dl_runtime_resolve
_dl_fixup 
(approximately 1400 instructions later...)
_init
call_gmon_start (only for programs using gmon monitoring)
(approximately 100 instructions later...)
main

Once we get to main, it's only thirteen instructions to write(), right? Not quite. That call write instruction actually calls a library wrapper routine that does various things before actually making the system call...

call write
_dl_runtime_resolve
_dl_fixup
_dl_lookup_symbol_x
(calls strcmp, do_lookup_x...)
(approximately 700 instructions later, hit a break point...)
(gdb) stepi
0x0021e018 in write () from /lib/libc.so.6
1: x/i $pc  0x21e018 :	jne    0x21e03c 
(gdb) 
0x0021e01a in __write_nocancel () from /lib/libc.so.6
1: x/i $pc  0x21e01a <__write_nocancel>:	push   %ebx
(gdb) 
0x0021e01b in __write_nocancel () from /lib/libc.so.6
1: x/i $pc  0x21e01b <__write_nocancel+1>:	mov    0x10(%esp),%edx
(gdb) 
0x0021e01f in __write_nocancel () from /lib/libc.so.6
1: x/i $pc  0x21e01f <__write_nocancel+5>:	mov    0xc(%esp),%ecx
(gdb) 
0x0021e023 in __write_nocancel () from /lib/libc.so.6
1: x/i $pc  0x21e023 <__write_nocancel+9>:	mov    0x8(%esp),%ebx
(gdb) 
0x0021e027 in __write_nocancel () from /lib/libc.so.6
1: x/i $pc  0x21e027 <__write_nocancel+13>:	mov    $0x4,%eax
(gdb) 
0x0021e02c in __write_nocancel () from /lib/libc.so.6
1: x/i $pc  0x21e02c <__write_nocancel+18>:	call   *%gs:0x10
(gdb) 
0x0095c400 in __kernel_vsyscall ()
1: x/i $pc  0x95c400 <__kernel_vsyscall>:	int    $0x80

Breakpoint 4, 0x001bc400 in __kernel_vsyscall ()
(gdb) where
#0  0x001bc400 in __kernel_vsyscall ()
#1  0x0027b033 in __write_nocancel () from /lib/libc.so.6
#2  0x08048387 in main () at tmp-write.c:7
(gdb) x/2i __kernel_vsyscall
0x1bc400 <__kernel_vsyscall>:	int    $0x80
0x1bc402 <__kernel_vsyscall+2>:	ret    
(gdb)

Inter-Process Communication

Types and Purposes of Inter-Process Communication (IPC)

Last week we saw Unix signals, which are the most primitive form of inter-process communication. One process can essentially send about four bits of information to another processor using kill(). Signals are unsolicited and asynchronous. This information can be used for basic time synchronization, but that's about it. What are the uses of IPC?

Time synchronization (I've finished; your turn)
Resource management (I'm using the disk)
Data transfer for further processing

Pipes

The simplest form of IPC invented to date is the Unix pipe. You probably use them every day without even thinking about it. A pipe connects the stdout of a process generating some data to the stdin of a process consuming the data.

[rdv@localhost linux-2.6.19]$ find . -name \*.c -print | xargs grep do_fork | more
./kernel/fork.c:long do_fork(unsigned long clone_flags,
./kernel/fork.c: * functions used by do_fork() cannot be used here directly
./arch/um/kernel/process.c:     pid = do_fork(CLONE_VM | CLONE_UNTRACED | flags, 0,
./arch/um/kernel/process.c:             panic("do_fork failed in kernel_thread, errno = %d", pid);
./arch/um/kernel/syscall.c:     ret = do_fork(SIGCHLD, UPT_SP(¤t->thread.regs.regs),
./arch/um/kernel/syscall.c:     ret = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD,
./arch/um/sys-i386/syscalls.c:  ret = do_fork(clone_flags, newsp, ¤t->thread.regs, 0, parent_tid,
./arch/um/sys-x86_64/syscalls.c:        ret = do_fork(clone_flags, newsp, ¤t->thread.regs, 0, pare
nt_tid,
./arch/cris/arch-v10/kernel/process.c:        return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, ®s
, 0, NULL, NULL);
./arch/cris/arch-v10/kernel/process.c:  return do_fork(SIGCHLD, rdusp(), regs, 0, NULL, NULL);
./arch/cris/arch-v10/kernel/process.c:  return do_fork(flags, newusp, regs, 0, parent_tid, child_tid);
./arch/cris/arch-v10/kernel/process.c:        return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, rdusp(), 
regs, 0, NULL, NULL);
./arch/cris/kernel/process.c: *  sys_clone and do_fork have a new argument, user_tid
--More--

involves three processes connected via two pipes: one each for find, xargs, and more. Oh, and the xargs actually repeatedly forks off calls to grep, which use xarg's stdout, so the total number of processes involved is actually large.

In a VMS system, this kind of operation was substantially more tedious; pipes are one of the features that made Unix such a hacker's paradise. (In VMS, such IPC was usually done either with temporary files, or using an explicit construct known as a mailbox, although it was also possible to redirect the input and output files, known as SYS$INPUT and SYS$OUTPUT.)

Pipes work well partly because of the simplicity of naming due to the semantics of fork. The pipe system call gives a simple example, showing how two file descriptors are created for pipe and shared through the fork. (There is also a special form of pipe known as a named pipe which we aren't going to discuss, but you might want to look up if you are interested.)

Pipes also allow for a simple form of parallel processing and asynchronous operation; the first process in the pipeline can be reading data from the disk and buffering it through the pipe to the second process, without the inherent complexities of asynchronous read operations. The first process automatically blocks when the pipe's buffers are full, and the second process automatically blocks when it tries to read from an empty pipe, and is awakened when data arrives.

Message passing v. shared memory

We have mostly discussed various forms of synchronization, although the title of the lecture was Inter-Process Communication. Now let's return to the exchange of data between two entities. I say "entities" because they can be either processes or threads. In the case of threads, or sometimes processes, messages can be passed via shared memory. One thread or process passes a pointer to the data to the other. This is very efficient, because the data does not have to be copied. However, there are still two issues: resource management and naming. Control of the message buffer is assumed to pass from the producer (生産者) to the consumer (消費者). Questions:

How is the buffer named?
How is the size of the buffer or message constrained? (Cooperatively.)
What does the consumer do with the buffer once it is done using the buffer?
What stops the producer from modifying the buffer once it has sent the message? (Nothing.) What happens?

This approach is common when controlling devices such as network interface cards, and the command blocks for devices such as SCSI adapters.

The alternative is message passing. Message passing involves copying the data from one process to the other, which is less efficient, but has lots of advantages. Control of buffers is much clearer, the contents can't be modified, and the messages also serve as a natural means of synchronizing and ordering interactions.

Synchronization

Basic Synchronization
Interrupts, Race Conditions, and Critical Sections

The lecture this week covers enormous amounts of ground very quickly. After much debate with myself, I decided to concentrate on examples of this week's topic, working on the assumption that you have done the readings. If you haven't, this lecture may be difficult to follow. But if I don't make this assumption, we won't become alchemists, and we won't get to Mars.

Race conditions on global variables:

Hold your hands up in front of your face, palms toward you, fingers pointed toward each other. How many different ways can you interleave your fingers?

We must achieve mutual exclusion (相互排除、そうごはいじょ) during what is known as a critical section.
The simplest solution is to disable interrupts.
This solution doesn't work on multiprocessors.
Devices are a limited form of multiprocessor. The solution here often involves re-testing the state after reenabling interrupts. At Network Alchemy, we ran into exactly this problem.
Caches are a problem.
The test and set lock instruction is one multiprocessor solution; it requires bus support, as well.
When the lock attempt fails, you can either busy wait or go to sleep. Busy waiting is known as a spin lock.
In general, the hardware must support some sort of atomic primitive, but if the hardware itself is simple enough, there are answers such as Peterson's solution for alternating execution, and a ring buffer for a simple producer/consumer system (common for e.g. network adapters).
Some systems, such a SCSI bus, use a simple priority mechanism, which can preempting or non-preempting.

Simple Software Solutions

Strict alternation is a very simple solution that requires no hardware support, but does require that the processes cooperate:

// Process 0 (Alice)
while (TRUE) {
  while (turn != 0); // loop until it's our turn
  // now do the critical stuff
  critical_region();
  turn = 1; // let it be Bob's turn
  // while he's doing critical stuff, we can do non-critical stuff
  noncritical_region();
}

// Process 1 (Bob)
while (TRUE) {
  while (turn != 1); // loop until it's our turn
  // now do the critical stuff
  critical_region();
  turn = 0; // let it be Alice's turn
  // while she's doing critical stuff, we can do non-critical stuff
  noncritical_region();
}

In 1981, Peterson developed a means of doing this in software only, still requiring processes to be well-behaved, but not requiring hardware support, and relaxing the constraint for strictly alternating use.

Semaphore

Before even I was born, let alone any of you, Edsger Dijkstra (yes, that Dijkstra) invented semaphores. You can think of a semaphore as controlling access to some set of resources. It has a name, and a counter for the number of available resources.

The two operations on a semaphore are up() (or V(), in Dijkstra's original terminology) and down() (P()).

You acquire access to a resource by reducing the number available to other people (down). If the number was zero, the call will block until one is available, then you acquire the resource and are allowed to run.

Mutex

A mutex or mutual exclusion, is a binary semaphore, where you either have the one resource, or you don't. Implementing mutexes is somewhat simpler.

Monitor

Unfortunately, simply using semaphores and mutexes directly is often very buggy. Today there are many libraries and approaches to controlling access to resources or critical sections of code. The first important higher-level construct was the monitor, developed in two slightly different forms by Tony Hoare and Per Brinch Hansen in the 1970s. (No, this is not the monitor that is your screen. It is also different from the monitor of early OSes such as TOPS-10 and TOPS-20, where the term "monitor" referred to what today we would call the kernel.)

Problems: Deadlock and More

Dining Philosophers: Resource allocation and deadlock
哲学者の食事問題

"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone." (Illogical statute passed by the Kansas legislature.) Dining philosophers, from Wikipedia

Plato
Confucius 孔子（こうし）
Socrates
Voltaire
Descartes

Perhaps the most famous resource management problem is the dining philosophers problem. Several philosophers (哲学者、てつがくしゃ) are having dinner at an Italian restaurant, where each philosopher plans to use two forks to eat his spaghetti. They are sitting at a round table, with one fork between each pair of philosophers (there are the same number of forks as philosophers). Each philosopher spends a random amount of time thinking, then suddenly decides he is hungry, and grabs the fork on his left, then the fork on his right. If all of the philosophers grab their left fork at the same time, then no one will be able to get his or her right fork! Deadlock occurs, the system comes to a halt, no philosopher can eat, eventually everyone starves and dies (unless the manager comes along and notices that everyone is stuck, and forces one philosopher to put down a fork and restart).

There is an excellent simulation at Northwestern University. Newer version at newer version.

A full mathematical analysis is beyond our purposes at the moment, but very roughly, if we treat time as discrete, the probability of all n philosophers not currently eating at a particular time and all attempting to get forks at the same time is roughly (fp)^n where f is the fraction of time that a philosopher is eating, and p is the probability that, when not eating, she will attempt to eat. This should also give you some idea of how difficult it is to debug such a problem: reproducing the problem is always the first step, and that will be difficult.

There are many solutions to the problem; one simple one is to have everyone put down their first fork when they fail to get the second fork, pause for a few moments (randomly), then try again. (What are the problems with this?) Another is priority, either with or without preemption, ordering the philosophers and letting them decide in order whether or not to pick up forks at a particular moment. (What are the problems with this solution?)

One useful technique for finding deadlocks when they occur, or for designing locking systems, is the concept of a resource graph. Entities that currently hold or desire certain locks are nodes in the graph. A directed link is drawn from each node that desires a resource currently held by someone else to the node holding the resource. If there is a cycle in the graph, you have deadlock.

Readers and Writers: Consistency

The dining philosophers problem treats all resources as the same. However, in many systems, there may be many types of resources, or they may be different uses of the same set of resources. In this section, we will talk about how to support two important uses of objects, reading and writing.

In file systems and databases, it is often necessary to present one consistent version of a data structure. It may not matter whether a reader sees the version before a change to structure takes place or after, but during would be a problem. The simplest solution is to put a simple lock on the structure and only allow one process to read or write the data structure at a time. This, unfortunately, is not very efficient.

By recognizing that many people can read the data structure without interfering with each other, we divide the lock into two roles: a read lock and a write lock. If a process requests either role, and the structure is idle, the lock is granted. If a reader arrives and another reader already has the structure locked, then both are allowed to read. If a writer arrives and the structure is locked by either a reader or another writer, then the writer blocks and must wait until the lock is freed.

So far, so good. We have allowed multiple readers or a single writer. The tricky part is to prioritize appropriately those who are waiting for the lock. Causing the writer to wait while allowing new readers to continue to enter risks starving the writer. The simplest good solution is to queue new read requests behind the write request. It's not necessarily efficient, depending on the behavior of the readers, but it will usually do.

Another alternative is to create versions of the data structure; rather than modifying it in place, copy part of the structure so that it can be modified, and allow new readers to continue to come in and access the old version. Once the modification (writing) is complete, a simple pointer switch can put the new version in place so that subsequent readers will get the new version.

Sleeping Barber: Requesting service

Unfortunately, we probably don't have time to discuss this one in class, but I encourage you to finish the reading, if you haven't.

The Lost Wakeup

One of the trickiest problems is the lost wakeup. We encountered this in an operating system development project I was managing in the year 2000, in the form of a lost interrupt.

A packet arriving at the NIC sets a particular flag and interrupts the processor.
The CPU takes the interrupt, and calls the interrupt handler with interrupts disabled.
Figure out which interrupt happened
clearing the flag that says what interrupt happened.
Deal with the interrupt.
The handler checks the flags to see if any more packets had arrived.
Interrupts are reenabled.
Return from interrupt to regular processing.

Can you see a problem with this?

Granularity of Locking

The first versions of Linux did not support multiprocessors. The first version that did (2.0?) used what was known as the Giant Kernel Lock. This system has long since been replaced by a much finer-grained locking scheme.

In file systems or databases, there are tradeoffs between fine-grained locking and large-grained locking. Fine-grained locking is a lot harder to get right, and there are a lot more locking operations so the efficiency of the locking operation itself must be considered, but it allows more concurrency.

Priority Inversion on Mars

...which brings us to Mars.

In 1997, the Mars Pathfinder craft landed on Mars. Its CPU ran the VxWorks embedded operating system. Famously, this mission had a software bug related to resource locking and scheduling, so it serves as a nice bridge between this lecture and the next. It is a classic case of priority inversion (優先順位の逆転、ゆうせんじゅんいのぎゃくてん), and shows the necessity of priority inheritance (優先度継承、ゆうせんどけいしょう）.

To understand this problem, you need to know the most basic facts about scheduling: most schedulers support a priority mechanism, and no lower-priority task gets to run as long as a higher-priority one wants the CPU. (We will discuss variations on this next week, but the basic idea holds.) The system included numerous tasks, each assigned a different role. There are several that we care about for the purposes of this discussion:

The bus scheduler (top priority among the tasks we are considering)
The bus transfer task (next priority)
Several medium-priority science tasks
The low-priority ASI/MET meteorological (weather) task

In normal operation, the bus scheduler and transfer tasks alternate. Under abnormal conditions, if the scheduler notices that the transfer task has not run, the scheduler will reset the entire system. The problem occurred when ASI/MET acquired a lock needed by the bus transfer task, then the ASI/MET task was blocked from running by the medium-priority science tasks. Thus, ASI/MET never gave up the lock, and the transfer task never got to run, despite being higher priority than any of the science tasks.

This problem was solved by enabling a mechanism known as priority inheritance. Assume a low-priority task is holding a particular resource. When a higher-priority task requests the resource, and blocks waiting for it, the lower-priority task then inherits the priority of the task waiting for resources it holds.

Summary

Although we have invoked relativity and naming a few times during this lecture, both problems are much more interesting when you talk about doing inter-process communication in a distributed fashion.

The general problem of synchronization and resource locking, which Tanenbaum lumps in with IPC, requires a moderate amount of hardware support, but allows both HW and SW solutions. This area is one of the places where you can most directly see the impact of theory on practice.

Homework

Better get serious about your projects.

Next Lecture

Next lecture:

第6回 4月25日プロセススケジューリング
Lecture 6, April 25: Process Scheduling

Readings for next week and followup for this week:

Tanenbaum, 2.5
Japanese Wikipedia page on the dining philosophers
Mars Pathfinder priority inversion problem
Mars Pathfinder response from the team leader
優先度継承 priority inheritance at Wikipedia.
優先順位の逆転 priority inversion at Wikipedia.
If you are interested in deadlock, I highly recommend Chapter 3 of Tanenbaum.
The paper Scalable synchronous queues, by Scherer et al., presents a new means of managing concurrency.