But no one actually found the instruction that traps into the kernel. Everyone stopped at a call instruction, which calls a subroutine but not the kernel itself. Even the apparent calls to a function called write are actually library calls.
On my Linux Fedora Core 6 box using an i686 kernel with an Intel Celeron M microprocessor, the assembly version of the program looks something like this:
            .file   "tmp-write.c"
            .section        .rodata
    .LC0:
            .string "123"
            .text
    .globl main
            .type   main, @function
    main:
            leal    4(%esp), %ecx
            andl    $-16, %esp
            pushl   -4(%ecx)
            pushl   %ebp
            movl    %esp, %ebp
            pushl   %ecx
            subl    $36, %esp
            movl    $.LC0, -8(%ebp)
            movl    $3, 8(%esp)
            movl    -8(%ebp), %eax
            movl    %eax, 4(%esp)
            movl    $1, (%esp)
            call    write
            movl    $0, %eax
            addl    $36, %esp
            popl    %ecx
            popl    %ebp
            leal    -4(%ecx), %esp
            ret
            .size   main, .-main
            .ident  "GCC: (GNU) 4.1.1 20070105 (Red Hat 4.1.1-51)"
            .section        .note.GNU-stack,"",@progbits

Startup of the actual program proceeds roughly as follows:
    (magic number check finds ELF executable)
    _start
    __libc_start_main@plt
    _dl_runtime_resolve
    _dl_fixup
    (approximately 1400 instructions later...)
    _init
    call_gmon_start (only for programs using gmon monitoring)
    (approximately 100 instructions later...)
    main

Once we get to main, it's only thirteen instructions to write(), right? Not quite. That call write instruction actually calls a library wrapper routine that does various things before actually making the system call...
    call write
    _dl_runtime_resolve
    _dl_fixup
    _dl_lookup_symbol_x (calls strcmp, do_lookup_x...)
    (approximately 700 instructions later, hit a break point...)
    (gdb) stepi
    0x0021e018 in write () from /lib/libc.so.6
    1: x/i $pc  0x21e018: jne 0x21e03c
    (gdb)
    0x0021e01a in __write_nocancel () from /lib/libc.so.6
    1: x/i $pc  0x21e01a <__write_nocancel>: push %ebx
    (gdb)
    0x0021e01b in __write_nocancel () from /lib/libc.so.6
    1: x/i $pc  0x21e01b <__write_nocancel+1>: mov 0x10(%esp),%edx
    (gdb)
    0x0021e01f in __write_nocancel () from /lib/libc.so.6
    1: x/i $pc  0x21e01f <__write_nocancel+5>: mov 0xc(%esp),%ecx
    (gdb)
    0x0021e023 in __write_nocancel () from /lib/libc.so.6
    1: x/i $pc  0x21e023 <__write_nocancel+9>: mov 0x8(%esp),%ebx
    (gdb)
    0x0021e027 in __write_nocancel () from /lib/libc.so.6
    1: x/i $pc  0x21e027 <__write_nocancel+13>: mov $0x4,%eax
    (gdb)
    0x0021e02c in __write_nocancel () from /lib/libc.so.6
    1: x/i $pc  0x21e02c <__write_nocancel+18>: call *%gs:0x10
    (gdb)
    0x0095c400 in __kernel_vsyscall ()
    1: x/i $pc  0x95c400 <__kernel_vsyscall>: int $0x80

    Breakpoint 4, 0x001bc400 in __kernel_vsyscall ()
    (gdb) where
    #0  0x001bc400 in __kernel_vsyscall ()
    #1  0x0027b033 in __write_nocancel () from /lib/libc.so.6
    #2  0x08048387 in main () at tmp-write.c:7
    (gdb) x/2i __kernel_vsyscall
    0x1bc400 <__kernel_vsyscall>: int $0x80
    0x1bc402 <__kernel_vsyscall+2>: ret
    (gdb)
    [rdv@localhost linux-2.6.19]$ find . -name \*.c -print | xargs grep do_fork | more
    ./kernel/fork.c:long do_fork(unsigned long clone_flags,
    ./kernel/fork.c: * functions used by do_fork() cannot be used here directly
    ./arch/um/kernel/process.c: pid = do_fork(CLONE_VM | CLONE_UNTRACED | flags, 0,
    ./arch/um/kernel/process.c: panic("do_fork failed in kernel_thread, errno = %d", pid);
    ./arch/um/kernel/syscall.c: ret = do_fork(SIGCHLD, UPT_SP(&current->thread.regs.regs),
    ./arch/um/kernel/syscall.c: ret = do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD,
    ./arch/um/sys-i386/syscalls.c: ret = do_fork(clone_flags, newsp, &current->thread.regs, 0, parent_tid,
    ./arch/um/sys-x86_64/syscalls.c: ret = do_fork(clone_flags, newsp, &current->thread.regs, 0, parent_tid,
    ./arch/cris/arch-v10/kernel/process.c: return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
    ./arch/cris/arch-v10/kernel/process.c: return do_fork(SIGCHLD, rdusp(), regs, 0, NULL, NULL);
    ./arch/cris/arch-v10/kernel/process.c: return do_fork(flags, newusp, regs, 0, parent_tid, child_tid);
    ./arch/cris/arch-v10/kernel/process.c: return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, rdusp(), regs, 0, NULL, NULL);
    ./arch/cris/kernel/process.c: * sys_clone and do_fork have a new argument, user_tid
    --More--

This pipeline involves three processes connected via two pipes: one each for find, xargs, and more. Oh, and the xargs actually repeatedly forks off calls to grep, which use xargs's stdout, so the total number of processes involved is actually quite large.
In a VMS system, this kind of operation was substantially more tedious; pipes are one of the features that made Unix such a hacker's paradise. (In VMS, such IPC was usually done either with temporary files, or using an explicit construct known as a mailbox, although it was also possible to redirect the input and output files, known as SYS$INPUT and SYS$OUTPUT.)
Pipes work well partly because of the simplicity of naming due to the semantics of fork. The pipe system call gives a simple example, showing how two file descriptors are created for the pipe and shared across the fork. (There is also a special form of pipe known as a named pipe which we aren't going to discuss, but you might want to look up if you are interested.)
Pipes also allow for a simple form of parallel processing and asynchronous operation; the first process in the pipeline can be reading data from the disk and buffering it through the pipe to the second process, without the inherent complexities of asynchronous read operations. The first process automatically blocks when the pipe's buffers are full, and the second process automatically blocks when it tries to read from an empty pipe, and is awakened when data arrives.
The alternative is message passing. Message passing involves copying the data from one process to the other, which is less efficient, but has lots of advantages. Control of buffers is much clearer, the contents can't be modified, and the messages also serve as a natural means of synchronizing and ordering interactions.
Strict alternation is a very simple solution that requires no hardware support, but does require that the processes cooperate:
    // Process 0 (Alice)
    while (TRUE) {
        while (turn != 0);   // loop until it's our turn
        // now do the critical stuff
        critical_region();
        turn = 1;            // let it be Bob's turn
        // while he's doing critical stuff, we can do non-critical stuff
        noncritical_region();
    }

    // Process 1 (Bob)
    while (TRUE) {
        while (turn != 1);   // loop until it's our turn
        // now do the critical stuff
        critical_region();
        turn = 0;            // let it be Alice's turn
        // while she's doing critical stuff, we can do non-critical stuff
        noncritical_region();
    }
In 1981, Peterson developed a means of doing this in software only, still requiring processes to be well-behaved, but not requiring hardware support, and relaxing the constraint for strictly alternating use.
Before even I was born, let alone any of you, Edsger Dijkstra (yes, that Dijkstra) invented semaphores. You can think of a semaphore as controlling access to some set of resources. It has a name, and a counter for the number of available resources.
The two operations on a semaphore are up() (or V(), in Dijkstra's original terminology) and down() (P()).
You acquire access to a resource by reducing the number available to other people (down). If the number is zero, the call will block until one is available, then you acquire the resource and are allowed to run. You release a resource by increasing the number available (up), which wakes up one of the blocked callers, if there is one.
A mutex (short for mutual exclusion) is a binary semaphore, where you either have the one resource, or you don't. Implementing mutexes is somewhat simpler.
Unfortunately, simply using semaphores and mutexes directly is often very buggy. Today there are many libraries and approaches to controlling access to resources or critical sections of code. The first important higher-level construct was the monitor, developed in two slightly different forms by Tony Hoare and Per Brinch Hansen in the 1970s. (No, this is not the monitor that is your screen. It is also different from the monitor of early OSes such as TOPS-10 and TOPS-20, where the term "monitor" referred to what today we would call the kernel.)
There is an excellent simulation at Northwestern University; a newer version is also available.
A full mathematical analysis is beyond our purposes at the moment, but very roughly, if we treat time as discrete and the philosophers as independent, the probability that all n philosophers are not eating at a particular time and all attempt to get forks at the same time is roughly ((1-f)p)^n, where f is the fraction of time that a philosopher is eating (so 1-f is the fraction of time she is not), and p is the probability that, when not eating, she will attempt to eat. This should also give you some idea of how difficult it is to debug such a problem: reproducing the problem is always the first step, and that will be difficult.
There are many solutions to the problem; one simple one is to have everyone put down their first fork when they fail to get the second fork, pause for a few moments (randomly), then try again. (What are the problems with this?) Another is priority, either with or without preemption, ordering the philosophers and letting them decide in order whether or not to pick up forks at a particular moment. (What are the problems with this solution?)
One useful technique for finding deadlocks when they occur, or for designing locking systems, is the concept of a resource graph. Entities that currently hold or desire certain locks are nodes in the graph. A directed link is drawn from each node that desires a resource currently held by someone else to the node holding the resource. If there is a cycle in the graph, you have deadlock.
In file systems and databases, it is often necessary to present one consistent version of a data structure. It may not matter whether a reader sees the version before a change to the structure takes place or after, but during would be a problem. The simplest solution is to put a simple lock on the structure and only allow one process to read or write the data structure at a time. This, unfortunately, is not very efficient.
By recognizing that many people can read the data structure without interfering with each other, we divide the lock into two roles: a read lock and a write lock. If a process requests either role, and the structure is idle, the lock is granted. If a reader arrives and another reader already has the structure locked, then both are allowed to read. If a writer arrives and the structure is locked by either a reader or another writer, then the writer blocks and must wait until the lock is freed.
So far, so good. We have allowed multiple readers or a single writer. The tricky part is to prioritize appropriately those who are waiting for the lock. Causing the writer to wait while allowing new readers to continue to enter risks starving the writer. The simplest good solution is to queue new read requests behind the write request. It's not necessarily efficient, depending on the behavior of the readers, but it will usually do.
Another alternative is to create versions of the data structure; rather than modifying it in place, copy part of the structure so that it can be modified, and allow new readers to continue to come in and access the old version. Once the modification (writing) is complete, a simple pointer switch can put the new version in place so that subsequent readers will get the new version.
One of the trickiest problems is the lost wakeup. We encountered this in an operating system development project I was managing in the year 2000, in the form of a lost interrupt.
Can you see a problem with this?
In file systems or databases, there are tradeoffs between fine-grained locking and large-grained locking. Fine-grained locking is a lot harder to get right, and there are a lot more locking operations so the efficiency of the locking operation itself must be considered, but it allows more concurrency.
To understand this problem, you need to know the most basic facts about scheduling: most schedulers support a priority mechanism, and no lower-priority task gets to run as long as a higher-priority one wants the CPU. (We will discuss variations on this next week, but the basic idea holds.) The system included numerous tasks, each assigned a different role. There are several that we care about for the purposes of this discussion:
This problem was solved by enabling a mechanism known as priority inheritance. Assume a low-priority task is holding a particular resource. When a higher-priority task requests the resource, and blocks waiting for it, the lower-priority task then inherits the priority of the task waiting for resources it holds.
The general problem of synchronization and resource locking, which Tanenbaum lumps in with IPC, requires a moderate amount of hardware support, but allows both HW and SW solutions. This area is one of the places where you can most directly see the impact of theory on practice.
第6回 4月25日 プロセススケジューリング
Lecture 6, April 25: Process Scheduling
Readings for next week and followup for this week: