In the implementations of FreeBSD, Linux, and most modern UNIX-like OSes, you will find two terms: anonymous pages and page cache pages. Anonymous pages are part of a process's virtual memory; they may have space set aside for them somewhere on disk (swap), but they are not part of a file. Pages in the page cache, on the other hand, correspond directly to part of a file. Typically they have space assigned on disk, but if they are brand new, the OS may not have gotten around to assigning it yet.
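From user space, the two kinds of pages can be seen through mmap(). The following is only a sketch, assuming a POSIX system with a writable /tmp; the function name demo_mappings and the temporary file name are made up for illustration. It maps one anonymous page (zero-filled, backed by no file) and one file-backed page, then checks that a store through the file mapping is visible to an ordinary pread() on the same file.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS, mkstemp on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if both mappings behave as described. */
int demo_mappings(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize > 0);
    size_t len = (size_t)pagesize;
    int ok = 0;

    /* Anonymous page: part of our address space, not part of any file. */
    char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (anon == MAP_FAILED)
        return -1;
    int anon_zeroed = (anon[0] == 0);   /* fresh anonymous memory reads as 0 */
    memcpy(anon, "anon", 5);

    /* File-backed page: corresponds directly to part of a file. */
    char path[] = "/tmp/pagecache_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        munmap(anon, len);
        return -1;
    }
    unlink(path);                        /* file disappears once fd is closed */
    if (ftruncate(fd, (off_t)len) == 0) {
        char *filep = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (filep != MAP_FAILED) {
            memcpy(filep, "file", 4);    /* store through the mapping */
            msync(filep, len, MS_SYNC);  /* be conservative about visibility */
            char buf[5] = {0};
            if (pread(fd, buf, 4, 0) == 4 && strcmp(buf, "file") == 0)
                ok = anon_zeroed;
            munmap(filep, len);
        }
    }
    close(fd);
    munmap(anon, len);
    return ok ? 0 : -1;
}
```

Calling demo_mappings() from main() and checking for a 0 return is enough to try it. The msync() call is there because POSIX does not promise that read() sees mapped stores without it, even though systems with a merged page cache usually behave that way.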
Those two mechanisms used to be very different, but in both FreeBSD and Linux the two systems merged quite a while ago. (The old SunOS kernel I did some work on when I was a grad student still had somewhat independent mechanisms for the two, which was very confusing.) We will take a lightweight look at these mechanisms over the next couple of lectures.
A disk drive stores data in sectors that are held on tracks; all of the tracks at the same distance from the spindle together form a cylinder.
It uses a read/write head attached to a slider, mounted on an actuator arm, to read and write the data as it spins past.
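To make the geometry concrete, here is a small sketch of the classic conversion between a (cylinder, head, sector) address and a linear block number. It assumes an idealized drive with the same number of sectors on every track, which real modern drives do not have; the struct and function names are hypothetical, and sectors are numbered from 1 as in the traditional CHS convention.

```c
#include <assert.h>
#include <stdint.h>

/* Idealized geometry: every track holds the same number of sectors. */
struct geometry {
    uint32_t heads;              /* tracks per cylinder */
    uint32_t sectors_per_track;
};

struct chs {
    uint32_t cylinder;           /* 0-based */
    uint32_t head;               /* 0-based */
    uint32_t sector;             /* 1-based, by convention */
};

/* Linear block number of a CHS address. */
uint64_t chs_to_lba(struct geometry g, struct chs a)
{
    assert(a.head < g.heads);
    assert(a.sector >= 1 && a.sector <= g.sectors_per_track);
    return ((uint64_t)a.cylinder * g.heads + a.head) * g.sectors_per_track
           + (a.sector - 1);
}

/* Inverse mapping: which cylinder, head, and sector hold a given block. */
struct chs lba_to_chs(struct geometry g, uint64_t lba)
{
    struct chs a;
    uint64_t track = lba / g.sectors_per_track;   /* track index over all heads */

    a.sector   = (uint32_t)(lba % g.sectors_per_track) + 1;
    a.head     = (uint32_t)(track % g.heads);
    a.cylinder = (uint32_t)(track / g.heads);
    return a;
}
```

With 16 heads and 63 sectors per track, for example, cylinder 1, head 0, sector 1 is block 16 * 63 = 1008. Consecutive block numbers stay within one cylinder as long as possible, so reading them in order needs no arm movement until the cylinder is exhausted.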
In an architectural sense, what's important about disk drives?
...and yet, the Information Revolution (情報革命?) can fairly be said to be built on disk drives. Without them, there would be no PCs, no Google.
There are a number of common types of controller:
Obviously, all of this requires a lot of software; in Linux, there are almost 3,300 different device drivers! Don't worry, the complexity is actually quite manageable; we'll come back to that when we discuss the drivers themselves below.
Those methods refer to how the CPU talks to, or controls, the device. In both cases, there are two primary ways to get your actual data out:
As you might expect from the initial discussion of hardware, there are several levels of device drivers, starting with software to control the actual buses and going on down to the devices. The bus drivers are used more or less as a library of functions for the actual device drivers.
In most modern systems, the device driver that matches a particular device can be loaded as a kernel module after the device is identified by the OS.
A device driver must follow a particular form, which is very dependent on the operating system. Over the last several years, there has been a push for OS-independent device drivers, so that OS developers can share the same code for a device independent of whether it was developed for Windows, Linux, or Mac.
In Unix, the code for a device driver is divided into the top half and the bottom half. (The bottom half is usually much less than half of the total code, though.) The bottom half is essentially the interrupt handler, and it must be prepared to run at any time, with the system in any state. The top half generally runs with the system set to the state (e.g., memory map) of the process that is scheduling (or has scheduled) the I/O.
Linux has adopted a different terminology for "top half" and "bottom half" of the device driver, compared to other versions of UNIX.
In FreeBSD, with heritage going back to the original PDP-11 UNIX here:
The section of a driver that services I/O requests is invoked because of system calls or by the virtual-memory system. This portion of the device driver executes synchronously in the top half of the kernel and is permitted to block by calling the sleep() routine. We commonly refer to this body of code as the top half of a device driver.
Interrupt service routines are invoked when the system fields an interrupt from a device. Consequently, these routines cannot depend on any per-process state. Historically they did not have a thread context of their own, so they could not block. In FreeBSD 5.2 an interrupt has its own thread context, so it can block if it needs to do so. However, the cost of extra thread switches is sufficiently high that for good performance device drivers should attempt to avoid blocking. We commonly refer to a device driver's interrupt service routines as the bottom half of a device driver.
In Linux here:
Linux (along with many other systems) resolves this problem by splitting the interrupt handler into two halves. The so-called top half is the routine that actually responds to the interrupt -- the one you register with request_irq. The bottom half is a routine that is scheduled by the top half to be executed later, at a safer time. The big difference between the top-half handler and the bottom half is that all interrupts are enabled during execution of the bottom half -- that's why it runs at a safer time. In the typical scenario, the top half saves device data to a device-specific buffer, schedules its bottom half, and exits: this operation is very fast. The bottom half then performs whatever other work is required, such as awakening processes, starting up another I/O operation, and so on. This setup permits the top half to service a new interrupt while the bottom half is still working.
(Thanks to Eden for pointing out this distinction; I was not aware of it.)
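The Linux-style split quoted above can be mimicked in ordinary user-space C. This is only a single-threaded simulation with no real interrupts and no locking; the names device_state, top_half, and bottom_half and the ring size are hypothetical, and a real driver would use the kernel's own deferral mechanisms instead of a flag.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8              /* hypothetical buffer size */

struct device_state {
    unsigned char ring[RING_SIZE];   /* device-specific buffer */
    size_t head;                     /* next slot the top half writes */
    size_t tail;                     /* next slot the bottom half reads */
    int bottom_half_pending;         /* "scheduled" flag */
    unsigned long processed_sum;     /* stands in for the real work */
    unsigned dropped;                /* data lost because the ring was full */
};

/* Top half: save the device data, schedule the bottom half, and exit. */
void top_half(struct device_state *dev, unsigned char data)
{
    size_t next = (dev->head + 1) % RING_SIZE;

    assert(dev != NULL);
    if (next == dev->tail) {         /* ring full: nothing else we can do here */
        dev->dropped++;
        return;
    }
    dev->ring[dev->head] = data;
    dev->head = next;
    dev->bottom_half_pending = 1;
}

/* Bottom half: runs later and drains everything the top half queued.
 * Returns the number of items handled. */
size_t bottom_half(struct device_state *dev)
{
    size_t handled = 0;

    assert(dev != NULL);
    dev->bottom_half_pending = 0;
    while (dev->tail != dev->head) {
        dev->processed_sum += dev->ring[dev->tail];
        dev->tail = (dev->tail + 1) % RING_SIZE;
        handled++;
    }
    return handled;
}
```

The point of the structure is that top_half() does a constant amount of work no matter how slow the later processing is, so several "interrupts" can be absorbed before bottom_half() gets a chance to run.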
In general, "bottom half" historically meant "closer to the hardware" whereas "top half" meant "closer to the user process". The bottom half includes the interrupt service routine (ISR), and must not assume availability of resources such as the user process's virtual memory map.
Very, very interesting...
The ISR is the routine called to complete an I/O. There is some description of how this is done for SCSI disks here. n.b.: this discussion only seems to apply to older devices.
Pointers to the functions that are called when an application requests an operation on a character device are kept in the cdevsw structure, in http://fxr.watson.org/fxr/source/sys/conf.h. This is a good example of object-oriented programming in C.
Note the element d_strategy in the cdevsw above. That's what the paging system actually calls to ask for an I/O to be done. It sets up a bio struct for the buffered I/O.
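The object-oriented flavor of a device switch comes from keeping a table of function pointers per device. The sketch below is a toy analogue and not the real cdevsw: the struct toy_devsw, its members, and the two example devices are all invented for illustration, and the real structure has more entry points with different signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy analogue of a character-device switch entry (NOT the real cdevsw). */
struct toy_devsw {
    const char *name;
    int  (*open)(void *softc);
    long (*read)(void *softc, char *buf, size_t len);
    int  (*close)(void *softc);
};

static int generic_open(void *softc)  { (void)softc; return 0; }
static int generic_close(void *softc) { (void)softc; return 0; }

/* A "null"-style device: every read reports end of file. */
static long null_read(void *softc, char *buf, size_t len)
{
    (void)softc; (void)buf; (void)len;
    return 0;
}

/* A "zero"-style device: every read fills the buffer with zero bytes. */
static long zero_read(void *softc, char *buf, size_t len)
{
    (void)softc;
    memset(buf, 0, len);
    return (long)len;
}

const struct toy_devsw null_dev = {
    .name = "null", .open = generic_open, .read = null_read, .close = generic_close,
};
const struct toy_devsw zero_dev = {
    .name = "zero", .open = generic_open, .read = zero_read, .close = generic_close,
};

/* Generic layer: dispatches through the table without knowing the device. */
long dev_read(const struct toy_devsw *dev, char *buf, size_t len)
{
    long n;

    assert(dev != NULL && dev->open != NULL && dev->read != NULL);
    if (dev->open(NULL) != 0)
        return -1;
    n = dev->read(NULL, buf, len);
    if (dev->close != NULL)
        dev->close(NULL);
    return n;
}
```

dev_read() plays the role of the system-call layer: it is written once, and adding a new device means filling in another table rather than touching the caller. That is essentially a hand-built virtual method table.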
The strategy routine generically uses an "elevator sort" to make I/Os efficient but avoid the "California Pizza Kitchen" problem of indefinitely postponing any given operation. See here for comments on the bio structures and the sort.
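One simple form of elevator ordering is the one-way sweep: service everything at or beyond the current head position in ascending order, then wrap around and service the rest, again ascending. The sketch below assumes a fixed batch of pending block numbers in a plain array; elevator_order is a hypothetical name, and a real kernel keeps the queue sorted incrementally as requests arrive rather than sorting a batch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Reorder pending block numbers into one-way elevator order:
 * first every block >= head_pos in ascending order (the current sweep),
 * then every block < head_pos in ascending order (the next sweep). */
void elevator_order(unsigned long *blocks, size_t n, unsigned long head_pos)
{
    size_t split = 0;
    unsigned long *tmp;

    assert(blocks != NULL || n == 0);
    if (n < 2)
        return;
    qsort(blocks, n, sizeof blocks[0], cmp_block);

    /* Find the first request the head has not yet passed. */
    while (split < n && blocks[split] < head_pos)
        split++;
    if (split == 0 || split == n)
        return;                      /* everything falls in a single sweep */

    /* Rotate left by 'split' so the current sweep comes first. */
    tmp = malloc(split * sizeof *tmp);
    if (tmp == NULL)
        return;                      /* still sorted, just not rotated */
    memcpy(tmp, blocks, split * sizeof *tmp);
    memmove(blocks, blocks + split, (n - split) * sizeof blocks[0]);
    memcpy(blocks + (n - split), tmp, split * sizeof *tmp);
    free(tmp);
}
```

For this fixed batch, a request behind the head waits at most until the next sweep, which is the property that keeps any single operation from being put off forever while still keeping arm movement small.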
See here for a guide to writing BSD device drivers.
See http://fxr.watson.org/fxr/source/sys/bio.h for the struct bio, the kernel structure for a block I/O operation.
Ideally, devices would always identify themselves completely. Most devices provide some identification, but those that store data could, and should, make more effective use of the volume name, which is generally embedded in the device.
The principal reasons that I/O slows down are:
For tape drives, underflowing or overflowing a buffer results in a tape stall, which is extremely expensive.
In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second is buffering, to allow speed matching of I/O and allow the application to continue while I/O is handled by the kernel on its behalf.
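The alignment job of the buffer cache comes down to a little arithmetic: an application asks for an arbitrary byte range, and the kernel has to work out which whole pages cover it. Here is a sketch of that calculation, assuming a 4096-byte page; TOY_PAGE_SIZE, struct page_range, and covering_pages are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u      /* assumption: 4 KB pages */

struct page_range {
    uint64_t first_page;         /* index of the first page touched */
    uint64_t page_count;         /* number of whole pages to read */
    uint64_t start_offset;       /* where the caller's data begins in page 0 */
};

/* Which pages must be in the cache to satisfy a read of 'length' bytes
 * starting at byte 'offset' of a file? */
struct page_range covering_pages(uint64_t offset, uint64_t length)
{
    struct page_range r = {0, 0, 0};
    uint64_t last_page;

    assert(offset + length >= offset);           /* no wraparound */
    r.first_page   = offset / TOY_PAGE_SIZE;
    r.start_offset = offset % TOY_PAGE_SIZE;
    if (length == 0)
        return r;
    last_page    = (offset + length - 1) / TOY_PAGE_SIZE;
    r.page_count = last_page - r.first_page + 1;
    return r;
}
```

A 10-byte read at offset 4090 straddles a page boundary, so it needs two full pages of I/O even though the application sees only 10 bytes; the copy from the cached pages into the user's buffer is where the unaligned request finally gets honored.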
Packets arrive into the system in a variety of sizes. Worse, in general, you don't know which process (if any!) wants the packet until you get it into memory and examine the headers.
These effects cumulatively mean that data copies are common in operating systems, and they have an enormous impact on system performance:
```c
#include <stdio.h>

/* rdtsc(): read the CPU's cycle counter (the time base on PowerPC). */
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    /* 0x0f 0x31 is the RDTSC opcode; "=A" takes the result from EDX:EAX */
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#elif defined(__powerpc__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int result = 0;
    unsigned long int upper, lower, tmp;
    /* Re-read the upper half until it is stable across the lower read */
    __asm__ volatile(
        "0:                  \n"
        "\tmftbu   %0        \n"
        "\tmftb    %1        \n"
        "\tmftbu   %2        \n"
        "\tcmpw    %2,%0     \n"
        "\tbne     0b        \n"
        : "=r"(upper), "=r"(lower), "=r"(tmp)
    );
    result = upper;
    result = result << 32;
    result = result | lower;
    return result;
}
#else
#error "rdtsc() is not implemented for this architecture"
#endif

#define MAX 100

int main(void)
{
    int i;
    unsigned long long int vals[MAX];
    unsigned long long int lastval = 0, thisval;

    /* Touch the array first so the timed loop does not pay for first access */
    for (i = 0; i < MAX; i++)
        vals[i] = 1;

    /* Back-to-back counter reads: the deltas show the cost of rdtsc itself */
    for (i = 0; i < MAX; i++)
        vals[i] = rdtsc();
    for (i = 0; i < MAX; i++)
        printf("value: %llu delta: %llu\n", vals[i],
               i ? vals[i] - vals[i - 1] : 0ULL);

    printf("=====\n");

    /* Same measurement, but each delta now also includes a printf() call */
    for (i = 0; i < MAX; i++) {
        thisval = rdtsc();
        printf("value: %llu delta: %llu\n", thisval, thisval - lastval);
        lastval = thisval;
    }
    return 0;
}
```
None, just work on your project.
The Linux kernel is browsable online.
The FreeBSD kernel is browsable online. You can find sources for a more complete distribution here.
Followup for this week: