Keio University
Spring Semester 2012

System Software / Operating Systems

Spring Semester 2012, Tuesday, 2nd period
Course code: 60730
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 11, June 29: Input/Output Systems

Outline

What's a Disk Drive?
- The Importance of a Disk Drive
- The Access Time Gap
- The Insides of a Disk Drive
- Disk Drive Trends
Talking to the Hardware
- Buses and I/O Ports
- Device Types
- Doing the Talking
Device Drivers
- Character and Block Devices
- Top Half and Bottom Half
Naming
Performance
- Data Copies in the File System and Elsewhere
- Interrupt Rate
- Coalescing Interrupts
Tools

Today's Photos

What's a Disk Drive?

A disk drive stores data in sectors that are held on tracks; all of the tracks at the same distance from the spindle are called a cylinder.

It uses a read/write head attached to a slider, mounted on an actuator arm, to read and write the data as it spins past.

The Importance of a Disk Drive

In an architectural sense, what's important about disk drives?

They are expensive
They consume lots of power
They are often the performance bottleneck (the access time gap)
They break more easily than many other parts of the system

...and yet, the Information Revolution （情報革命？） can fairly be said to be built on disk drives. Without them, there would be no PCs, no Google.

Disk drive industry shipments, in terabytes

The Access Time Gap

The Insides of a Disk Drive

Disk Drive Trends

Hard drive capacity over time (Wikipedia)

Talking to the Hardware

In order to understand I/O, we need to briefly review the hardware architecture...

Buses and I/O Ports

Systems generally consist of multi-level attachments that provide differing types of aggregation. Some physical devices sit on, or close to, the main memory bus; others are kept more distant via some sort of controller.

There are a number of common types of controller:

Serial
Parallel
SCSI
ATA
USB
Firewire
...

As it gets easier to put more hardware into the devices themselves, they exhibit more complex behavior, including helping the OS identify them. We will come back to that later. First, some examples of the device types:

Device Types

Okay, so what kind of devices are we actually talking to? There are many, many kinds of I/O devices. Here are the classes that SCSI defines:

00h - direct-access device (e.g., magnetic disk)
01h - sequential-access device (e.g., magnetic tape)
02h - printer device
03h - processor device
04h - write-once device
05h - CDROM device
06h - scanner device
07h - optical memory device (e.g., some optical disks)
08h - medium changer device (e.g., jukeboxes)
09h - communications device
0Ah-0Bh - defined by ASC IT8 (Graphic arts pre-press devices)
0Ch - Storage array controller device (e.g., RAID)
0Dh - Enclosure services device
0Eh - Simplified direct-access device (e.g., magnetic disk)
0Fh - Optical card reader/writer device
10h - Reserved for bridging expanders
11h - Object-based Storage Device
12h - Automation/Drive Interface
13h-1Dh - reserved
1Eh - Well known logical unit
1Fh - unknown or no device type

Here are the USB device types:

00h - Use class information in the Interface Descriptors
01h - Audio
02h - Communications and CDC Control
03h - HID (Human Interface Device)
05h - Physical
06h - Image
07h - Printer
08h - Mass Storage
09h - Hub
0Ah - CDC-Data
0Bh - Smart Card
0Dh - Content Security
0Eh - Video
DCh - Diagnostic Device
E0h - Wireless Controller
EFh - Miscellaneous
FEh - Application Specific
FFh - Vendor Specific

And neither of these lists includes graphics devices such as the monitor or graphics display itself. Other devices requiring similar I/O control may include specialized processors, and of course all manner of scientific equipment.

Obviously, all of this requires a lot of software; in Linux, there are almost 3,300 different device drivers! Don't worry, the complexity is actually quite manageable; we'll come back to that when we discuss the drivers themselves below.

Doing the Talking

...okay. Now, how do we talk to a device? There are two basic ways for the CPU to talk to hardware devices:

I/O Instructions
Memory-Mapped I/O

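When using I/O instructions, the CPU executes an IN or OUT instruction, which reads from or writes to a separate address space (namespace) for I/O devices, usually attached to a separate bus. With memory-mapped I/O, the device's registers appear at addresses in the ordinary memory address space, and the CPU accesses them with normal load and store instructions.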

Those methods refer to how the CPU talks to, or controls, the device. In both cases, there are two primary ways to get your actual data out:

Programmed I/O
Direct Memory Access (DMA)

With programmed I/O, the CPU itself moves each word of data between the device and memory. DMA involves setting up some other piece of hardware besides the CPU to actually control the I/O and move the data between the device and main memory. DMA may be done to virtual or physical addresses. The primary advantage of virtual addresses is that they support scatter/gather I/O, in which a single transfer is spread across buffers that are not physically contiguous. These days, most device controllers support scatter/gather directly for physical addresses anyway, and with the MMU incorporated into the CPU it is a little harder for devices to use the address translation hardware, so DMA to virtual addresses is not commonly done any more.

Device Drivers

So far, we've hardly said a word about the operating system. The device driver is the primary piece of the OS that is responsible for managing I/O.

As you might expect from the initial discussion of hardware, there are several levels of device drivers, starting with software to control the actual buses and going on down to the devices. The bus drivers are used more or less as a library of functions for the actual device drivers.

In most modern systems, the device driver that matches a particular device can be loaded as a kernel module after the device is identified by the OS.

A device driver must follow a particular form, which is very dependent on the operating system. Over the last several years, there has been a push for OS-independent device drivers, so that OS developers can share the same code for a device independent of whether it was developed for Windows, Linux, or Mac.

Character and Block Devices

Very early on in the development of Unix, the authors made a brilliant decision: they divided hardware devices into two classes, the block devices and the character devices. The primary difference between the two is that file systems can be mounted on block devices, requiring additional functions from the device driver.

Top Half and Bottom Half

In Unix, the code for a device driver is divided into the top half and the bottom half. (The bottom half is usually much less than half of the total code, though.) The bottom half is essentially the interrupt handler, and it must be prepared to run at any time, with the system in any state. The top half generally runs with the system set to the state (e.g., memory map) of the process that is scheduling (or has scheduled) the I/O.

Naming

Devices used to be named strictly according to their bus address, which was simple and never changed. In today's Plug-and-Play (PNP) world, that's simply not so. Moreover, different flash drives can be plugged into the same slot and use the same address (over time), but the OS should treat them differently!

Ideally, devices would always identify themselves completely. Most devices provide some identification, but those that store data could, and should, make more effective use of the volume name, which is generally embedded in the device.

Performance

The overall performance goal, as we discussed in the lecture on process scheduling, is generally a balance between keeping the CPU busy and keeping the devices busy. For the moment, we are really only concerned with how to achieve the highest I/O rate (measured in throughput or I/O operations per second).

The principal reasons that I/O slows down are:

The CPU being late on an operation, resulting in the device stalling.
Data copies in the file system and elsewhere.
Interrupt handling (which leads to CPU load and stalling the device).

Device Stalls

For a disk drive, a common form of disk performance problem is a rotational miss. Disks also must seek, and poor choices in ordering seeks can ruin your performance, but we don't have time to go into that right now.

For tape drives, underflowing or overflowing a buffer results in a tape stall, which is extremely expensive.

Data Copies in the File System and Elsewhere

A couple of weeks ago we talked about file system APIs. At one point, we talked about the alignment of application file read/write buffers. In modern C/Unix APIs, the buffer can be any place in memory, but in older systems, buffers always had to be aligned to the size of the system memory page.

In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second is buffering, to allow speed matching of I/O and allow the application to continue while I/O is handled by the kernel on its behalf.

Network I/O has its own version of the problem. Packets arrive into the system in a variety of sizes. Worse, in general, you don't know which process (if any!) wants the packet until you get it into memory and examine the headers.

These effects cumulatively mean that data copies are common in operating systems, and they have an enormous impact on system performance:

They tie up the processor itself
They tie up the memory bus
They pollute the cache

Interrupt Rate and Coalescing Interrupts

How many interrupts per second do you get from a 100Mbps Ethernet card with 1500B frames? What about a 10Gbps Ethernet card with minimum-size frames?

Tools

On Linux, and on some other Unix systems, the following tools are useful:

vmstat (vm_stat on Mac, with different semantics)
lspci
iostat
dmesg

You may be interested in the following benchmarks:

bonnie++
lmbench

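On most modern Intel and AMD processors, the following code should give you the value of the time stamp counter, which counts clock cycles since the processor was last reset (normally the last reboot). On PowerPC, it reads the time base register instead: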

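#include <stdio.h>

#define MAX 100   /* number of samples to take in each test */

/*
 * rdtsc(): read a free-running hardware counter.
 * On x86 this is the time stamp counter; on PowerPC, the time base.
 */
#if defined(__i386__)

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
  /* 0x0f 0x31 is the RDTSC opcode; "=A" returns the result in EDX:EAX. */
  __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
  return x;
}

#elif defined(__x86_64__)

static __inline__ unsigned long long rdtsc(void)
{
  unsigned hi, lo;
  /* RDTSC puts the low 32 bits in EAX and the high 32 bits in EDX. */
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

#elif defined(__powerpc__)

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int result = 0;
  unsigned long int upper, lower, tmp;
  /* Read upper, lower, then upper again; retry if the upper half changed,
     which means the lower half wrapped between the two reads. */
  __asm__ volatile(
                "0:                  \n"
                "\tmftbu   %0           \n"
                "\tmftb    %1           \n"
                "\tmftbu   %2           \n"
                "\tcmpw    %2,%0        \n"
                "\tbne     0b         \n"
                : "=r"(upper), "=r"(lower), "=r"(tmp)
                :
                : "cc"   /* cmpw modifies the condition register */
                );
  result = upper;
  result = result << 32;
  result = result | lower;

  return result;
}

#else
#error "rdtsc() is not implemented for this architecture"
#endif

int main(void)
{
  int i;
  unsigned long long int vals[MAX];
  unsigned long long int lastval = 0, thisval;

  /* Touch the array first so that page faults do not disturb the timing. */
  for (i = 0; i < MAX; i++)
    vals[i] = 1;
  /* Test 1: read the counter back to back, then print afterwards. */
  for (i = 0; i < MAX; i++)
    vals[i] = rdtsc();
  for (i = 0; i < MAX; i++)
    printf("value: %llu delta: %llu\n", vals[i],
           i ? vals[i] - vals[i - 1] : 0ULL);

  printf("=====\n");

  /* Test 2: print inside the loop, so each delta includes the cost of printf(). */
  for (i = 0; i < MAX; i++) {
    thisval = rdtsc();
    printf("value: %llu delta: %llu\n",
           thisval, thisval - lastval);
    lastval = thisval;
  }

  return 0;
}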

Homework, Etc.

Pick one of the I/O papers from the Wisconsin Advanced OS class page, and write a 2-3 page report. That's papers 29 to 39 here.
Read The All-IP Manifesto, a rough set of notes I wrote up about where systems are going. Write a 1-page report on it. Note: next week you will be assigned to read one or more of the papers referenced there; reading this will help you decide which ones!

Next Lecture

Next lecture:

Followup for this week:

Rubini and Corbet, Linux Device Drivers, O'Reilly.
How Stuff Works has a great page on buses.
The Intel Architecture Software Developer's Manual Volume 1 has a short chapter on I/O.
The Linux I/O Port Programming mini-HOWTO has some useful code for getting very high-resolution timing information on a Pentium, accurate to individual clock cycles.
USB device classes.
The Intel White Paper on QuickPath Interconnect (QPI) available here is very good, and has figures showing the evolution of PC hardware architectures.
AMD's alternative proposal for revamping data movement architectures is HyperTransport.
Although the technology as described in this Ars Technica article is a bit of a moving target, it gives you the idea of what's happening.
