Keio University
2008 Fall Semester
Computer Architecture
Lecture 9, December 2:
Systems: Shared-Memory Multiprocessors
The all-important question: what does this beagle have to do with
computer architecture?
Over the last several weeks, in our discussions of pipeline, cache,
and memory architecture, we saw how large the penalty for a cache
miss can be: potentially, hundreds of clock cycles. A modern
microprocessor is like a drag racer: it goes great in a straight line,
but don't try to turn it! If you want it to change direction,
you must let it know well in advance.
Outline of This Lecture
- Review: Amdahl's Law
- Parallel Processing
- Types of Parallel Machines
- Basics of Sharing a Bus
- Cache Coherence
- Homework
Review: Amdahl's Law
In the first lecture, we discussed Amdahl's
Law, which tells us how much improvement we can expect
by parallelizing the execution of some chunks of work.
Amdahl's Law applies in a variety of ways (including superscalar or
multiple-issue processors, and I/O systems), but in this lecture we
are interested in how it affects processing using multiple
microprocessors.
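As a reminder, if a fraction P of the execution time can be
parallelized perfectly across N processors while the remaining (1 - P)
must run serially, Amdahl's Law gives the overall speedup as

    Speedup(N) = 1 / ((1 - P) + P / N)

so no matter how large N becomes, the speedup can never exceed
1 / (1 - P).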
Parallel Processing
Below is a graph from a paper showing the speedup of a
particular application versus the number of processors. Notice that
the speedup is essentially linear for small numbers of
processors (up to eight or so), then begins to saturate for larger
numbers. At 32 processors, the speedup of the best algorithm is
only about 22, roughly two-thirds of the ideal.
Types of Parallel Machines
- Parallel Hardware Architecture Classes
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data
- Shared-memory multiprocessors
- Distributed-memory multiprocessors
- Parallel Programming
- Shared Memory
- Message Passing
Long ago, a brilliant researcher (Michael Flynn) divided computer
architectures up into four classes:
- SISD: Single Instruction, Single Data
Basic uniprocessors -- until recently, most PCs and many servers
- SIMD: Single Instruction, Multiple Data
  Some beautiful architectures fall into this class, notably:
  - the Connection Machine
  - vector processors (in a manner of speaking)
  - array processors
  - digital signal processors (DSPs)
  - graphics processors
  SIMD machines are also called "data parallel" architectures, and are
  difficult to program.
- MISD: Multiple Instruction, Single Data
(No viable architectures here)
- MIMD: Multiple Instruction, Multiple Data
This is what most people think of when they say "parallel
computer". There are two basic models:
- Shared-memory multiprocessors
- Distributed-memory multiprocessors
- (There are a number of variations on these themes, such as
non-uniform memory access (NUMA) architectures.)
Let's look at the basic hardware layout in the last category, MIMD.
First, the shared-memory architecture:
Then, the distributed-memory architecture:
In this session, we will focus on the first type, and cover the
second type next week.
Parallel Programming Models
In line with the above, programming languages and libraries can be
divided into several categories:
- Data parallel languages
- Shared memory (multithreaded)
- Distributed memory (message passing)
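As a minimal sketch of the shared-memory (multithreaded) style, the C
program below uses POSIX threads (the array size, thread count, and
names are chosen only for this example, not taken from any particular
system). Each thread sums its own slice of a shared array and adds its
result into a shared total under a mutex:

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N_ELEMS   1000

    static double data[N_ELEMS];   /* shared memory: visible to every thread */
    static double total = 0.0;     /* shared accumulator */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *partial_sum(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N_ELEMS / N_THREADS);
        long hi = lo + (N_ELEMS / N_THREADS);
        double local = 0.0;

        for (long i = lo; i < hi; i++)
            local += data[i];

        pthread_mutex_lock(&lock);   /* serialize updates to the shared total */
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[N_THREADS];

        for (long i = 0; i < N_ELEMS; i++)
            data[i] = 1.0;

        for (long t = 0; t < N_THREADS; t++)
            pthread_create(&tid[t], NULL, partial_sum, (void *)t);
        for (long t = 0; t < N_THREADS; t++)
            pthread_join(tid[t], NULL);

        printf("total = %f\n", total);   /* expect 1000.0 */
        return 0;
    }

All of the threads read and write the same physical memory; the bus
and the cache-coherence mechanisms discussed below are what make this
model work. A message-passing version would instead exchange partial
sums as explicit messages, which we will see next week.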
Basics of Sharing a Bus
In the figure above, all of the processors can share all of the
memory. Physical memory addresses are defined on the bus, and are the
same when viewed from all processors. There are some key points:
- Buses generally support only one transaction at a time (modulo
pipelining).
- A processor that wants access to memory must arbitrate
for access to the bus. The fairness of the bus arbitration
scheme is an important property; some buses use priority
schemes, others use round-robin, either strict or loose (a simple
round-robin arbiter is sketched after this list).
- Bus bandwidth may be the limiting factor on system performance.
- Memory bandwidth or latency may be the limiting factor on system
performance.
- For electrical reasons, as buses get longer or have more devices
attached to them, their bandwidth goes down, so there is strong
incentive to keep the bus short and the number of attached devices
small.
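To make the round-robin idea concrete, here is a small C sketch of an
arbiter. It is not modeled on any particular bus standard; the
requester count and the request encoding are invented for the example.
Starting just after the previous winner, it scans the request lines in
circular order and grants the first one asserted, so every device
waits at most one full rotation:

    #include <stdio.h>

    #define N_REQUESTERS 4

    /* Round-robin arbitration: starting just after the previous winner,
     * scan the request lines in circular order and grant the first one
     * that is asserted.  Returns the winner, or -1 if nobody wants the bus. */
    static int rr_arbitrate(const int request[N_REQUESTERS], int last_grant)
    {
        for (int i = 1; i <= N_REQUESTERS; i++) {
            int candidate = (last_grant + i) % N_REQUESTERS;
            if (request[candidate])
                return candidate;
        }
        return -1;
    }

    int main(void)
    {
        int request[N_REQUESTERS] = {1, 0, 1, 1};   /* devices 0, 2, 3 want the bus */
        int last = 0;

        for (int cycle = 0; cycle < 4; cycle++) {
            int winner = rr_arbitrate(request, last);
            printf("cycle %d: grant to device %d\n", cycle, winner);
            if (winner >= 0)
                last = winner;
        }
        return 0;
    }

A strict priority scheme would simply scan from device 0 every time,
which is simpler but can starve the low-priority devices.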
Cache Coherence
The diagram above shows processors capable of accessing shared
memory through their caches. This presents a problem, since a given
piece of data may now reside in more than one place: in main memory
and in one or more caches. There are two basic schemes
for cache coherence in shared-memory multiprocessors:
- Directory based: the information about the sharing of data
is kept in one centralized location (e.g., a special-purpose
memory); a sketch of a directory entry follows this list.
- Snooping: information about the state of each cache block is
distributed among the processors, which maintain their information by
listening, or snooping, on the bus.
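A directory entry is often pictured as a block state plus a bit vector
of sharers, one bit per processor. The C sketch below is only an
illustration (the entry layout, names, and the printf stand-in for a
real invalidation message are invented for the example); its point is
that the directory knows exactly which caches hold a copy, so
invalidations can be sent point-to-point instead of broadcast:

    #include <stdint.h>
    #include <stdio.h>

    #define N_PROCS 32

    /* Illustrative directory entry: one per memory block. */
    struct dir_entry {
        enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } state;
        uint32_t sharers;   /* bit i set => processor i holds a copy */
    };

    /* On a write request from 'writer', invalidate only the actual sharers. */
    static void handle_write_request(struct dir_entry *e, int writer)
    {
        for (int p = 0; p < N_PROCS; p++)
            if ((e->sharers & (1u << p)) && p != writer)
                printf("invalidate block in processor %d\n", p);
        e->sharers = 1u << writer;          /* writer is now the sole owner */
        e->state   = EXCLUSIVE_DIRTY;
    }

    int main(void)
    {
        struct dir_entry e = { SHARED_CLEAN, (1u << 0) | (1u << 3) };  /* P0, P3 share */
        handle_write_request(&e, 0);   /* P0 writes: only P3 gets an invalidation */
        return 0;
    }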
In a snooping protocol, a cache block can be in one of three
states:
- shared (read only)
- exclusive (read/write)
- invalid (uncached)
(A fourth state, modified, may be added to this list to
distinguish a block that is exclusive but unmodified (clean) from one
that has been changed (dirty). That approach is commonly known by the
acronym MESI.)
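A minimal C sketch of the per-block state machine on the snooping
side, assuming a simple three-state invalidate protocol of the kind
described above (the event names and the transition function are
illustrative, not a complete protocol):

    #include <stdio.h>

    /* Per-block cache state for a snooping invalidate protocol. */
    enum block_state { INVALID, SHARED, EXCLUSIVE };   /* EXCLUSIVE = read/write copy */

    enum bus_event { BUS_READ, BUS_WRITE };   /* another processor's transaction */

    /* Transition taken when a remote transaction for this block is snooped.
     * (A local write to a SHARED or INVALID block would itself place an
     * invalidating transaction on the bus and move to EXCLUSIVE; not shown.) */
    static enum block_state snoop_transition(enum block_state s, enum bus_event e)
    {
        switch (s) {
        case EXCLUSIVE:
            /* Give up exclusivity: keep a shared copy on a remote read,
             * invalidate on a remote write. */
            return (e == BUS_READ) ? SHARED : INVALID;
        case SHARED:
            /* A remote write invalidates our read-only copy. */
            return (e == BUS_WRITE) ? INVALID : SHARED;
        case INVALID:
        default:
            return INVALID;
        }
    }

    int main(void)
    {
        enum block_state s = EXCLUSIVE;
        s = snoop_transition(s, BUS_READ);    /* EXCLUSIVE -> SHARED  */
        s = snoop_transition(s, BUS_WRITE);   /* SHARED    -> INVALID */
        printf("final state = %d\n", s);      /* 0, i.e. INVALID */
        return 0;
    }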
When a processor writes a memory location, the address is placed
on the bus, and all of the other processors see it and know to
invalidate any copy of that block in their own caches. If the address
is put on the bus but the data is not, the protocol is a write
invalidate protocol. If the data is always put on the bus along with
the address, it is called a write update or write broadcast protocol.
The overhead of write update is higher, since data travels on every
write, but all cached copies are kept consistent, so other processors
can read the new value without another bus transaction.
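The contrast can be sketched from the snooping cache's point of view:
under write invalidate only the address arrives and the local copy is
dropped, while under write update the new data arrives with the
address and the local copy is refreshed. The bus-message and
cache-block structures below are invented purely for illustration:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* A remote write as seen on the bus: the address always travels; the
     * data travels only under the write-update (write-broadcast) policy. */
    struct bus_write {
        uint32_t addr;
        bool     carries_data;
        uint32_t data;
    };

    struct cached_block { uint32_t addr; uint32_t data; bool valid; };

    static void snoop_remote_write(struct cached_block *b, struct bus_write w)
    {
        if (!b->valid || b->addr != w.addr)
            return;                   /* we do not hold this block: ignore */
        if (w.carries_data)
            b->data = w.data;         /* write update: refresh our copy */
        else
            b->valid = false;         /* write invalidate: drop our copy */
    }

    int main(void)
    {
        struct cached_block b = { .addr = 0x100, .data = 7, .valid = true };

        /* Write invalidate: only the address appears on the bus. */
        snoop_remote_write(&b, (struct bus_write){ .addr = 0x100, .carries_data = false });
        printf("after invalidate: valid = %d\n", (int)b.valid);   /* 0 */

        /* Write update: address and data appear together. */
        b = (struct cached_block){ .addr = 0x100, .data = 7, .valid = true };
        snoop_remote_write(&b, (struct bus_write){ .addr = 0x100, .carries_data = true, .data = 42 });
        printf("after update: data = %u\n", (unsigned)b.data);    /* 42 */
        return 0;
    }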
Homework
This week's homework (submit via SFS):
- Plot the expected speedup for a parallel system as the number of
processors increases. Assume the workload is 1 unit of
unparallelizable work, followed by 1000 units of parallelizable
work.
- First, plot the absolute speedup as 1 to 1000 processors are
used. I recommend a log-log plot.
- Next, plot the efficiency, as a percentage: the speedup
divided by the number of processors used.
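If it helps as a starting point, the short C program below (only a
suggestion; any language or spreadsheet is fine) prints the speedup
and efficiency for the 1 + 1000 workload so the output can be fed to
your favorite plotting tool:

    #include <stdio.h>

    /* Workload from the assignment: 1 unit of serial work followed by
     * 1000 units of perfectly parallelizable work. */
    #define SERIAL_WORK   1.0
    #define PARALLEL_WORK 1000.0

    int main(void)
    {
        for (int n = 1; n <= 1000; n++) {
            double t1 = SERIAL_WORK + PARALLEL_WORK;       /* time on 1 processor  */
            double tn = SERIAL_WORK + PARALLEL_WORK / n;   /* time on n processors */
            double speedup    = t1 / tn;
            double efficiency = 100.0 * speedup / n;       /* as a percentage */
            printf("%d %f %f\n", n, speedup, efficiency);
        }
        return 0;
    }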
Next Lecture
Next week, we will continue with the discussion of multiprocessors,
focusing on distributed-memory systems and synchronization
primitives.
Next lecture:
Lecture 10, December 9: Systems: Distributed-Memory Multiprocessors
and Interconnection Networks
- Follow-up from this lecture: Sections 4.1, 4.2, and 4.3
- For next time: Sections 4.4 and 4.5
Additional Information