Keio University
2007 Fall Semester
Computer Architecture
Lecture 7, December 3: Systems: Shared-Memory Multiprocessors
The all-important question: what does this beagle have to do with
computer architecture?
In the cache and memory architecture discussions of the last two weeks,
we saw how large the penalty for a cache miss can be: potentially,
hundreds of clock cycles. A modern microprocessor is
like a drag racer: it goes great in a straight line, but don't try to
turn it! If you want it to change directions, you must let it
know well in advance.
Outline of This Lecture
- Review: Amdahl's Law
- Parallel Processing
- Types of Parallel Machines
- Basics of Sharing a Bus
- Cache Coherence
- Homework
Review: Amdahl's Law
In the first lecture, we discussed Amdahl's
Law, which tells us how much improvement we can expect
by parallelizing the execution of some chunks of work.
Amdahl's Law applies in a variety of ways (including superscalar or
multiple-issue processors, and I/O systems), but in this lecture we
are interested in how it affects processing using multiple
microprocessors.
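As a reminder: if a fraction P of the work can be spread perfectly across
N processors while the remaining (1 - P) must run serially, Amdahl's Law
gives

    Speedup(N) = 1 / ((1 - P) + P / N)

so the serial fraction (1 - P) caps the speedup at 1 / (1 - P) no matter
how many processors are added.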
Parallel Processing
As a concrete example, imagine that you have a system with 100
processors.
Below is a graph from a paper showing the speedup of a
particular application versus the number of processors. Notice that
the speedup is essentially linear for small numbers of
processors (up to eight or so), then begins to saturate for larger
numbers. At 32 processors, the speedup of the best algorithm is
only about 22, roughly two thirds of the ideal speedup of 32.
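As a rough check against Amdahl's Law: if that shortfall were due only to
a serial fraction, a speedup of 22 on 32 processors would mean
22 = 1 / ((1 - P) + P/32), which solves to P of roughly 0.985, i.e., only
about 1.5% serial work. In practice communication and synchronization
overhead contribute as well, but the arithmetic shows how sensitive the
speedup is to even a small serial fraction.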
Types of Parallel Machines
In 1966, Michael Flynn divided computer architectures
into four classes, now known as Flynn's taxonomy:
- SISD: Single Instruction, Single Data
Basic uniprocessors -- until recently, most PCs and many servers
- SIMD: Single Instruction, Multiple Data
Some beautiful architectures fall into this class, notably:
- the Connection Machine
- vector processors (in a manner of speaking)
- array processors
- digital signal processors (DSPs)
- graphics processors
- also called "data parallel" architectures
- difficult to program
- MISD: Multiple Instruction, Single Data
(No viable architectures here)
- MIMD: Multiple Instruction, Multiple Data
This is what most people think of when they say "parallel
computer". There are two basic models:
- Shared-memory multiprocessors
- Distributed-memory multiprocessors
- (There are a number of variations on these themes, such as
non-uniform memory access (NUMA) architectures.)
Let's look at the basic hardware layout in the last category.
First, the shared-memory architecture: processors (each with its own
cache) connected by a shared bus to a common memory.
Then, the distributed-memory architecture: processor-memory nodes
connected by an interconnection network.
In this session, we will focus on the first type, and cover the
second type next week.
Parallel Programming Models
In line with the above, programming languages and libraries can be
divided into several categories:
- Data parallel languages
- Shared memory (multithreaded) -- a small sketch follows this list
- Distributed memory (message passing)
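To make the shared-memory (multithreaded) model concrete, here is a
minimal sketch in Python (the language and all of the names are just
illustrative choices, not part of the course material). Several threads
work on their own slice of the data but accumulate into a single variable
that they all share, protected by a lock:

    import threading

    total = 0                    # shared memory: visible to every thread
    lock = threading.Lock()      # serializes updates to the shared variable

    def worker(values):
        global total
        partial = sum(values)    # each thread computes on its own slice
        with lock:               # then updates the shared result
            total += partial

    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]   # split the work four ways
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(total, sum(data))      # prints 499500 499500

In the message-passing model, by contrast, each worker has its own private
memory and must send its partial result to another process explicitly; we
will look at that model next week.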
Basics of Sharing a Bus
In the figure above, all of the processors can share all of the
memory. Physical memory addresses are defined on the bus, and are the
same when viewed from all processors. There are some key points:
- Buses generally support only one transaction at a time (modulo
pipelining).
- A processor that wants access to memory must arbitrate
for access to the bus. The fairness of the bus arbitration
scheme is an important property; some buses use priority
schemes, others use round-robin, either strict or loose
(a small round-robin sketch follows this list).
- Bus bandwidth may be the limiting factor on system performance.
- Memory bandwidth or latency may be the limiting factor on system
performance.
- For electrical reasons, as buses get longer or have more devices
attached to them, their achievable bandwidth goes down, so there is a strong
incentive to keep buses short and the number of attached devices small.
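To make the round-robin idea concrete, here is a small software sketch in
Python (a toy model with made-up names, not a description of any real
bus's arbiter). Each cycle the arbiter grants the bus to the first
requester found after the previous winner, so no requester can be starved:

    def round_robin_arbiter(num_devices):
        last_granted = num_devices - 1           # so device 0 is checked first
        def grant(requests):                     # requests: list of booleans
            nonlocal last_granted
            for offset in range(1, num_devices + 1):
                candidate = (last_granted + offset) % num_devices
                if requests[candidate]:
                    last_granted = candidate
                    return candidate             # this device owns the bus this cycle
            return None                          # no requests: bus idle
        return grant

    grant = round_robin_arbiter(4)
    print(grant([True, False, True, True]))      # 0
    print(grant([True, False, True, True]))      # 2 (device 0 just had its turn)
    print(grant([True, False, True, True]))      # 3

A fixed-priority scheme would instead always scan from device 0, which is
simpler but can starve the low-priority devices under heavy load.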
Cache Coherence
The diagram above shows processors accessing shared
memory through their caches. This presents a problem, since
a given piece of data may now reside in more than one place. There are two basic schemes
for cache coherence in shared-memory multiprocessors:
- Directory based: the information about sharing of data
is kept in one centralized location (e.g., a special-purpose
memory).
- Snooping: the state information for each cached block is
distributed among the caches, which keep it up to date by
listening, or snooping, on the bus.
In a snooping protocol, a cache block can be in one of three
states:
- shared (read only)
- exclusive (read/write)
- invalid (uncached)
(A fourth state, modified, may be added to this list to
distinguish an exclusive copy that is still clean from one that has been
changed; the resulting protocol is commonly known by the acronym MESI.)
When a processor writes a memory location, that address is placed
on the bus, and all of the other processors see it and know to
invalidate their own copies of that block. If the address is put on the bus but the
data is not, the protocol is a write invalidate protocol. If
the data is always put on the bus along with the address, it is
called a write update or write broadcast protocol. The
overhead of write update is higher, but the forced consistency of
memory simplifies bus requests.
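To tie the states and the bus traffic together, here is a toy
write-invalidate model in Python (illustrative only, with made-up names;
it tracks a single block per cache, ignores the actual data and
write-back, and leaves out the MESI refinement):

    class Cache:
        def __init__(self, name):
            self.name = name
            self.state = "invalid"

        def read(self, bus):
            if self.state == "invalid":
                bus.broadcast_read(self)        # fetch the block; others may share
                self.state = "shared"

        def write(self, bus):
            if self.state != "exclusive":
                bus.broadcast_invalidate(self)  # only the address goes on the bus
                self.state = "exclusive"

        # what this cache does when it snoops another processor's traffic
        def snoop_read(self):
            if self.state == "exclusive":
                self.state = "shared"           # someone else wants to read: demote

        def snoop_invalidate(self):
            self.state = "invalid"              # someone else is writing: drop our copy

    class Bus:
        def __init__(self, caches):
            self.caches = caches
        def broadcast_read(self, requester):
            for c in self.caches:
                if c is not requester:
                    c.snoop_read()
        def broadcast_invalidate(self, requester):
            for c in self.caches:
                if c is not requester:
                    c.snoop_invalidate()

    p0, p1 = Cache("P0"), Cache("P1")
    bus = Bus([p0, p1])
    p0.read(bus); p1.read(bus)                  # both end up shared
    p0.write(bus)                               # P0 goes exclusive, P1 is invalidated
    print(p0.state, p1.state)                   # exclusive invalid

Note that on the write only the address crosses the bus; a write-update
protocol would also broadcast the new data so that the other caches could
keep their copies valid instead of invalidating them.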
Homework
This week's homework (submit via email):
- Plot the expected speedup for a parallel system as the number of
processors increases. Assume the workload is 1 unit of
unparallelizable work, followed by 1000 units of parallelizable
work.
- First, plot the absolute speedup as 1 to 1000 processors are
used. I recommend a log-log plot. (A minimal plotting skeleton
follows this homework list.)
- Next, plot the efficiency, as a percentage: the speedup
divided by the number of processors used.
- Do problem 4.1 in the textbook. Note: the body of Chapter 4
refers to Exclusive, Shared, and Invalid
states, but this problem refers to Modified, Shared,
and Invalid. Assume that Modified means the same thing
as the Exclusive state we discussed in class. (If you do not have
the book, come to my desk and I will give you a copy of the
pages.)
- Do problem 4.5 in the textbook: draw the state diagram that is
the equivalent of Figure
4.7 in the textbook, but with both Modified and Exclusive
states. In this problem, Exclusive means that only one processor
has the data in its cache, but that copy is "clean". (Note that
Fig. 4.7 is essentially the combined left and right diagrams
in Fig. 4.6.)
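For the plotting problem, any tool is acceptable; below is a minimal
skeleton assuming Python with numpy and matplotlib (the file names are
just examples). It models the run time as 1 serial unit plus 1000/N
parallel units and takes the speedup relative to the 1001 units a single
processor needs:

    import numpy as np
    import matplotlib.pyplot as plt

    n = np.arange(1, 1001)                  # number of processors
    time = 1 + 1000 / n                     # serial unit + parallelizable part
    speedup = (1 + 1000) / time             # relative to one processor
    efficiency = 100 * speedup / n          # as a percentage

    plt.figure()
    plt.loglog(n, speedup)                  # log-log, as recommended
    plt.xlabel("processors"); plt.ylabel("speedup")
    plt.savefig("speedup.png")

    plt.figure()
    plt.semilogx(n, efficiency)
    plt.xlabel("processors"); plt.ylabel("efficiency (%)")
    plt.savefig("efficiency.png")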
Next Lecture
Next week, we will continue with the discussion of multiprocessors,
focusing on distributed-memory systems and synchronization
primitives.
Next lecture:
Lecture 8, December 10: Systems: Distributed-Memory Multiprocessors
and Interconnection Networks
- Follow-up from this lecture: Sections 4.1, 4.2, and 4.3
- For next time: Sections 4.4 and 4.5