The all-important question: what does this beagle have to do with computer architecture?
Over the last several weeks, in our discussions of pipelines, caches, and memory architecture, we saw how large the penalty for a cache miss can be: potentially hundreds of clock cycles. A modern microprocessor is like a drag racer: it goes great in a straight line, but don't try to turn it! If you want it to change direction, you must let it know well in advance.
In the first lecture, we discussed Amdahl's Law, which tells us how much improvement we can expect by parallelizing the execution of some chunks of work.
Amdahl's Law applies in a variety of ways (including superscalar or multiple-issue processors, and I/O systems), but in this lecture we are interested in how it affects processing using multiple microprocessors.
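As a reminder, in its standard form Amdahl's Law says that if a fraction p of the work can be run perfectly in parallel on N processors, while the remaining fraction 1 - p stays serial, the overall speedup is

\[ S(N) = \frac{1}{(1 - p) + \frac{p}{N}} , \]

so even a small serial fraction puts a hard ceiling on the achievable speedup.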
Below is a graph from a paper [Buehrer et al., MSPC'06] showing the speedup of a particular application versus the number of processors. Notice that the speedup is essentially linear for small numbers of processors (up to eight or so), then begins to saturate for larger numbers. At 32 processors, the speedup of the best algorithm is only about 22, roughly 2/3 of the potential.
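As a quick illustrative calculation (ignoring communication and synchronization costs, which also contribute to the saturation in the measured curve): under Amdahl's Law, a speedup of 22 on 32 processors corresponds to a parallel fraction of about

\[ \frac{1}{(1 - p) + p/32} = 22 \quad\Rightarrow\quad p \approx 0.985 , \]

that is, a serial fraction of only about 1.5%. Even a seemingly negligible serial portion is enough to cost about a third of the ideal speedup at this scale.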
Long ago, a brilliant researcher divided computer architectures into four classes. This division is now known as Flynn's taxonomy (Japanese-language Wikipedia):

SISD: single instruction stream, single data stream
SIMD: single instruction stream, multiple data streams
MISD: multiple instruction streams, single data stream
MIMD: multiple instruction streams, multiple data streams
Let's look at the basic hardware layouts in the last category, MIMD. First, the shared-memory architecture:
Then, the distributed-memory architecture:
In this session, we will focus on the first type, and cover the second type next week.
In line with the above, programming languages and libraries can be divided into several categories; the two most relevant here are shared-memory models (for example POSIX threads or OpenMP), in which threads communicate through ordinary loads and stores to a common address space, and message-passing models (for example MPI), in which processes exchange explicit messages.
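As a small illustration of the shared-memory style, here is a sketch using POSIX threads (the program itself is only an example, not taken from the lecture): two threads communicate simply by reading and writing the same variable, and use a mutex so that their updates do not race.

#include <pthread.h>
#include <stdio.h>

/* Shared state: both threads see the same memory. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* Without the mutex, this read-modify-write would race
           with the other thread and updates would be lost. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expect 2000000 */
    return 0;
}

Compile with cc -pthread. An MPI version of the same computation would instead send explicit messages and combine partial counts at the end.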
In the figure above, all of the processors can share all of the memory. Physical memory addresses are defined on the bus, and are the same when viewed from all processors. There are some key points:
The diagram above shows processors accessing shared memory through their caches. This presents a problem, since a given piece of data may now reside in more than one place. There are two basic schemes for cache coherence in shared-memory multiprocessors: snooping protocols, in which every cache watches (snoops) the shared bus for transactions involving blocks it holds, and directory-based protocols, in which a directory keeps track of which caches hold a copy of each block.
In a snooping protocol, a cache block can be in one of three states:

invalid: the block does not hold valid data
shared (read only): the block may also be present, unmodified, in other caches
exclusive (read/write): this cache holds the only copy, and may modify it
(A fourth state, modified, may be added to this list to distinguish an unmodified-but-exclusive block from a changed one. This approach is commonly known by the acronym MESI.)
When a processor writes a memory location, that address is placed on the bus, and all of the other processors see (snoop) it and know to invalidate their copies of that block. If the address is put on the bus but the data is not, the protocol is a write invalidate protocol. If the data is always put on the bus along with the address, it is called a write update or write broadcast protocol. The overhead of write update is higher, since data as well as addresses must be broadcast, but keeping all copies consistent simplifies subsequent bus requests.
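To make the write-invalidate idea concrete, here is a toy simulation in C (the cache organization, the sizes, and the modelling of the bus as a simple loop over the other caches are illustrative assumptions, not the design of a real controller):

#include <stdio.h>

#define NPROC  4                /* number of processors/caches        */
#define NLINES 8                /* direct-mapped lines per tiny cache */

enum state { INVALID, SHARED, EXCLUSIVE };   /* three-state protocol */

struct line { int tag; enum state st; };
static struct line cache[NPROC][NLINES];     /* zero-initialized: all INVALID */

/* Every other cache "snoops" the written address on the bus and
   invalidates its own copy: this is the write-invalidate scheme. */
static void bus_write_invalidate(int writer, int addr)
{
    int idx = addr % NLINES, tag = addr / NLINES;
    for (int p = 0; p < NPROC; p++) {
        if (p == writer) continue;
        if (cache[p][idx].st != INVALID && cache[p][idx].tag == tag)
            cache[p][idx].st = INVALID;
    }
}

static void read_addr(int p, int addr)       /* processor p reads addr */
{
    int idx = addr % NLINES;
    cache[p][idx].tag = addr / NLINES;
    if (cache[p][idx].st == INVALID)
        cache[p][idx].st = SHARED;           /* fetch a shared, read-only copy */
    /* (a real protocol would also downgrade an exclusive copy held
       by another cache; that step is omitted in this sketch) */
}

static void write_addr(int p, int addr)      /* processor p writes addr */
{
    int idx = addr % NLINES;
    bus_write_invalidate(p, addr);           /* only the address goes on the bus */
    cache[p][idx].tag = addr / NLINES;
    cache[p][idx].st  = EXCLUSIVE;           /* writer now owns the only valid copy */
}

int main(void)
{
    read_addr(0, 5);    /* P0 reads address 5: SHARED in cache 0    */
    read_addr(1, 5);    /* P1 reads address 5: SHARED in cache 1    */
    write_addr(0, 5);   /* P0 writes: P1's copy must be invalidated */
    printf("P0 state: %d, P1 state: %d\n", cache[0][5].st, cache[1][5].st);
    /* prints "P0 state: 2, P1 state: 0", i.e. EXCLUSIVE and INVALID */
    return 0;
}

The essential step is the loop in bus_write_invalidate: the address travels on the bus but the data does not, so the other caches simply mark their copies invalid, exactly as in the write invalidate protocol described above.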
None! You should have learned the key principles of this material in the bootcamp exercise.
Next week, we will continue with the discussion of multiprocessors, focusing on distributed-memory systems and synchronization primitives.
Next lecture:
Lecture 11, January 11: Systems: Distributed-Memory Multiprocessors and Interconnection Networks