Keio University
2020 Spring Semester

Computer Architecture

Course code: 35010 / 2 credits
Category:
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 10, July 13: Systems: Distributed-Memory Multiprocessors and Interconnection Networks

Let's go to the moon! The Apollo-11 Guidance Computer (AGC) source code is available on the web! See here.

Outline of This Lecture

Review: Shared Memory v. Distributed Memory

Last time, we discussed shared-memory multiprocessors. We began with the Types of Parallel Machines, according to Flynn's taxonomy.

Let's look at the basic hardware layout in the last category. First, the shared-memory architecture:

H-P Fig. 4.1

Then, the distributed-memory architecture:

H-P Fig. 4.2

This week, we are focusing on the latter type.

Distributed Shared Memory

In the diagram above, fairly obviously, system performance and scalability will be limited by the bus between the processors and the main memory. To achieve better scalability, the memory can be distributed among multiple nodes connected by an interconnect, as in the lower picture. If the system allows all CPUs to access the memory at all nodes using a hardware-based mechanism, as if the memory were local, it is called a distributed shared memory (DSM) architecture. Because the latency to memory depends on where the CPU and the memory sit in the network, these systems are also called non-uniform memory access (NUMA) architectures.
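
To make the "non-uniform" part concrete, here is a minimal sketch (mine, not from the textbook) using Linux's libnuma to place a buffer on one node while touching it from another, so every access is a remote, higher-latency access. The node numbers and buffer size are arbitrary; compile with -lnuma.

/* Sketch: placing memory on a specific NUMA node with Linux libnuma.
 * Not from the lecture; assumes a Linux machine with libnuma installed.
 * Illustrates that on a NUMA machine the programmer (or OS) can control
 * which node's memory a buffer lives in, and hence how far away it is
 * from the CPU that touches it. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "This kernel has no NUMA support.\n");
        return 1;
    }

    int last_node = numa_max_node();      /* highest node number present */
    size_t sz = 64 * 1024 * 1024;         /* 64 MB test buffer           */

    /* Allocate the buffer on node 0, but run the touching loop on the
     * last node: every access is then a remote (higher-latency) access. */
    char *buf = numa_alloc_onnode(sz, 0);
    if (buf == NULL) { perror("numa_alloc_onnode"); return 1; }

    numa_run_on_node(last_node);          /* pin this thread to that node */
    memset(buf, 0xA5, sz);                /* remote writes                */

    printf("touched %zu bytes on node 0 from node %d\n", sz, last_node);
    numa_free(buf, sz);
    return 0;
}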

One of the key issues we discussed last time was cache coherence. We discussed snooping buses and directory-based protocols, focusing on the former. However, snooping buses don't scale well, so DSM systems generally use a directory-based protocol.

H-P Fig. 4.19: hardware-based cache coherence directory for distributed shared memory

Each of the nodes maintains the state for all of the blocks currently in its cache, in a manner almost identical to the shared-memory case:

H-P Fig. 4.21: state diagram for cache blocks

But rather than all nodes receiving changes to the state of every cache block, each memory block has a home directory entry in the cache directory. That directory entry must maintain a list of all nodes that currently have the block cached, and send invalidate messages to them as necessary.

H-P Fig. 4.22: state diagram for cache directory
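
As a rough illustration of what a home node's directory entry has to track, here is a toy sketch loosely modeled on the three directory states in Fig. 4.22 (Uncached, Shared, Modified). The struct, state names, and message helper are mine for illustration, not real hardware or textbook code.

/* Toy sketch of a directory entry for one memory block in a DSM machine. */
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 64

typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => node i holds a copy */
    int         owner;     /* valid only in MODIFIED state     */
} dir_entry_t;

/* Stand-in for a network message to a remote node's cache controller. */
static void send_invalidate(int node, uint64_t block_addr) {
    printf("invalidate block 0x%llx at node %d\n",
           (unsigned long long)block_addr, node);
}

/* The home node handles a write miss from 'requester': all current
 * sharers must be invalidated before exclusive ownership is granted. */
void handle_write_miss(dir_entry_t *e, int requester, uint64_t block_addr) {
    if (e->state == SHARED || e->state == MODIFIED) {
        for (int n = 0; n < MAX_NODES; n++)
            if (((e->sharers >> n) & 1) && n != requester)
                send_invalidate(n, block_addr);
    }
    e->state   = MODIFIED;
    e->sharers = 1ULL << requester;  /* only the writer keeps a copy */
    e->owner   = requester;
}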

Interconnect Networks

An io9 article with photos showing why we need networks.

H-P Fig. E.2: OCN, SAN, LAN, WAN bandwidth and node count

The topology of the network determines a number of characteristics that impact performance, such as its diameter (the worst-case number of hops between nodes) and its bisection bandwidth.

The table below (from my thesis) lists a few topologies:

Table 5.1 from my thesis
Figure 5.6 from my thesis
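
To make the comparison concrete, here is a small sketch (standard textbook formulas, my own code) that prints the diameter and bisection width of a ring, a 2D torus, a 3D torus, and a hypercube at the same node count. Compile with -lm.

/* Back-of-the-envelope comparison of topology metrics: diameter in hops,
 * bisection width in links, for N nodes. The formulas are the usual ones
 * for rings, k x k (x k) tori with wraparound, and hypercubes. */
#include <math.h>
#include <stdio.h>

int main(void) {
    int N  = 64;                      /* number of nodes to compare at */
    int k2 = (int)round(sqrt(N));     /* side of a 2D torus, k2*k2 = N */
    int k3 = (int)round(cbrt(N));     /* side of a 3D torus, k3^3  = N */
    int d  = (int)round(log2(N));     /* hypercube dimension, 2^d  = N */

    printf("ring:      diameter %d, bisection %d\n", N / 2, 2);
    printf("2D torus:  diameter %d, bisection %d\n", 2 * (k2 / 2), 2 * k2);
    printf("3D torus:  diameter %d, bisection %d\n", 3 * (k3 / 2), 2 * k3 * k3);
    printf("hypercube: diameter %d, bisection %d\n", d, N / 2);
    return 0;
}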

In addition to the topologies shown above, there are other important ones:

Clos networks are defined by three integers n, m, and r. n represents the number of sources which feed into each of r ingress stage crossbar switches. Each ingress stage crossbar switch has m outlets, and there are m centre stage crossbar switches. There is exactly one connection between each ingress stage switch and each middle stage switch. There are r egress stage switches, each with m inputs and n outputs. Each middle stage switch is connected exactly once to each egress stage switch.

Such switched topologies may be either blocking or non-blocking. If m ≥ n, the Clos network is rearrangeably nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, but for this to take place, existing calls may have to be rearranged by assigning them to different centre stage switches in the Clos network. If m ≥ 2n - 1, the Clos network is strict-sense nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, without having to re-arrange existing calls.
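
A tiny sketch, just to make those two thresholds concrete (the helper function and example values are mine):

/* Classify a Clos(n, m, r) network using the conditions in the text:
 *   m >= n      -> rearrangeably nonblocking
 *   m >= 2n - 1 -> strict-sense nonblocking
 * Note that r does not appear in either condition. */
#include <stdio.h>

const char *clos_class(int n, int m) {
    if (m >= 2 * n - 1) return "strict-sense nonblocking";
    if (m >= n)         return "rearrangeably nonblocking";
    return "blocking";
}

int main(void) {
    printf("Clos(n=4, m=4): %s\n", clos_class(4, 4));   /* rearrangeable */
    printf("Clos(n=4, m=7): %s\n", clos_class(4, 7));   /* strict-sense  */
    printf("Clos(n=4, m=3): %s\n", clos_class(4, 3));   /* blocking      */
    return 0;
}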

Before our understanding of the network is complete, we must know a few things about each link:
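
At minimum, each link has a bandwidth and a per-message latency, and the usual first-order model for transferring a message combines the two: total time ≈ latency + size / bandwidth. The sketch below uses made-up link parameters, not numbers from any real network.

/* First-order model of one link: time ~= latency + size / bandwidth.
 * Placeholder numbers only, for illustration. */
#include <stdio.h>

int main(void) {
    double latency_s     = 1e-6;        /* 1 microsecond per message  */
    double bandwidth_Bps = 10e9 / 8;    /* 10 Gb/s link, in bytes/sec */
    double sizes[] = { 64, 4096, 1 << 20 };

    for (int i = 0; i < 3; i++) {
        double t = latency_s + sizes[i] / bandwidth_Bps;
        printf("%8.0f bytes: %.2f microseconds\n", sizes[i], t * 1e6);
    }
    return 0;
}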

Putting it All Together: Fugaku

I am currently putting together a set of ad hoc notes on Fugaku, Japan's new entry into the top of the June 2020 Top500 Supercomputers list.

Putting it All Together: Blue Gene

One of the most prominent examples of a MIMD multicomputer was IBM's Blue Gene, the company's top-of-the-line supercomputer in the mid-2000s. The LLNL machine held the number one position on the Top 500 Supercomputers list for 3.5 years, until 2008.

The machine at LLNL was a 32x32x64 3D torus: 106,496 dual-processor nodes, with a total of 64 terabytes of RAM. Besides the main data network, it had several additional, special-purpose networks: one for fast synchronization (global barriers); one for data reduction (e.g., adding up all of the results) and broadcast; an Ethernet for management; and the dedicated I/O nodes had a 1,024-gigabit-per-second interface to its 806-terabyte file system. The largest system was believed to be capable of sustained performance of 280 teraflops (2.8×10^14 floating point operations per second).

Blue Gene system (from LLNL)
Blue Gene cabinet (from Wikipedia)
Blue Gene (from Wikipedia)
Blue Gene ASIC (from IBM)

The hardware does not support a view of shared memory; this is a pure message-passing machine, in which inter-process and inter-node communication is entirely the responsibility of the programmer. The I/O nodes ran a full Linux kernel, but the compute nodes ran only a basic runtime OS, without full POSIX semantics. Each processor core was generally assigned only a single process (statically, I believe); if the programmer wanted threads, she had to implement them herself using a library. Programming was (is) generally done using MPI.
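
For flavor, here is a minimal, generic MPI program in the message-passing style described above: every rank contributes a partial result and MPI_Allreduce combines them, which on Blue Gene would be mapped onto the dedicated reduction/broadcast network. This is plain MPI, not Blue-Gene-specific code; compile with mpicc and launch with mpirun.

/* Minimal MPI sketch: each rank computes a partial sum, then
 * MPI_Allreduce combines them so every rank sees the total. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes one partial result... */
    double partial = (double)(rank + 1);
    double total   = 0.0;

    /* ...and the reduction combines them for everybody. */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.1f\n", size, total);

    MPI_Finalize();
    return 0;
}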

The MTBF of the largest system installation was reportedly only about 6.16 days (dominated by memory failures)!
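
A back-of-the-envelope check (mine, not from the lecture): if node failures are roughly independent, the system MTBF is about the per-node MTBF divided by the node count, so 6.16 days across roughly 106,496 nodes implies each node fails only about once every 1,800 years.

/* Rough sanity check: implied per-node MTBF, assuming independent failures. */
#include <stdio.h>

int main(void) {
    double system_mtbf_days = 6.16;
    double nodes = 106496.0;
    double per_node_mtbf_years = system_mtbf_days * nodes / 365.25;
    printf("implied per-node MTBF: about %.0f years\n", per_node_mtbf_years);
    return 0;
}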

Next Lecture

Next two lectures:

Lecture 11, July 14: Arithmetic

Lecture 12, July 16: First I/O lecture

Follow-up from this lecture:

Additional Information

Other