Keio University
2010 Academic Year, Fall Semester

コンピューター・アーキテクチャ
Computer Architecture

2010 Fall Semester, Tuesday, 3rd Period
Course code: 35010 / 2 credits
Category:
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

第11回 12月14日
Lecture 11, December 14: Systems: Distributed-Memory Multiprocessors and Interconnection Networks

Outline of This Lecture

Parallel Programming Tools

We will work with examples in OpenMP and pthreads.
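
As a point of comparison, a minimal pthreads "hello" program might look like the sketch below (an illustration only, not the pthreads code used in the homework; compile with gcc -o pthread-hello pthread-hello.c -lpthread):

/* Minimal pthreads hello sketch: one thread per command-line count. */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

static void *hello(void *arg)
{
    long id = (long)arg;              /* thread index passed in by main() */
    printf("hello(%ld) ", id);
    return NULL;
}

int main(int argc, char *argv[])
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t *threads = malloc(nthreads * sizeof(pthread_t));
    long i;

    for (i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, hello, (void *)i);
    for (i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);   /* wait for each thread to finish */

    printf("\n");
    free(threads);
    return 0;
}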

Programming Parallel Systems

The OpenMP "hello, world" example using multiple threads:

Rodney-Van-Meters-MacBook-Pro:particles rdv$ gcc -fopenmp -o openmp-hello openmp-hello.c
Rodney-Van-Meters-MacBook-Pro:particles rdv$ ./openmp-hello 10
hello(1) hello(2) hello(5) hello(7) hello(0) hello(3) hello(4)
hello(6) hello(8) hello(9) world(1) world(2) world(5) world(7) 
world(0) world(3) world(4) world(6) world(8) world(9) 
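
The source for this program is only a few lines. A minimal sketch (the actual openmp-hello.c may differ in details) is:

/* Minimal OpenMP hello sketch: thread count taken from the command line. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;

    omp_set_num_threads(nthreads);

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("hello(%d) ", id);
        #pragma omp barrier    /* every thread says hello before any says world */
        printf("world(%d) ", id);
    }
    printf("\n");
    return 0;
}

Note that the threads print in whatever order they happen to run; only the barrier guarantees that all of the hellos come before any of the worlds.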

An example of "reduction" in OpenMP:

Rodney-Van-Meters-MacBook-Pro:particles rdv$ ./openmp-reduction 
thread 0 added in i=0, total now 0.000000
thread 1 added in i=25, total now 25.000000
thread 0 added in i=1, total now 1.000000
thread 1 added in i=26, total now 51.000000
thread 0 added in i=2, total now 3.000000
thread 1 added in i=27, total now 78.000000
thread 0 added in i=3, total now 6.000000
thread 1 added in i=28, total now 106.000000
thread 0 added in i=4, total now 10.000000
thread 1 added in i=29, total now 135.000000
thread 0 added in i=5, total now 15.000000
thread 1 added in i=30, total now 165.000000
thread 0 added in i=6, total now 21.000000
thread 1 added in i=31, total now 196.000000
thread 0 added in i=7, total now 28.000000
thread 1 added in i=32, total now 228.000000
thread 0 added in i=8, total now 36.000000
thread 1 added in i=33, total now 261.000000
thread 0 added in i=9, total now 45.000000
thread 1 added in i=34, total now 295.000000
thread 0 added in i=10, total now 55.000000
thread 1 added in i=35, total now 330.000000
thread 0 added in i=11, total now 66.000000
thread 0 added in i=12, total now 78.000000
thread 0 added in i=13, total now 91.000000
thread 0 added in i=14, total now 105.000000
thread 0 added in i=15, total now 120.000000
thread 0 added in i=16, total now 136.000000
thread 1 added in i=36, total now 366.000000
thread 0 added in i=17, total now 153.000000
thread 1 added in i=37, total now 403.000000
thread 0 added in i=18, total now 171.000000
thread 1 added in i=38, total now 441.000000
thread 0 added in i=19, total now 190.000000
thread 1 added in i=39, total now 480.000000
thread 0 added in i=20, total now 210.000000
thread 1 added in i=40, total now 520.000000
thread 0 added in i=21, total now 231.000000
thread 1 added in i=41, total now 561.000000
thread 0 added in i=22, total now 253.000000
thread 1 added in i=42, total now 603.000000
thread 0 added in i=23, total now 276.000000
thread 1 added in i=43, total now 646.000000
thread 0 added in i=24, total now 300.000000
thread 1 added in i=44, total now 690.000000
thread 1 added in i=45, total now 735.000000
thread 1 added in i=46, total now 781.000000
thread 1 added in i=47, total now 828.000000
thread 1 added in i=48, total now 876.000000
thread 1 added in i=49, total now 925.000000
ave: 24.500000
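
A minimal sketch of a reduction loop like this one (the actual openmp-reduction.c may differ in details) follows. Each thread accumulates into its own private copy of total, and OpenMP combines the copies when the loop ends, which is why each thread's running total above covers only its own half of the iterations:

/* Minimal OpenMP reduction sketch: sum 0..49 across two threads. */
#include <stdio.h>
#include <omp.h>

#define N 50

int main(void)
{
    double total = 0.0;
    int i;

    #pragma omp parallel for reduction(+:total) num_threads(2)
    for (i = 0; i < N; i++) {
        total += i;    /* "total" here is this thread's private copy */
        printf("thread %d added in i=%d, total now %f\n",
               omp_get_thread_num(), i, total);
    }

    /* The private copies have now been combined into the shared "total". */
    printf("ave: %f\n", total / N);
    return 0;
}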

Review: Shared Memory v. Distributed Memory

Last week, we discussed shared-memory multiprocessors. We began with the Types of Parallel Machines, according to Flynn's taxonomy.

Let's look at the basic hardware layout in the last category, MIMD. First, the shared-memory architecture:

H-P Fig. 4.1

Then, the distributed-memory architecture:

H-P Fig. 4.2

This week, we are focusing on the latter type.

Distributed Shared Memory

In the first diagram, fairly obviously, system performance and scalability are limited by the bus between the processors and main memory. To achieve better scalability, the memory can be distributed among the nodes, which are then connected by an interconnect, as in the second diagram. If the system allows every CPU to access the memory at every node through a hardware mechanism, as if that memory were local, it is called a distributed shared memory (DSM) architecture. Because the latency to memory depends on where the CPU and the memory sit in the network, these systems are also called non-uniform memory access (NUMA) architectures.

One of the key issues we discussed last time was cache coherence. We discussed snooping buses and directory-based protocols, focusing on the former. However, snooping buses don't scale well, so DSM systems generally use a directory-based protocol.

H-P Fig. 4.19: hardware-based cache coherence directory for distributed shared memory

Each of the nodes maintains the state for all of the blocks currently in its cache, in a manner almost identical to the shared-memory case:
H-P Fig. 4.21: state diagram for cache blocks

But rather than all nodes receiving changes to the state of every cache block, each memory block has a home directory entry in the cache directory. That directory entry must maintain a list of all nodes that currently have the block cached, and send invalidate messages to them as necessary.

H-P Fig. 4.22: state diagram for the cache directory
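
To make the directory concrete, here is a minimal C sketch (an illustration, not the exact protocol of the H-P figures) of what the home node might keep for each memory block, and the invalidation step taken when one node wants to write:

/* Hedged sketch of a directory entry for one memory block. */
#include <stdio.h>
#include <stdint.h>

#define MAX_NODES 64

typedef enum {
    DIR_UNCACHED,    /* no node holds a copy */
    DIR_SHARED,      /* one or more nodes hold read-only copies */
    DIR_EXCLUSIVE    /* exactly one node holds a writable copy */
} dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => node i has the block cached */
} dir_entry_t;

/* When node "writer" requests write permission, the home node invalidates
 * every other sharer before granting exclusive ownership. */
static void grant_write(dir_entry_t *e, int writer)
{
    int node;
    for (node = 0; node < MAX_NODES; node++) {
        if (node != writer && (e->sharers & (1ULL << node))) {
            /* send an invalidate message to "node" (message send omitted) */
            e->sharers &= ~(1ULL << node);
        }
    }
    e->sharers = 1ULL << writer;
    e->state = DIR_EXCLUSIVE;
}

int main(void)
{
    dir_entry_t e = { DIR_SHARED, 0x15 };   /* nodes 0, 2, and 4 share the block */
    grant_write(&e, 2);                     /* node 2 wants to write */
    printf("state=%d sharers=0x%llx\n", e.state, (unsigned long long)e.sharers);
    return 0;
}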

Interconnect Networks

H-P Fig. E.2: OCN, SAN, LAN, WAN bandwidth and node count

The topology of the network determines a number of characteristics that impact performance, such as its diameter, node degree, and bisection bandwidth.

The table below (from my thesis) lists a few topologies:

table 5.1 from my thesis
figure 5.6 from my thesis

In addition to the topologies shown above, there are other important ones:

Clos networks are defined by three integers n, m, and r. n represents the number of sources which feed into each of r ingress stage crossbar switches. Each ingress stage crossbar switch has m outlets, and there are m centre stage crossbar switches. There is exactly one connection between each ingress stage switch and each middle stage switch. There are r egress stage switches, each with m inputs and n outputs. Each middle stage switch is connected exactly once to each egress stage switch.

Such switched topologies may be either blocking or non-blocking. If m ≥ n, the Clos network is rearrangeably nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, but for this to take place, existing calls may have to be rearranged by assigning them to different centre stage switches in the Clos network. If m ≥ 2n - 1, the Clos network is strict-sense nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, without having to re-arrange existing calls.
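
To make the n, m, r parameters concrete, here is a small hedged example (an illustration, not from H-P) that counts crosspoints in a three-stage Clos network and compares it with a single crossbar of the same size:

/* Crosspoint count for a 3-stage Clos(n, m, r) network vs. one big crossbar. */
#include <stdio.h>

static long clos_crosspoints(long n, long m, long r)
{
    /* r ingress switches (n x m) + m middle switches (r x r)
     * + r egress switches (m x n) */
    return r * n * m + m * r * r + r * m * n;
}

int main(void)
{
    long n = 8, r = 8;
    long m = 2 * n - 1;      /* strict-sense nonblocking choice */
    long N = r * n;          /* total network inputs (and outputs) */

    printf("Clos(n=%ld, m=%ld, r=%ld): %ld crosspoints\n",
           n, m, r, clos_crosspoints(n, m, r));
    printf("%ldx%ld crossbar: %ld crosspoints\n", N, N, N * N);
    return 0;
}

With n = r = 8 and m = 2n - 1 = 15, the Clos network needs 2,880 crosspoints versus 4,096 for a 64x64 crossbar, and the savings grow as the port count increases.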

Before our understanding of the network is complete, we must also know a few things about each link, such as its bandwidth and latency.

Putting it All Together: Blue Gene

One of the most prominent examples of a MIMD multicomputer, or cluster, today is IBM's Blue Gene supercomputer. The MTBF of the largest system installation is reportedly only about 6.16 days (dominated by memory failures)!

The machine at LLNL is a 32x32x64 3D torus: 106,496 dual-processor nodes, 64 terabytes of RAM, several additional special-purpose networks for global barriers, interrupts, and data reduction (e.g., adding up all of the results), and a 1,024 gigabit per second interface to its 806-terabyte file system. The largest system is believed to be capable of a sustained performance of 280 teraflops (2.8×10^14 floating-point operations per second).

Blue Gene system (from LLNL)
Blue Gene cabinet (from Wikipedia)
Blue Gene (from Wikipedia)
Blue Gene ASIC (from IBM)

宿題
Homework

This week's homework (last homework! Due 1/11):

All of these problems involve variants of the particles program, available on the Berkeley Parallel Bootcamp exercise page. For each problem, run with n = 500, 1000, and 2000 particles and plot the execution time. Run each value five times and report the mean and standard deviation.

Chalk sketches of the graphs I want are here and here.

The simplest option is probably to do the work on ccx00.sfc.keio.ac.jp and use the OpenMP version of the program, but you can do the last exercise with either pthreads or MPI, if you want, and you can use any machine(s) where you have the proper tools available.

  1. First, the serial version.
  2. Second, the existing version of the pthreads program.
    1. First, for -p 1 (one thread). Compare to the serial version.
    2. Next, for 2, 3, 4, 6, 8, and 16 threads.
  3. Third, the existing version of the OpenMP program.
    1. First, for one thread. Compare to the serial version. (You may have to modify the code to allow you to select the number of threads.)
    2. Next, for 2, 3, 4, 6, 8, and 16 threads.
  4. Pick one of the parallel programs: pthreads, OpenMP, or MPI. Solve the problem stated at the Berkeley Parallel Bootcamp exercise page:
    The existing programs all perform poorly because too much information gets shared around: each per-particle loop examines all of the other particles, which is unnecessary. Your job is to make the program scale better with the number of particles and with the number of processes or threads, by reducing the number of particles that each one examines. One common approach is sketched after this list.
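
One common approach (a hedged sketch, not the required solution, and using an illustrative particle layout rather than the Bootcamp's common.h) is to bin the particles into a grid of cells at least one cutoff radius wide, so that each particle only has to examine the particles in its own cell and the eight neighboring cells:

/* Hedged sketch of spatial binning for the particles problem. */
#include <stdio.h>
#include <stdlib.h>

#define CUTOFF 0.01     /* interaction radius (assumed value) */
#define DOMAIN 1.0      /* particles live in [0, DOMAIN) x [0, DOMAIN) (assumed) */

typedef struct { double x, y; } particle_t;    /* illustrative only */

int main(void)
{
    particle_t p[4] = { {0.001, 0.001}, {0.002, 0.003}, {0.5, 0.5}, {0.99, 0.99} };
    int n = 4;
    int ncells = (int)(DOMAIN / CUTOFF);               /* cells per side */
    int *head = malloc(ncells * ncells * sizeof(int)); /* first particle in each cell */
    int *next = malloc(n * sizeof(int));               /* next particle in same cell */
    int i, c;

    for (c = 0; c < ncells * ncells; c++)
        head[c] = -1;
    for (i = 0; i < n; i++) {
        int cx = (int)(p[i].x / CUTOFF);
        int cy = (int)(p[i].y / CUTOFF);
        c = cy * ncells + cx;
        next[i] = head[c];    /* push particle i onto its cell's list */
        head[c] = i;
    }

    /* To compute the forces on particle i, walk only the lists of its own
       cell and the eight neighboring cells instead of looping over all n. */
    for (i = 0; i < n; i++)
        printf("particle %d is in cell (%d,%d)\n",
               i, (int)(p[i].x / CUTOFF), (int)(p[i].y / CUTOFF));

    free(head);
    free(next);
    return 0;
}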

Next Lecture

Next week, we will continue with the discussion of multiprocessors for a third week, focusing on distributed-memory systems and synchronization primitives.

Next lecture:

第12回 12月21日
Lecture 12, December 21: Systems: Distributed-Memory Multiprocessors and Interconnection Networks

Follow-up from this lecture:

Additional Information

Other