慶應義塾大学
2012年度 秋学期

コンピューター・アーキテクチャ
Computer Architecture

2012年度秋学期 金曜日5限 (Fall 2012, Fridays 5th period)
  科目コード: 35010 / 2単位
開講場所:SFC
授業形態:講義
担当: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

第2回 10月05日 Lecture 2, October 05:
Faster!

Outline of This Lecture

Performance Measurement

Last week, we discussed some performance graphs, plotting (a) wall-clock time, (b) speedup, and (c) efficiency versus the number of threads used on a particular problem. This week, each of you will see how to collect that data and recreate the graphs yourself. You will need R, an account on armstrong.sfc.wide.ad.jp, and the code available from the link in the Homework section below.
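
In those graphs, the standard definitions apply: if T(n) is the wall-clock time using n threads, then

    Speedup(n) = T(1) / T(n)
    Efficiency(n) = Speedup(n) / n

Perfect linear scaling means Speedup(n) = n, i.e., Efficiency(n) = 1; real programs fall short of that, as your measurements will show.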

  1. Get the tarball onto armstrong.
  2. Unpack it using tar xvfz.
  3. Confirm that two of the parameters are reasonable. Look at the top of main.c, where you should see
    #define LOOP 1000
    #define REGISTER_SIZE 20
    
    Edit the file and fix those parameters, if they are different.
  4. Build both the library and the program hw1 using make.
  5. Execute time ./hw1 and record the results in a file that looks like
    1 440 435 0.2
    2 229 451 0.36
    3 161 475 0.42
    4 122 482 0.4
    
  6. You will want to run this with various parameters, including the number of threads (see the sketch after this list for one way to set it), and possibly the number of loops and the register size.
  7. Back on your laptop, in R, using the file fun.R or by hand, you can plot the results using commands like
    > ARMDATA1 <- matrix(scan("armstrong-one-run.dat"),ncol=4,byrow=T)
    Read 48 items
    > plot(ARMDATA1[,1],ARMDATA1[,2])
    > x <- seq(1,16)
    > y <- 440/x            # the ideal-scaling curve, T(1)/n
    > points(x,y,type="l")
    > help(plot)
    > plot(ARMDATA1[,1],ARMDATA1[,2],log="y")
    > points(x,y,type="l")
    > plot(ARMDATA1[,1],ARMDATA1[,2],log="xy")
    > points(x,y,type="l")
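
For step 6: if hw1 is built with OpenMP (as the pragmas in the source suggest), the easiest way to vary the number of threads without recompiling is the OMP_NUM_THREADS environment variable. A minimal standalone sketch (this is not the hw1 source) that just reports the thread count it was given:

    /* omp-count.c -- a standalone sketch (not hw1) that reports its
       thread count.
       Build: gcc -fopenmp omp-count.c -o omp-count
       Run:   OMP_NUM_THREADS=4 time ./omp-count */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* "single": only one thread in the team prints. */
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }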
    

定量的なデザイン概念
Quantitative Principles of Design

Let's talk about Hennessy & Patterson's Five Principles:

  1. Take Advantage of Parallelism
  2. Principle of Locality
  3. Focus on the Common Case
  4. Amdahl's Law
  5. The Processor Performance Equation
I would add to this one imperative: Achieve Balance.

Take Advantage of Parallelism

Parallelism can be found at several levels: using multiple processors on different parts of the problem; using multiple functional units (floating-point units, disk drives, etc.); or pipelining, which divides an individual instruction's execution into several stages so that different stages of different instructions execute at the same time in different parts of the CPU.
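
As a toy illustration of the first kind, a sketch (not the course code) of splitting independent loop iterations across processors with OpenMP:

    /* parsum.c -- a toy sketch of data parallelism (not the course code).
       The iterations are independent, so OpenMP may run slices of the
       loop on different processors simultaneously.
       Build: gcc -fopenmp -std=c99 parsum.c */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            a[i] = 1.0;
        /* reduction(+:sum) gives each thread a private partial sum and
           combines them when the loop finishes. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];
        printf("sum = %.0f\n", sum);
        return 0;
    }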

Principle of Locality

Programs tend to reuse data and instructions that they have used recently. There are two forms of locality: temporal (reuse of the same item over time) and spatial (use of items at nearby addresses). Locality is what allows a cache memory to work.
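
A classic sketch of the spatial form (illustrative, not from the course code): C stores two-dimensional arrays row-major, so the traversal order determines whether consecutive accesses land on the same cache line:

    /* locality.c -- loop order vs. the cache (an illustrative sketch).
       Build: gcc -O2 -std=c99 locality.c */
    #include <stdio.h>

    #define N 2048
    static double m[N][N];

    int main(void)
    {
        double sum = 0.0;
        /* Spatial locality: adjacent iterations touch adjacent
           addresses, so most accesses hit in the cache. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        /* Poor locality: consecutive accesses are N*8 bytes apart, so
           nearly every access misses, even though the arithmetic done
           is identical. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        printf("%f\n", sum);
        return 0;
    }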

Focus on the Common Case

The things that are done a lot should be fast; the things that are rare may be slow.
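
For instance (a sketch, not any particular library's actual implementation), a copy routine can give the frequent aligned case a cheap word-at-a-time path and shunt the rare unaligned case to generic code:

    /* fastpath.c -- common-case sketch: make the frequent aligned case
       fast; let the rare unaligned case take the slower generic path. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    static void copy_bytes(void *dst, const void *src, size_t n)
    {
        /* Common case: pointers and length all 8-byte aligned ->
           copy one 64-bit word at a time. */
        if ((((uintptr_t)dst | (uintptr_t)src | n) & 7) == 0) {
            uint64_t *d = dst;
            const uint64_t *s = src;
            for (size_t i = 0; i < n / 8; i++)
                d[i] = s[i];
            return;
        }
        /* Rare case: fall back to the library's generic copy. */
        memcpy(dst, src, n);
    }

    int main(void)
    {
        char a[16] = "hello, world!";
        char b[16];
        copy_bytes(b, a, sizeof(a));
        printf("%s\n", b);
        return 0;
    }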

Amdahl's Law

Amdahl's Law tells us how much improvement is possible by making the common case fast, or by parallelizing part of the algorithm. In the example below, 3/5 of the algorithm can be parallelized, meaning that three times as much hardware applied to the problem gains us only a reduction from five time units to three.

[Figure: Example of Amdahl's Law, parallel and serial portions.]
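
Working through the figure's numbers with the standard formula: if a fraction P of the work can be parallelized across N processors, then

    Speedup = 1 / ((1 - P) + P/N)

Here P = 3/5 and N = 3, so Speedup = 1 / (2/5 + 1/5) = 5/3, and five time units shrink to three, matching the figure. The serial portion also caps the benefit: even as N goes to infinity, the speedup can never exceed 1 / (2/5) = 2.5.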

Some problems, most famously graphics, are known as "embarrassingly parallel" problems, in which extracting parallelism is trivial, and performance is primarily determined by input/output bandwidth and the number of processing elements available. More generally, the parallelism achievable is determined by the dependency graph. Creating that graph and scheduling operations to maximize the parallelism and enforce correctness is generally the shared responsibility of the hardware architecture and the compiler.

[Figure: Dependency graph for the above figure.]
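
A sketch of what the graph captures (illustrative C, not the course code): the first loop below has no edges between iterations and is embarrassingly parallel; the second has a chain of dependences, so its iterations cannot run simultaneously:

    /* deps.c -- dependence sketch.  Build: gcc -fopenmp -std=c99 deps.c */
    #include <stdio.h>

    #define N 1024
    static double a[N], b[N];

    int main(void)
    {
        /* No cross-iteration dependences: the dependency graph is N
           disconnected nodes, so a parallel for is safe. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] * 2.0;

        /* Loop-carried dependence: a[i] needs a[i-1], so the graph is
           a chain; parallelizing this loop as-is would be incorrect. */
        for (int i = 1; i < N; i++)
            a[i] = a[i-1] + b[i];

        printf("%f\n", a[N-1]);
        return 0;
    }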

プロセッサー・パフォーマンス方程式
The Processor Performance Equation

CPU time = seconds / program
         = (instructions / program) × (clock cycles / instruction) × (seconds / clock cycle)
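
A quick worked example (the numbers are invented, purely to exercise the equation): a program that executes 2 × 10^9 instructions at an average of 1.5 clock cycles per instruction on a 2 GHz machine (0.5 ns per cycle) takes

    CPU time = 2 × 10^9 × 1.5 × 0.5 ns = 1.5 seconds

The three factors pull against each other: the compiler and ISA influence the instruction count, the microarchitecture the cycles per instruction, and the circuit technology the clock period.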

What's a Computer?

[Photos: IBM blade server, IBM Lenovo desktop, OLPC, iPhone, Adidas shoe with embedded processor]

What's in a Computer?

(Here's the fun part...)

[Figure: Wikipedia motherboard block diagram]

Our computer

[Photos: Armstrong back panel, front panel, top view, memory, PCI slots, and disk]

Let's go visit it!

宿題
Homework

The source you need, including script files, is available in a tar file here.

The specification for OpenMP, and a "summary card" for C and C++, are available here. The latest version is 3.1, but there is a Japanese version of the 3.0 spec available. 最新のバージョンは3.1だが、3.0の日本語版はありますよ!

This week's homework (submit via SFS, due 10/12):

  1. Change the compiler from gcc to icc, Intel's C compiler. Replot the data, putting both sets of data (gcc and icc) on the plot. How much faster does it get? Is the speedup the same for all problem sizes?
  2. In architecture/src/qulib/sim.c, you will find the functions cnot() and Hadamard(). In the directive
    /* XXX parallelizing this loop is tricky, but it's a "big" loop, so worth doing... */
    #pragma omp parallel for schedule(static,8192) private(j,k,z)
    

    the number 8192 is the chunk size: how many consecutive iterations over the large array each thread takes at a time. Make that number both smaller and larger in both functions to see the effects (a standalone sketch of how schedule() deals out iterations appears after this list). Save these as separate data sets, and plot them all together on one plot.
    1. First, eliminate the "schedule" altogether; try it with
      #pragma omp parallel for private(j,k,z)
      
    2. Next, try it with schedule(static,16).
    3. schedule(static,256).
    4. schedule(static,1024).
    5. schedule(static,4096).
    6. schedule(static,16384).
    7. By now, you should have some idea of what values will work well. Choose the optimal value for the schedule size for this application and machine.
  3. Plot the results with different numbers of threads and the different compilers in a set of graphs like the ones from last week. You can put all of the data on one graph (giving you three graphs, total) or on separate graphs.
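
For item 2, here is a standalone sketch (not qulib/sim.c) that makes schedule(static,chunk) visible by printing which thread ran each iteration:

    /* sched.c -- a standalone sketch (not qulib/sim.c) showing how
       schedule(static,chunk) deals out loop iterations.
       Build: gcc -fopenmp -std=c99 sched.c
       Run:   OMP_NUM_THREADS=4 ./a.out */
    #include <stdio.h>
    #include <omp.h>

    #define N 32

    int main(void)
    {
        int owner[N];

        /* With chunk = 8, thread 0 takes iterations 0-7, thread 1 takes
           8-15, and so on, round-robin.  Small chunks spread load more
           evenly but add scheduling overhead and hurt locality; large
           chunks do the reverse.  That trade-off is exactly what your
           measurements should reveal. */
        #pragma omp parallel for schedule(static, 8)
        for (int i = 0; i < N; i++)
            owner[i] = omp_get_thread_num();

        for (int i = 0; i < N; i++)
            printf("iteration %2d ran on thread %d\n", i, owner[i]);
        return 0;
    }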

Next Lecture


第3回 10月12日 プロセッサー:命令の基本
Lecture 3, October 12: Processors: Basics of Instruction Sets

以下で、P-Hはコンピュータの構成と設計~ハードウエアとソフトウエアのインタフェース 第3版、H-Pはコンピュータアーキテクチャ 定量的アプローチ 第4版。

Below, P-H is Computer Organization and Design: The Hardware-Software Interface (3rd edition), and H-P is Computer Architecture: A Quantitative Approach (4th edition).

Readings for next time:

Additional Information

その他