慶應義塾大学
2012年度 秋学期
コンピューター・アーキテクチャ
Computer Architecture
第2回 10月05日 Lecture 2, October 05:
Faster!
Outline of This Lecture
- Performance Measurement
- Fundamentals of Computer Design
コンピュータデザインの基礎
- Quantitative Principles of Design
定量的なデザイン概念
- What's a Computer?
コンピュータって、何?
- What's in a Computer's Guts?
コンピュータの内臓は?
- Homework/課題
Performance Measurement
Last week, we discussed some performance
graphs, plotting (a) wall-clock time, (b) speedup, and (c) efficiency
versus the number of threads used on a particular problem. This week,
each of you will see how to take that data and recreate the
graphs yourself. You will need R,
an account on armstrong.sfc.wide.ad.jp, and the code available from
the link below under Homework.
- Get the tarball onto armstrong.
- Unpack it using tar xvfz.
- Confirm that two of the parameters are reasonable. Look at the
top of main.c, where you should see
#define LOOP 1000
#define REGISTER_SIZE 20
Edit the file and fix those parameters, if they are different.
- Build both the library and the program hw1
using make.
- Execute time ./hw1 and record the results in a file
that looks like
1 440 435 0.2
2 229 451 0.36
3 161 475 0.42
4 122 482 0.4
- You will want to run this with various parameters, including the
number of threads, possibly the number of loops, and the register
size.
- Back on your laptop, in R, using the file fun.R or by
hand, you can plot the results using commands like
> ARMDATA1 <- matrix(scan("armstrong-one-run.dat"),ncol=4,byrow=T)
Read 48 items
> plot(ARMDATA1[,1],ARMDATA1[,2])
> x <- seq(1,16)
> y = 440/x
> points(x,y,type="l")
> help(plot)
> plot(ARMDATA1[,1],ARMDATA1[,2],log="y")
> points(x,y,type="l")
> plot(ARMDATA1[,1],ARMDATA1[,2],log="xy")
> points(x,y,type="l")
定量的なデザイン概念
Quantitative Principles of Design
Let's talk about Hennessy & Patterson's Five Principles:
- Take Advantage of Parallelism
- Principle of Locality
- Focus on the Common Case
- Amdahl's Law
- The Processor Performance Equation
I would add to this one imperative: Achieve Balance.
Take Advantage of Parallelism
Parallelism can be found by using multiple processors on different
parts of the problem, or multiple functional units (floating-point
units, disk drives, etc.), or by pipelining: dividing the execution of
an individual instruction into several stages and executing stages of
different instructions at the same time in different parts of the CPU.
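As a small illustration of the first kind, here is a minimal OpenMP sketch in C; the arrays and the loop body are invented for illustration, but it uses the same #pragma omp parallel for construct as the homework code. Compile it with gcc -fopenmp so the pragma takes effect; without OpenMP support the loop simply runs serially.
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    int i;

    /* The iterations of this loop are divided among the available
       threads, so different processors work on different parts of
       the arrays at the same time. */
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("up to %d threads available\n", omp_get_max_threads());
    return 0;
}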
Principle of Locality
Programs tend to reuse data and instructions that they have used
recently. There are two forms of locality: spatial (items whose
addresses are near a recently used item are likely to be used soon)
and temporal (a recently used item is likely to be used again soon).
Locality is what allows a cache memory to work.
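A minimal sketch of the spatial form, using an ordinary C two-dimensional array (the names and sizes are arbitrary): C stores m[i][j] and m[i][j+1] next to each other in memory, so the row-order loop uses every word of each cache line it fetches, while the column-order loop touches a new line on almost every access. The temporal form is the repeated reuse of the accumulator s and of the loop instructions themselves.
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

double m[ROWS][COLS];

/* Good spatial locality: consecutive j values are adjacent in memory,
   so each cache line brought in is used several times before it is
   evicted. */
double sum_by_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Poor spatial locality: successive accesses are COLS * 8 bytes apart,
   so a small cache misses on nearly every one of them. */
double sum_by_columns(void)
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 1.0;
    printf("%f %f\n", sum_by_rows(), sum_by_columns());
    return 0;
}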
Focus on the Common Case
The things that are done a lot should be fast; the things that are
rare may be slow.
Amdahl's Law
Amdahl's Law tells us how much improvement is possible by
making the common case fast, or by parallelizing part of the
algorithm. In the example below, 3/5 of the algorithm can be
parallelized, meaning that three times as much hardware applied to the
problem gains us only a reduction from five time units to three.
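The arithmetic behind that example, as a small C sketch; the fraction p = 3/5 and the processor counts are simply the numbers used above.
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
   fraction of the work that can be parallelized and n is the number
   of processors applied to that fraction. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 3.0 / 5.0;   /* 3 of the 5 time units are parallelizable */

    /* 3 processors: 5 units -> 2 + 3/3 = 3 units, speedup = 5/3 */
    printf("n = 3:         speedup = %.2f\n", amdahl_speedup(p, 3));

    /* Even with unlimited processors the serial 2 units remain,
       so the speedup can never exceed 1 / (1 - p) = 2.5. */
    printf("n -> infinity: speedup <= %.2f\n", 1.0 / (1.0 - p));
    return 0;
}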
Some problems, most famously graphics, are known as "embarrassingly
parallel" problems, in which extracting parallelism is trivial, and
performance is primarily determined by input/output bandwidth and the
number of processing elements available. More generally, the
parallelism achievable is determined by the dependency graph.
Creating that graph and scheduling operations to maximize the
parallelism and enforce correctness is generally the shared
responsibility of the hardware architecture and the compiler.
プロセッサー・パフォーマンス定式
The Processor Performance Equation
CPU time = Seconds / Program
         = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
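As a sketch of how the three factors combine, here is the same equation as a few lines of C; the instruction count, CPI, and clock rate are invented numbers, not measurements from hw1.
#include <stdio.h>

int main(void)
{
    /* Hypothetical values, for illustration only. */
    double instructions = 2.0e9;   /* instructions / program        */
    double cpi          = 1.5;     /* clock cycles / instruction    */
    double clock_rate   = 2.0e9;   /* clock cycles / second (2 GHz) */

    /* CPU time = (instructions/program) x (cycles/instruction)
                x (seconds/cycle)                                   */
    double cpu_time = instructions * cpi * (1.0 / clock_rate);

    printf("CPU time = %.2f seconds\n", cpu_time);   /* 1.50 s */
    return 0;
}
Each factor is set by a different layer of the system: the instruction count mostly by the instruction set and the compiler, the cycles per instruction by the processor organization, and the cycle time by the technology and implementation.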
What's a Computer?
What's in a Computer?
(Here's the fun part...)
Our computer
Let's go visit it!
The source you need, including script files, is available in a
tar file here.
The specification for OpenMP, and a "summary card" for C and C++,
are
available here.
The latest version is 3.1, but there is a Japanese version of the
3.0 spec available. 最新のバージョンは3.1だが、3.0の日本語版はあり
ますよ!
This week's homework (submit via SFS, due 10/12):
- Change the compiler from gcc
to icc, Intel's C compiler. Replot the data,
putting both sets of data (gcc and icc) on the plot. How
much faster does it get? Is the speedup the same for all
problem sizes?
- In architecture/src/qulib/sim.c, you will find the
functions cnot() and Hadamard(). In the
directive
/* XXX parallelizing this loop is tricky, but it's a "big" loop, so worth doing... */
#pragma omp parallel for schedule(static,8192) private(j,k,z)
the number 8192 is the size of the chunk of the large array
that each thread works on (see the sketch after this list). Change
that number to values both smaller and larger in both functions to
see the effects. Save these as separate data sets, and plot them
all together on one plot.
- First, eliminate the "schedule" altogether; try it with
#pragma omp parallel for private(j,k,z)
- Next, try it with schedule(static,16).
- schedule(static,256).
- schedule(static,1024).
- schedule(static,4096).
- schedule(static,16384).
- By now, you should have some idea of what values will work
well. Choose the optimal value for the schedule size for
this application and machine.
- Plot the results with different numbers of threads and the
different compilers in a set of graphs like the ones from last
week. You can put all of the data on one graph (giving you three
graphs, total) or on separate graphs.
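For reference while working on the schedule experiments above, here is a minimal sketch of what the chunk argument to schedule(static,...) controls. The loop length of 64 and the chunk of 8 are arbitrary, chosen small so the output is easy to read; the homework code uses 8192.
#include <omp.h>
#include <stdio.h>

#define N 64

int main(void)
{
    int owner[N];
    int i;

    /* schedule(static,8) hands out iterations in contiguous chunks of 8:
       thread 0 gets iterations 0-7, thread 1 gets 8-15, and so on,
       wrapping around until all N iterations are assigned.  Larger
       chunks mean less scheduling overhead but coarser load balancing;
       smaller chunks spread the work more evenly at a higher cost. */
    #pragma omp parallel for schedule(static,8) private(i)
    for (i = 0; i < N; i++)
        owner[i] = omp_get_thread_num();

    for (i = 0; i < N; i++)
        printf("iteration %2d ran on thread %d\n", i, owner[i]);
    return 0;
}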
Next Lecture
Next lecture:
第3回 10月12日 プロセッサー:命令の基本
Lecture 3, October 12: Processors: Basics of Instruction Sets
以下で、P-Hはコンピュータの構成と設計~ハードウエアとソフトウエアの
インタフェース 第3版、
H-Pはコンピュータアーキテクチャ 定量的アプローチ 第4版.
Below, P-H is Computer Organization and Design: The
Hardware/Software Interface, 3rd edition, and H-P is Computer
Architecture: A Quantitative Approach, 4th edition.
Readings for next time:
- Follow-up from this lecture:
- P-H: Chapter 1
- H-P: Chapter 1.1 - 1.12
- For next time:
Additional Information
その他