Keio University
2011 Fall Semester

Computer Architecture

Fall Semester 2011, Tuesday, 3rd Period
Course code: 35010 / 2 credits
Category:
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 2, October 7:
Faster!

Outline of This Lecture

Contacting Me/Office Hours

If you need to contact me, email is the preferred method. Please put "COMP-ARCH:" in the Subject field of the email. If I do not respond to a query within 24 hours, please resend. For more urgent matters, junsec should know how to get ahold of me.

Office Hours, Fall 2011: Wednesday, 9:00-12:00, Delta N211. You may come to my office during this time without an appointment. If you wish to see me at other times, you can try to find me directly, or send me email to arrange an appointment.

What's a Computer?

What's in a Computer?

(Here's the fun part...)

[Figure: Wikipedia motherboard block diagram]

Our computer

[Photos: the Armstrong server: back panel, front panel, top view, memory, PCI slots, disk]

Quantitative Principles of Design

Last time, we talked about Hennessy & Patterson's Five Principles:

  1. Take Advantage of Parallelism
  2. Principle of Locality
  3. Focus on the Common Case
  4. Amdahl's Law
  5. The Processor Performance Equation
I would add to this one imperative: Achieve Balance.

Take Advantage of Parallelism

Parallelism can be exploited by using multiple processors on different parts of the problem, by using multiple functional units (floating-point units, disk drives, etc.), or by pipelining: dividing an individual instruction into several stages and executing stages of different instructions at the same time in different parts of the CPU.
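
For example, here is a hedged sketch of loop-level data parallelism using OpenMP (which this week's homework also uses); the file name and array are made up for illustration:

    /* parallel_sum.c: a toy example of loop-level parallelism with OpenMP.
       Compile with: gcc -fopenmp parallel_sum.c -o parallel_sum */
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++)
            a[i] = (double)i;

        /* The iterations are independent, so OpenMP can split them
           across threads; reduction(+:sum) combines the per-thread
           partial sums safely at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %g\n", sum);
        return 0;
    }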

Principle of Locality

Programs tend to reuse data and instructions that they have used recently. There are two forms of locality: temporal (an item that was just used is likely to be used again soon) and spatial (items near a recently used item are likely to be used soon). Locality is what allows a cache memory to work.
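
As a small illustrative sketch in C (the array name and size are assumptions, not course code): C stores two-dimensional arrays in row-major order, so traversing row by row touches consecutive addresses and exploits spatial locality, while traversing column by column does not:

    /* locality.c: spatial locality demonstration (sizes and names
       are illustrative).  C stores a[i][j] in row-major order. */
    #define N 1024
    static double a[N][N];

    /* Row-by-row: consecutive addresses, so each cache line fetched
       is fully used before moving on. */
    double sum_rows(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-by-column: jumps N doubles (8 KB) between accesses,
       so nearly every access can miss the cache. */
    double sum_cols(void)
    {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }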

Focus on the Common Case

The things that are done a lot should be fast; the things that are rare may be slow.
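
As a hedged sketch of the idea (a hypothetical copy routine, not from any course code): test for the common case first and keep that path fast, letting the rare case fall back to a slower path:

    #include <stddef.h>
    #include <stdint.h>

    void copy_bytes(void *dst, const void *src, size_t n)
    {
        if (((uintptr_t)dst % sizeof(long)) == 0 &&
            ((uintptr_t)src % sizeof(long)) == 0 &&
            n % sizeof(long) == 0) {
            /* Common case: everything is word-aligned, so copy a
               word at a time. */
            long *d = dst;
            const long *s = src;
            for (size_t i = 0; i < n / sizeof(long); i++)
                d[i] = s[i];
        } else {
            /* Rare case: fall back to a simple byte-by-byte loop. */
            char *d = dst;
            const char *s = src;
            for (size_t i = 0; i < n; i++)
                d[i] = s[i];
        }
    }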

Amdahl's Law

Amdahl's Law tells us how much improvement is possible by making the common case fast, or by parallelizing part of the algorithm. In the example below, 3/5 of the algorithm can be parallelized, meaning that three times as much hardware applied to the problem gains us only a reduction from five time units to three.

[Figure: Example of Amdahl's Law, parallel and serial portions.]
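
Stated as an equation (a standard form of the law; the symbols f and s are introduced here for illustration): if a fraction f of the execution time can be sped up by a factor of s, the overall speedup is

    Speedup = 1 / ((1 - f) + f/s)

For the figure above, f = 3/5 and s = 3, so Speedup = 1 / (2/5 + 1/5) = 5/3: five time units become three, as described. Even with unlimited hardware (s → ∞), the speedup can never exceed 1 / (1 - f) = 2.5, because the serial portion remains.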

Some problems, most famously graphics, are known as "embarrassingly parallel" problems, in which extracting parallelism is trivial, and performance is primarily determined by input/output bandwidth and the number of processing elements available. More generally, the parallelism achievable is determined by the dependency graph. Creating that graph and scheduling operations to maximize the parallelism and enforce correctness is generally the shared responsibility of the hardware architecture and the compiler.

[Figure: Dependency graph for the above figure.]
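
A minimal illustration in C (toy loops, assuming double arrays a, b, c of length n): the first loop's iterations are mutually independent, so they may all run in parallel; the second loop carries a dependency from each iteration to the next, so its dependency graph is a chain and it must run serially:

    /* Independent iterations: the dependency graph has no edges
       between iterations, so all n of them may run in parallel. */
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* Loop-carried dependency: iteration i reads c[i-1], written by
       iteration i-1, so execution is inherently serial. */
    for (i = 1; i < n; i++)
        c[i] = c[i-1] + a[i];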

The Processor Performance Equation

    CPU time = Seconds / Program
             = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
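
As an illustrative worked example (the numbers are invented): a program that executes 10^9 instructions with an average of 1.5 clock cycles per instruction on a 1 GHz clock (10^-9 seconds per cycle) takes

    CPU time = 10^9 instructions × 1.5 cycles/instruction × 10^-9 seconds/cycle = 1.5 seconds

Roughly speaking, the first factor is determined by the algorithm, compiler, and instruction set; the second by the instruction set and the microarchitecture; and the third by the circuit technology and clock rate.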

Homework

Last week, the assignment was to recreate the graphs shown in class. That assignment was probably alarmingly vague; the description on SFC-SFS should be better now.

The source you need, including script files, is available in a tar file here.

n.b.: Some of the parameters in the code have changed from what was online last week!!! Please use this version.

The specification for OpenMP, and a "summary card" for C and C++, are available here. The latest version is 3.1, but there is a Japanese translation of the 3.0 spec available.

This week's homework (submit via SFS, due 10/21):

  1. Change the compiler from gcc to icc, Intel's C compiler. Replot the data, putting both sets of data (gcc and icc) on the plot. How much faster does it get? Is the speedup the same for all problem sizes?
  2. In architecture/src/qulib/sim.c, you will find the functions cnot() and Hadamard(). In the statement
    /* XXX parallelizing this loop is tricky, but it's a "big" loop, so worth doing... */
    #pragma omp parallel for schedule(static,8192) private(j,k,z)
    

    the number 8192 is the chunk size: the number of consecutive loop iterations handed to each thread at a time. Change that number both smaller and larger in both functions to see the effects. Save these as separate data sets, and plot them all together on one plot. (A sketch of how schedule(static, CHUNK) distributes iterations appears after this list.)
    1. First, eliminate the "schedule" altogether; try it with
      #pragma omp parallel for private(j,k,z)
      
    2. Next, try it with schedule(static,16).
    3. schedule(static,256).
    4. schedule(static,1024).
    5. schedule(static,4096).
    6. schedule(static,16384).
    7. By now, you should have some idea of what values will work well. Choose the optimal value for the schedule size for this application and machine.
  3. Read the text for next week.
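
As mentioned in item 2, here is a minimal sketch of how the schedule clause behaves (a toy loop, not the actual sim.c code; CHUNK stands in for the values you will try):

    /* schedule_demo.c: toy illustration of OpenMP static scheduling.
       With schedule(static, CHUNK), the iterations are dealt out to
       the threads round-robin in blocks of CHUNK consecutive
       iterations.  Small chunks spread the work evenly but add
       scheduling overhead and can hurt cache locality; very large
       chunks do the opposite.
       Compile with gcc: gcc -fopenmp schedule_demo.c
       (icc enables OpenMP with its own flag; check its manual.) */
    #include <stdio.h>

    #define N     (1 << 20)
    #define CHUNK 8192          /* try 16, 256, 1024, 4096, 16384, ... */

    static double v[N];

    int main(void)
    {
        int i;

        #pragma omp parallel for schedule(static, CHUNK)
        for (i = 0; i < N; i++)
            v[i] = 2.0 * i;     /* stand-in for the real loop body */

        printf("v[N-1] = %g\n", v[N - 1]);
        return 0;
    }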

Next Lecture

Next lecture:

Lecture 3, October 14: Processors: Basics of Instruction Sets

Below, P-H is Computer Organization and Design: The Hardware/Software Interface, 3rd Edition, and H-P is Computer Architecture: A Quantitative Approach, 4th Edition.

Readings for next time:

Additional Information
