Keio University
Spring Semester, 2020
Computer Architecture
Lecture 4, June 22:
Fastest!
Outline of This Lecture
- More on Parallelism
- Amdahl's Law and Dependency Graphs Revisited
- Gustafson-Barsis Law
- Reduction
- Synchronization Barriers
- Processor Performance Equation
- Basic Data Types
- Bitfields
- Instruction Sets
- The basic idea
- Basic parts of a CPU
- Memory types and uses
- Instruction set classes
- Types of instructions
- Memory addressing
- What an instruction looks like
- Homework
Amdahl's Law and Dependency Graphs Revisited
Amdahl's Law
The parallelism achievable is determined by the dependency
graph. Creating that graph and scheduling operations to maximize
the parallelism and enforce correctness is generally the shared
responsibility of the hardware architecture and the compiler.
Let's look at it mathematically. If a fraction P of the work can be
parallelized and the remaining fraction 1 - P is inherently serial,
the speedup on N processors is

    Speedup(N) = 1 / ((1 - P) + P/N)

Question: What is the limit of this as N goes to infinity?
See the
description of Amdahl's Law on Wikipedia.
Amdahl's Law can also be applied to serial problems. An
example adapted from Wikipedia:
If your car is traveling at 50 km/h, and you want to travel 100 km, how
long will it take?
After one hour, your car speeds up to 100km/h. What is your
average speed? If your car becomes infinitely fast, what is the
average speed? More importantly, what's the minimum time for the
complete trip?
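Working the numbers (the first hour plays the role of the serial
fraction, so the answers follow directly):

    total time    = 1 h + (50 km / 100 km/h) = 1.5 h
    average speed = 100 km / 1.5 h ≈ 66.7 km/h

With an infinitely fast car after the first hour, the total time
approaches 1 hour and the average speed approaches 100 km/h. The
minimum time for the complete trip is therefore 1 hour: no speedup of
the second half can recover the hour already spent, just as no number
of processors can eliminate the serial fraction of a program.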
Gustafson-Barsis Law
Now go back to the example above. In practice, when your car gets
faster, it becomes possible for you to go farther.
For the first hour, your car runs at 50km/h. After one hour, your
car speeds up to 100km/h. What's the limit of your average speed
if you lengthen your trip?
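A quick check: if the whole trip lasts T hours, the average speed is

    v(T) = (50 km + 100 km/h × (T - 1 h)) / T

which approaches 100 km/h as T grows. The longer the trip, the less
the fixed first hour matters.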
Gustafson's Law (or the Gustafson-Barsis Law) basically says that
parallelism gives you the freedom to make your problem
bigger. 25 years ago, we thought that 100,000 processors or
1,000,000 processors was ridiculous, because Amdahl's Law limited
their use. Today, systems in that size range are increasingly
common, and it's because of Gustafson-Barsis.
See Gustafson's
Law on Wikipedia.
The fundamental observation is this:
[I]n practice, the solvable problem size scales with the number of
processors.
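In equation form: if P is the parallelizable fraction of the scaled
workload, the scaled speedup on N processors is

    Speedup(N) = (1 - P) + P × N

which, unlike Amdahl's fixed-size speedup, grows without bound as N
grows.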
Reduction
A reduction combines the elements of an array into a single value
(here, a sum) using an associative operator; because the operator is
associative, partial sums can be computed in parallel and then
combined. In OpenMP, this can be achieved via something like
#pragma omp parallel for reduction(+:result)
for (i = 0; i < n; i++) {
    result += array[i];
}
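A complete, runnable version of the same idea (a minimal sketch; the
array contents and size are arbitrary, and the program builds with
gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    int array[N];
    int result = 0;
    int i;

    for (i = 0; i < N; i++)
        array[i] = i + 1;               /* sum will be 1 + 2 + ... + N */

    /* Each thread accumulates a private partial sum; OpenMP combines
       the partial sums with + when the loop finishes. */
    #pragma omp parallel for reduction(+:result)
    for (i = 0; i < N; i++)
        result += array[i];

    printf("sum = %d\n", result);       /* expect N*(N+1)/2 = 500500 */
    return 0;
}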
Synchronization Barriers
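A synchronization barrier is a point in a parallel program that no
thread may pass until every thread has reached it; it is the basic
tool for separating phases of a computation. A minimal OpenMP sketch
(the print order within each phase is up to the scheduler, but no
"after" line can appear before all "before" lines):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("thread %d: before the barrier\n", id);

        /* No thread continues past this point until all arrive. */
        #pragma omp barrier

        printf("thread %d: after the barrier\n", id);
    }
    return 0;
}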
Basic Data Types
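The C99 fixed-width integer types make the basic sizes explicit; a
small sketch:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* C99 fixed-width types pin down the sizes exactly. */
    printf("int8_t : %zu bits\n", 8 * sizeof(int8_t));
    printf("int16_t: %zu bits\n", 8 * sizeof(int16_t));
    printf("int32_t: %zu bits\n", 8 * sizeof(int32_t));
    printf("int64_t: %zu bits\n", 8 * sizeof(int64_t));
    return 0;
}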
Bitfields
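One common use of bitfields in C is packing several small values into
a single word; the struct below is a hypothetical device-status
register, invented for illustration:

#include <stdio.h>

/* Hypothetical device-status register packed into one word. */
struct status {
    unsigned ready   : 1;   /*  1 bit                  */
    unsigned error   : 1;   /*  1 bit                  */
    unsigned channel : 4;   /*  4 bits: values 0..15   */
    unsigned count   : 10;  /* 10 bits: values 0..1023 */
};

int main(void) {
    struct status s = { .ready = 1, .error = 0, .channel = 7, .count = 42 };
    printf("ready=%u channel=%u count=%u\n",
           (unsigned)s.ready, (unsigned)s.channel, (unsigned)s.count);
    printf("packed size: %zu bytes\n", sizeof(struct status));
    return 0;
}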
MIPS Registers
MIPS has 32 general-purpose registers, numbered $0 through $31, with
conventional names and uses ($zero is hard-wired to 0, $sp is the
stack pointer, $ra holds the return address, and so on).
Review: The Processor Performance Equation
CPU time = seconds / program
         = (instructions / program)
         × (clock cycles / instruction)
         × (seconds / clock cycle)
Instruction Sets
- CPUs execute instructions (命令)
- The data for those instructions comes from memory
(stack or heap) or registers
- An instruction includes an opcode
- Instructions may be ALU, data movement,
or control flow instructions
Instructions: the Basic Idea
Computers execute instructions, which are usually produced by a
compiler, a piece of software that translates human-readable (usually
ASCII) source code into computer-readable binary. For example:
LOAD R1, A
ADD R1, R3
STORE R1, A
This example shows three instructions, to be executed sequentially.
The first instruction LOADs a value into register R1
from memory (we will come back to how the value that is loaded into R1
is found in a minute). The second instruction ADDs the
contents of register R3 into register R1, then the third
instruction STOREs the result into the original memory
location.
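In C-like source, this three-instruction sequence might come from a
single statement (assuming A is a variable in memory and x is a value
the compiler is keeping in register R3; the names are illustrative):

    A = A + x;   /* LOAD A, ADD the register, STORE the result back */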
CPU: the Central Processing Unit
In somewhat abstract terms, a CPU has parts with the following
functions:
- Instruction fetcher: reads instructions from memory.
- Instruction decoder: determines what type of instruction is being
executed, and whether data must be fetched from memory.
- Memory interface: reads and writes data in memory on behalf of the
instructions.
- Registers: the memory inside the CPU itself.
- ALU: the Arithmetic and Logic Unit actually executes, as the name
says, arithmetic and logical instructions.
Memory: Registers, Stacks, and Heaps
- Registers: Special memory inside the CPU chip. There are
typically only a few registers in a CPU. They are fast, but
expensive.
- Main Memory: Random Access Memory (RAM) is the
largest amount of memory in your system; a laptop today typically has
several gigabytes. RAM is typically used in several ways:
- Stack: Also called a push-down stack, this area
of memory is used to keep values used as local variables
by functions in the program.
- Heap: Memory used for global variables and for data the
program allocates dynamically (e.g., with malloc()).
- Binary/text segment: The program itself.
It is the job of the compiler to decide how to use the registers,
stack, and heap most efficiently. Note that these functions apply to
both user programs, or applications, and
the operating system kernel.
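A minimal sketch of how one C program touches each of these areas
(the variable names are illustrative):

#include <stdio.h>
#include <stdlib.h>

int counter = 0;                  /* global data, kept for the whole run */

void increment(void) {
    int temp = counter + 1;       /* local variable: lives on the stack  */
    counter = temp;
}

int main(void) {                  /* the code itself: text segment       */
    int *buffer = malloc(100 * sizeof(int));   /* dynamic allocation     */
    if (buffer == NULL)
        return 1;
    increment();
    buffer[0] = counter;
    printf("%d\n", buffer[0]);
    free(buffer);
    return 0;
}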
Types of Instructions
- ALU
- Integer arithmetic
- Bitfield and logical operations
- Floating point arithmetic (often handled in a separate part of
the computer known as an FPU, or floating point
unit)
- Data load/store, stack push/pop
- Control flow
- Unconditional branch
- Conditional branch
- Function call/return
Classes of Instruction Sets
- Stack architecture
- Accumulator architecture
- General-purpose register architecture
- Register-memory
- Load-store
The diagram below shows how data flows in the CPU, depending on the
class of instruction set. (TOS = Top of Stack)
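For example, the statement C = A + B might compile as follows in each
class (a sketch in generic assembly, in the style of the examples
above; real mnemonics vary by architecture):

Stack:            PUSH A ; PUSH B ; ADD ; POP C
Accumulator:      LOAD A ; ADD B ; STORE C
Register-memory:  LOAD R1, A ; ADD R3, R1, B ; STORE R3, C
Load-store:       LOAD R1, A ; LOAD R2, B ; ADD R3, R1, R2 ; STORE R3, C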
Memory Addressing
Each operand of an instruction must be fetched before the instruction
can be executed. Data may come from
- Immediate or literal data (limited to less than a
full word)
- Registers
- Register indirect
- Register displacement
(There are other addressing modes, as well, which we will not
discuss.)
Immediate           ADD R4, #3        Regs[R4] ← Regs[R4] + 3
Register            ADD R4, R3        Regs[R4] ← Regs[R4] + Regs[R3]
Register indirect   ADD R4, (R1)      Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Displacement        ADD R4, 100(R1)   Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
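These modes correspond roughly to familiar C expressions (a loose
analogy; the displacement line assumes 4-byte ints, so an offset of
100 bytes is element 25):

x = x + 3;       /* immediate: the constant is part of the instruction */
x = x + y;       /* register: both operands already in registers       */
x = x + *p;      /* register indirect: p holds the operand's address   */
x = x + p[25];   /* displacement: address = 100 + the base address in p */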
Depending on the instruction, the data may be one of several sizes
(using common modern terminology):
- A byte (8 bits, today)
- A half word (16 bits)
- A word (32 bits)
- A double word (64 bits)
What an Instruction Looks Like
An instruction must contain the following:
- An opcode, the operation code that identifies the instruction.
- Address type for zero or more arguments (sometimes,
implicit in the instruction).
- Addresses (or immediate data) for zero or more arguments.
Some architectures always use the same number of arguments, others use
variable numbers. In some architectures, the addressing information
and address are always the same length; in others they are
variable.
In general, the arithmetic instructions are either two address
or three address. Two-address operations modify one of the
operands, e.g.
ADD R1, R3 ; R1 = R1 + R3
whereas three-address operations specify a separate result register,
e.g.
ADD R1, R2, R3 ; R3 = R1 + R2
(n.b.: in some assembly languages, the target is specified first; in
others, it is specified last.)
The MIPS architecture, which grew out of a Stanford research project
led by Professor Hennessy, is relatively easy to understand. Its
instructions are always 32 bits, of which 6 bits are the opcode
(giving a maximum of 64 opcodes). rs and rt are
the source and target registers, respectively. (Those fields are five
bits; how many registers can the architecture support?) Instructions
take one of three forms:

R-type: opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
I-type: opcode (6) | rs (5) | rt (5) | immediate (16)
J-type: opcode (6) | address (26)
Homework
See the PDF file.
Install the MIPS simulator software on your laptop. There are
several versions available. You can get Jim Larus's canonical version
from SourceForge.
I recommend you try the MMIPSS simulator for Android; a tutorial for
it is available online.
There is also math homework; please check the PDF file above!
Next Lecture
Next lecture:
Lecture 5: Experimental Parallelism
Additional Information