慶應義塾大学
2007年度秋学期

コンピューター・アーキテクチャ
Computer Architecture

2007年度秋学期　月曜日3時限
科目コード: XXX / 2単位
カテゴリ:
開講場所：SFC
授業形態：講義
担当: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Homework Solutions

Note: these are totally complete yet, but will be shortly. Not all of the problems have exact solutions, they will depend on your choices and on the computer you try to run them on.

試験は英語と日本語両方で書いてある。

By Lecture

第1回 10月1日
Lecture 1, October 1: Fundamentals of Computer Design
第2回 10月15日
Lecture 2, October 15: Processors: Basics of Instruction Sets
第3回 10月22日
Lecture 3, October 22: Processors: Arithmetic
第4回 11月5日
Lecture 4, November 5: Basics of Pipelining
第5回 11月12日
Lecture 5, November 12: Memory: Caching and Memory Hierarchy
第6回 11月19日
Lecture 6, November 19: Memory: Virtual Memory
第7回 12月3日
Lecture 7, December 3: Systems: Shared-Memory Multiprocessors
第8回 12月10日
Lecture 8, December 10: Systems: Distributed-Memory Multiprocessors
第9回 12月17日
Lecture 9, December 17: I/O Systems and Supercomputing Architectures
第10回 1月7日
Lecture 10, January 7: Systems: Multi-Core Chips
第11回 1月18日 金曜日！ 入出力
Lecture 11, January 18 (n.b.: Friday!): Basics of I/O and
第12回 1月21日
Lecture 12, January 21: RAID: Parallelism and

Lecture 1

Demonstrate that you have a working compiler setup where you will be able to write programs for class. Capture the output of the compilation and execution of a simple program (such as "Hello, world") and email it to me. Include the amount of time it took you to complete this exercise.
```
#include <stdio.h>

main()
{
	printf("Hello, world\n");
}
```

Determine all of the information that you can about the hardware of your own PC:

CPU manufacturer, type, and clock speed.

System dependent, but as an example:

Mine is a 2GHz Intel Core 2 Duo, with two processors in the chip.

From my Linux machine:

[rdv@localhost ~]$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping        : 6
cpu MHz         : 1000.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3994.36
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping        : 6
cpu MHz         : 1000.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3990.09
clflush size    : 64

Type, amount, and speed of memory.

My system has 1GB of DRAM (will have to check the type...)

Size of cache memories.

(from dmesg on a Linux machine, at startup)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0

and

CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1

I believe that the L2 cache is shared.

Type, size, and speed of disk (and its I/O interface).

Mine is a Hitachi 160GB serial ATA (SATA) drive.

ata1: SATA max UDMA/133 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x000118b0 irq 14
ata2: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x000118b8 irq 15
usb 5-1: configuration #1 chosen from 1 choice
ata1.00: ATA-7: Hitachi HTS541616J9SA00, SB4OC70P, max UDMA/100
ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (not used)
ata1.00: configured for UDMA/100
scsi 0:0:0:0: Direct-Access     ATA      Hitachi HTS54161 SB4O PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or 
FUA
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or 
FUA

Can you find out the chip set type?

Determine the exact version of the operating system and compiler you are using.

[rdv@localhost Desktop]$ uname -a
Linux localhost.localdomain 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:59 EST 2007 i686 i686 i386 GNU/Linux
[rdv@localhost Desktop]$ gcc --version
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

If you have additional thoughts about the above discussion of programming experience, please email me.
Read the text for next week.

All problems are system-dependent, there is no "correct" answer.

Lecture 2

Take your "hello, world" program from last time and compile to assembly code.
System dependent, but as an example:
```
	.file	"hello.c"
	.section	.rodata
.LC0:
	.string	"Hello, world"
	.text
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$4, %esp
	movl	$.LC0, (%esp)
	call	puts
	addl	$4, %esp
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.ident	"GCC: (GNU) 4.1.2 20070925 (Red Hat 4.1.2-27)"
	.section	.note.GNU-stack,"",@progbits
```
1. What type of processor is your system?
  System dependent, but as an example:
  
  Mine is a 2GHz Intel Core 2 Duo, with two processors in the chip.
  
  See above.
2. Find a copy of the Instruction Set Reference Manual for your processor (it should be available online somewhere).
  Intel's processor manuals are available here.
3. What class of instruction set is your processor, load-store, register-memory, or accumulator? (It's unlikely to be a pure stack machine.)
  
  x86-based (also called IA-32) architectures, as discussed on page J-45 of Appendix J of the textbook, are extended accumulator instruction set machines. The original 8080 was a straight accumulator instruction set, but over the years many registers have been added. Unfortunately, they are not completely interchangeable; some instructions can use operands in only some of the registers.
4. Determine if your stack grows up or down. (You may need to use a debugger for this.)
  Stacks on most modern machines, including x86 architectures, grow down.
```
int b(int x)
{
	return x+2;
};

void a(int x)
{
	b(x);
}


main()
{
	a(14);
}
```
```
[rdv@localhost architecture]$ gdb stack
GNU gdb Red Hat Linux (6.6-40.fc8rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) start
Breakpoint 1 at 0x80483c3: file stack.c, line 14.
Starting program: /home/rdv/keio/sfc/teaching/architecture/stack 

warning: Missing the separate debug info file: /usr/lib/debug/.build-id/ac/2eeb206486bb7315d6ac4cd64de0cb50838ff6.debug

warning: Missing the separate debug info file: /usr/lib/debug/.build-id/ba/4ea1118691c826426e9410cafb798f25cefad5.debug
main () at stack.c:14
14              a(14);
(gdb) p $sp
$1 = (void *) 0xbf9c02a0
(gdb) step
a (x=14) at stack.c:8
8               b(x);
(gdb) p x
$2 = 14
(gdb) p &x
$3 = (int *) 0xbf9c02a0
(gdb) p $sp
$4 = (void *) 0xbf9c0294
(gdb) step
b (x=14) at stack.c:3
3               return x+2;
(gdb) p x
$5 = 14
(gdb) p &x
$6 = (int *) 0xbf9c0294
(gdb) 
```
5. How many different opcodes are there for your processor?
  
  Depending on the exact form, "hundreds" for the x86.
  
  How long is the opcode field in an instruction?
  
  Again, in x86, variable. Some are only 8 bits, others are wider.
6. How many different addressing modes does your processor have?
  Again from Appendix J,
  - absolute
  - register indirect
  - based
  - indexed
  - based indexed with displacement
  - based with scaled indexed
  - based with scaled indexed and displacement
  You can see, from the large number of opcodes, restrictions on use of registers, numerous addressing modes, and many, many variants of the basic architecture, that writing a good compiler for an IA-32 machine is very hard!
Read the text for next week.

Lecture 3

This week's homework (submit via email):

Tell me what decimal values the following floating point numbers represent:
Remember this image from the lecture: These values can be played with directly inside a debugger, such as gdb:
```
(gdb) list
3	main()
4	{
5	  float a = 1.1, b = 1.2;
6	  int i;
7	
8	  for ( i = 0 ; i < 100000000 ; i++ ) {
9	    a = expf(1.0);
10	    a*=b;
11	    b*=a;
12	    a*=b;
(gdb) ptype a
type = float
(gdb) whatis a
type = float
(gdb) x/wt &a
0xbf98f228:	01000000010010010000111111010000
(gdb) set var a=3.1415926
(gdb) x/wt &a
0xbf98f228:	01000000010010010000111111011010
(gdb) p a
$3 = 3.1415925
(gdb) 
```
1. 00000000000000000000000000000000
  0.0
2. 00111111100000000000000000000000
  1.0
  sign (符号部) (one bit): 0
  significand (仮数部) (twenty-three bits): 0b00000000000000000000000 = 1.0....(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
  exponent (指数部) (eight bits): 0b01111111 (バイアスは127で覚えて)(remember the bias of 127)
3. 10111111100000000000000000000000
  -1.0
  sign (符号部) (one bit): 1
  significand (仮数部) (twenty-three bits): 0b00000000000000000000000 = 1.0....(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
  exponent (指数部) (eight bits): 0b01111111 (バイアスは127で覚えて)(remember the bias of 127)
4. 00111111000000000000000000000000
  0.5
  sign (符号部) (one bit): 1
  significand (仮数部) (twenty-three bits): 0b00000000000000000000000 = 1.0....(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
  exponent (指数部) (eight bits): 0b01111110 (バイアスは 127で覚えて)(remember the bias of 127)
5. 01000000000000000000000000000000
  2.0
  sign (符号部) (one bit): 0
  significand (仮数部) (twenty-three bits): 0b10000000000000000000000=1.0...(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
  exponent (指数部) (eight bits): 0b10000000 (バイアスは127で覚えて)(remember the bias of 127)
6. 01000000010000000000000000000000
  3.0
  sign (符号部) (one bit): 0
  significand (仮数部) (twenty-three bits): 0b00000000000000000000000=1.10...(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
  exponent (指数部) (eight bits): 0b10000000 (バイアスは127で覚えて)(remember the bias of 127)
7. 01000000010000000000000000000001
  3.0+2^(-23)
```
(gdb) set var b = 3.0000002
(gdb) x/wt &b
0xbf98f22c:	01000000010000000000000000000001
```
8. 01000000010010010000111111011010
  3.1415926

Write a program to test the speed of your processor on integer arithmetic. Test the four basic functions: add, subtract, multiply, divide. If you are using a Unix-like machine (or Cygwin), you may use the "time" command.

Here is an example for integer addition:

[rdv@dhcp-143-221 architecture]$ cat int.c
main()
{
  int a = 7, b = 2, c = 5, i;

  for ( i = 0 ; i < 100000000 ; i++ ) {
    c += a;
    a += b;
    b += c;
    a += a;
    a += a;
    b += b;
    c += a;
    a += b;
    b += a;
    c += a;
    c += a;
    a += b;
    b += c;
    a += a;
    a += a;
    b += b;
    c += a;
    a += b;
    b += a;
    c += a;
    a = 7;
    b = 2;
    c = 5;
  }
}

    
[rdv@dhcp-143-221 architecture]$ gcc -o int int.c
[rdv@dhcp-143-221 architecture]$ time ./int

real    0m3.839s
user    0m3.828s
sys     0m0.010s
[rdv@dhcp-143-221 architecture]$ time ./int

real    0m3.859s
user    0m3.846s
sys     0m0.011s
[rdv@dhcp-143-221 architecture]$ time ./int

real    0m3.815s
user    0m3.809s
sys     0m0.003s

The inner loop is ten instructions, and the loop is executed 100,000,000 times, so it runs 2 billion integer additions. 2E9 / 3.8 seconds = 525M integer additions/second. On a 2GHz processor, that is one result every 3.8 clock cycles.

Similar programs are easily written for the other operations.

Write a program to test the speed of your processor on floating-point arithmetic. Test the four basic functions: add, subtract, multiply, divide.

[rdv@dhcp-143-221 architecture]$ cat fp.c
#include <math.h>

main()
{
  float a = 1.1, b = 1.2;
  int i;

  for ( i = 0 ; i < 100000000 ; i++ ) {
    a*=b;
    b*=a;
    a*=b;
    b*=a;
    a*=b;
    b*=a;
    a*=b;
    b*=a;
    a*=b;
    b*=a;
    a = 1.1; b = 1.2;
  }
}
[rdv@dhcp-143-221 architecture]$ gcc -o fp fp.c
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m3.026s
user    0m3.016s
sys     0m0.009s
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m3.070s
user    0m3.059s
sys     0m0.011s
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m2.938s
user    0m2.936s
sys     0m0.002s
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m2.917s
user    0m2.907s
sys     0m0.010s

The inner loop is ten instructions, and the loop is executed 100,000,000 times, so it runs one billion floating point multiplies. One billion FP multiplies in 3 seconds is 3.3E8 FP multiplies/second, or one every 6.6 clock cycles with a 2.0GHz processor.

However, there are a bunch of data dependencies in that code: if we double the number of FP multiplies, the amount of time less than doubles.

[rdv@dhcp-143-221 architecture]$ cat fp2.c
#include <math.h>

main()
{
        float a = 1.1, b = 1.2, c = 1.3, d = 1.4;
  int i;

  for ( i = 0 ; i < 100000000 ; i++ ) {
    a*=b;
    c*=d;
    b*=a;
    d*=c;
    a*=b;
    c*=d;
    b*=a;
    d*=c;
    a*=b;
    c*=d;
    b*=a;
    d*=c;
    a*=b;
    c*=d;
    b*=a;
    d*=c;
    a*=b;
    c*=d;
    b*=a;
    d*=c;
    a = 1.1; b = 1.2; c = 1.3; d = 1.4;
  }
}
[rdv@dhcp-143-221 architecture]$ gcc -o fp2 fp2.c
[rdv@dhcp-143-221 architecture]$ time ./fp2

real    0m5.073s
user    0m5.068s
sys     0m0.002s
[rdv@dhcp-143-221 architecture]$ time ./fp2

real    0m5.142s
user    0m5.139s
sys     0m0.001s
[rdv@dhcp-143-221 architecture]$ time ./fp2

real    0m5.085s
user    0m5.079s
sys     0m0.002s

In that example, 2 billion instructions are executed in 5 seconds, for 4E8 FP multiplies/second, one every 5 clock cycles.

Test the accuracy and precision of floating point arithmetic on your processor.

Notice the line from the example above:

(gdb) set var a=3.1415926
(gdb) x/wt &a
0xbf98f228:	01000000010010010000111111011010
(gdb) p a
$3 = 3.1415925

The last digit is different. That suggests that there is a problem, but at the point it is not identified. It might be in the routine that converts floating point binary to human-readable ASCII.

You can see from the code and output below that it is possible for floating point operations to become inaccurate after only a few dozen divisions, or additions of small numbers to large ones.

[rdv@dhcp-143-221 architecture]$ cat double-double-double.cpp 
// first, the standard libraries
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <set>
#include <map>
#include <cmath>
#include <iterator>
#include <vector>

#define FALSE 0
#define TRUE 1

main()
{
  float f = 1.0;
  double a = 1.0;
  long double b = 1.0;
  long double c = 1.0;

  int i;
  int reported, reported2;

  std::cout.precision(50);
  std::cout << "float (" << sizeof(f)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    float fprime = f/2;
    // n.b.: where this reports will depend on the architecture, the compiler,
    // and the compiler optimization switches!  Might optimize to 1.0-f^2,
    // in which case the question is, when does f^2 == 0
    // but an expression similar to this likely occurs when manipulating
    // fidelities in a quantum simulation, since the fidelities sum to 1.0.
    if (!reported && (1.0-f)*f == f) {
      std::cout << "After " << i << " iterations, ((1.0-f)*f == f) at f = " << f << std::endl;
      reported = TRUE;
    }
    // try f^2 directly.
    // Note that on many machines this will *never* trip because
    // the CPU internal floating format is double, and we'll run out of precision
    // on the divide by two first
    if (!reported2 && f*f == 0) {
      std::cout << "f^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (f != 2*fprime) {
      std::cout << "After " << i << " iterations: " << f << std::endl;
      break;
    }
    f = fprime;
    i++;
  }
  std::cout << "double (" << sizeof(a)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    double aprime = a/2;
    if (!reported && (1.0-a)*a == a) {
      std::cout << "After " << i << " iterations, ((1.0-a)*a == a) at a = " << a << std::endl;
      reported = TRUE;
    }
    // try a^2 directly
    if (!reported2 && a*a == 0) {
      std::cout << "a^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (a != 2*aprime) {
      std::cout << "After " << i << " iterations: " << a << std::endl;
      break;
    }
    a = aprime;
    i++;
  }
  std::cout << "long double (" << sizeof(b)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    long double bprime = b/2;
    if (!reported && (1.0-b)*b == b) {
      std::cout << "After " << i << " iterations, ((1.0-b)*b == b) at b = " << b << std::endl;
      reported = TRUE;
    }
    // try b^2 directly
    if (!reported2 && b*b == 0) {
      std::cout << "b^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (b != 2*bprime) {
      std::cout << "After " << i << " iterations: " << b << std::endl;
      std::cout << b << std::endl;
      break;
    }
    b = bprime;
    i++;
  }
  std::cout << "long double (" << sizeof(c)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    long double cprime = c/2;
    if (!reported && (1.0-c)*c == c) {
      std::cout << "After " << i << " iterations, ((1.0-c)*c == c) at c = " << c << std::endl;
      reported = TRUE;
    }
    // try c^2 directly
    if (!reported2 && c*c == 0) {
      std::cout << "c^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (c != 2*cprime) {
      std::cout << "After " << i << " iterations: " << c << std::endl;
      std::cout << c << std::endl;
      break;
    }
    c = cprime;
    i++;
  }
}
[rdv@dhcp-143-221 architecture]$ g++ -o double-double-double double-double-double.cpp 
[rdv@dhcp-143-221 architecture]$ ./double-double-double 
float (32bits):
After 65 iterations, ((1.0-f)*f == f) at f = 2.710505431213761085018632002174854278564453125e-20
After 149 iterations: 1.4012984643248170709237295832899161312802619418765e-45
double (64bits):
After 65 iterations, ((1.0-a)*a == a) at a = 2.710505431213761085018632002174854278564453125e-20
After 1074 iterations: 4.9406564584124654417656879286822137236505980261432e-324
long double (96bits):
After 65 iterations, ((1.0-b)*b == b) at b = 2.710505431213761085018632002174854278564453125e-20
b^2 becomes zero at 8223 iterations
After 16445 iterations: 3.6451995318824746025284059336194198163990508156936e-4951
3.6451995318824746025284059336194198163990508156936e-4951
long double (96bits):
After 65 iterations, ((1.0-c)*c == c) at c = 2.710505431213761085018632002174854278564453125e-20
c^2 becomes zero at 8223 iterations
After 16445 iterations: 3.6451995318824746025284059336194198163990508156936e-4951
3.6451995318824746025284059336194198163990508156936e-4951

Lecture 4

Find and describe a real-world pipeline. Include:
Choice dependent
1. The number of stages
2. Functionality of each stage
3. Interlocking between stages
4. Any hazards
5. How balance in execution time is maintained
We noted right at the end of class that pipeline hazards equate to arrows flowing right to left on the figure above. Identify the arrows on the diagram above by type and indicate the maximum delay that the hazard can cause.
The upper reverse arrow is a control hazard, and the lower one is a data hazard. Both delay the pipeline four clock cycles.
The three pipeline programs we "executed" during class today are linked to below. Calculate the following for each:
1. The number of instructions that must be executed. Don't forget to account for the loop in program 3. (n.b.: the #-28 in the branch is decimal!)
2. The number of clock cycles the entire program takes, accounting for data and control hazards.
3. The average clock cycles per instruction (CPI) for the three programs.
```
First program:
10 instructions
no hazards
time = 10 clock cycles + 4 for pipeline delay = 14 clock cycles
CPI = 14/10 = 1.4
```
Second program:

The hazards have to be worked out carefully, but the normal delay caused by the data hazard is two clock cycles.
```
10 instructions
one data hazard, instruction 0x814, +2 clock cycles
time = 14+2 = 16 clock cycles
CPI = 16/10 = 1.6
```
Third program:

As with data hazards, the normal delay caused by the control hazard is two clock cycles.

n.b.: There is a bug in my program! The loop is seven instructions long (from the MUL at 0x814 to the BEQZ at 0x82C). It was supposed to be executed three times (with R1=2, 1, 0) but the instruction should have been a BNEZ instead of BEQZ. If you used either, you will get full credit.

Taking the branch:
```
5+3*7+2 = 28 instructions
one control hazard, instruction 0x82C, +2 clock cycles
per taken loop (twice!)
time = 28 instructions issued + 2 stalls * 2 cycles + 4 for pipeline
      latency = 36 clock cycles
CPI = 36/28 = 1.29
```
Not taking the branch:
```
5+1*7+2 = 14 instructions
no control hazards
time = 14 instructions issued + 0 stalls * 2 cycles + 4 for pipeline
      latency = 18 clock cycles
CPI = 18/14 = 1.29
```

Lecture 5

Clearly we need more practice with hexadecimal notation and bit fields. Assuming the above example of the AMD Opteron processor, with 40-bit physical addresses, identify the tag, index, and block offset of the following addresses:

0x1000

0x1000 =

0000

0001

0000

tag (25 bits):

0000

0___

____

= 0x0

index (9 bits):

____

_001

0000

00__

____

= 0x40

block offset (6 bits):

____

__00

0000

= 0x0

0x1004

0x1004 =

0000

0001

0000

0100

tag (25 bits):

0000

0___

____

= 0x0

index (9 bits):

____

_001

0000

00__

____

= 0x40

block offset (6 bits):

____

__00

0100

= 0x04

0x1044

0x1044 =

0000

0001

0000

0100

tag (25 bits):

0000

0___

____

= 0x0

index (9 bits):

____

_001

0000

01__

____

= 0x41

block offset (6 bits):

____

__00

0100

= 0x04

0x1049

0x1049 =

0000

0001

0000

0100

1001

tag (25 bits):

0000

0___

____

= 0x0

index (9 bits):

____

_001

0000

01__

____

= 0x41

block offset (6 bits):

____

__00

1001

= 0x09

0x81231000

0x81231000 =

0000

1000

0001

0010

0011

0001

0000

tag (25 bits):

0000

1000

0001

0010

0011

0___

____

= 0x10246

index (9 bits):

____

_001

0000

00__

____

= 0x40

block offset (6 bits):

____

__00

0000

= 0x0

0xA1239004

0xA1239004 =

0000

1010

0001

0010

0011

1001

0000

0100

tag (25 bits):

0000

1010

0001

0010

0011

1___

____

= 0x14247

index (9 bits):

____

_001

0000

00__

____

= 0x40

block offset (6 bits):

____

__00

0100

= 0x04

0xA1239913

0xA1239913 =

0000

1010

0001

0010

0011

1001

0001

0011

tag (25 bits):

0000

1010

0001

0010

0011

1___

____

= 0x14247

index (9 bits):

____

_001

1001

00__

____

= 0x40

block offset (6 bits):

____

__01

0011

= 0x13

Assume a system with a 1GHz system clock, only one level of cache, and that L1 access is 1 clock cycle and main memory is 50 clock cycles.
1. Plot the average memory access time as a function of hit rate.
```
gnuplot> load "avg-mem-access-time.gp"
gnuplot> 
[1]+  Stopped                 gnuplot
[rdv@new-host architecture]$ cat avg-mem-access-time.gp
reset
clear
set xlabel "Cache Hit Rate"
set ylabel "nsec"
set title "Average Memory Access Time"
set term post eps mono "Helvetica" 24
set out "avg-mem-access-time.eps"
plot [0:1.0] 1+(1-x)*49 notitle
```
2. Plot the expected completion time for one billion instructions as a function of cache hit rate.
  Same graph, but multiplied by one billion.

Lecture 6

Write down the equation for calculating the average memory access time for a two-level cache.
If the equation for one level of cache is

Avg. memory access time = Hit time +
Miss rate × Miss penalty

then the equation for two levels must be something like

Avg. memory access time = (L1 hit rate) × (L1 Hit time) +
(L2 hit rate) × (L2 Hit time) +
(Miss rate) × (main memory latency)
Following the example from our lecture on pipelining, suggest a modification to the five-stage pipeline to accommodate the memory architecture above.
I'm afraid this question is not very well written. My apologies. It's not even clear which "above" memory architecture I meant.
1. How many stages must the pipeline have to support the L1 cache?
2. When data must be retrieved from L2 cache, how many cycles will the processor stall?
3. Using one of the examples from Fig. 5.28 from the text, how many cycles will the processor stall to retrieve data from main memory?
Suggest an alternative to waiting for main memory. Can the processor achieve useful work while waiting?
This is the area of much computer architecture research. Ideas include multi-threaded architectures, out-of-order execution, etc.

Lecture 7

Plot the expected speedup for a parallel system as the number of processors increases. Assume the workload is 1 unit of unparallelizable work, followed by 1000 units of parallelizable work.
Note that this example is not very realistic; it's rare to get an application that scales this simply.
1. First, plot the absolute speedup as 1 to 1000 processors are used. I recommend a log-log plot.
  Although I recommended a log-log plot, on second thought, it looks better on a linear plot.
  If there is only one processor, 1001 units of time will be required. With more than one, 1+1000/x time units are required.
```
gnuplot> plot [1:1000] 1001/(1+1000/x)
```
2. Next, plot the efficiency, as a percentage: the speedup divided by the number of processors used.
Do problem 4.1 in the textbook. Note: the body of Chapter 4 refers to Exclusive, Shared, and Invalid states, but this problem refers to Modified, Shared, and Invalid. Assume that Modified means the same thing as the Exclusive state we discussed in class. (If you do not have the book, come to my desk and I will give you a copy of the pages.)
Coming soon.
Do problem 4.5 in the textbook: draw the state diagram that is the equivalent of Figure 4.7 in the textbook, but with both Modified and Exclusive states. In this problem, Exclusive means that only one processor has the data in its cache, but that copy is "clean". (Note that Fig. 4.7 is essentially the combined left and right diagrams in Fig. 4.6.)
Coming soon.

Lecture 8

Consider the 32x32x64 torus of Blue Gene described above. Calculate the degree, diameter, bisection, and average distance of the network.
Degree: 6
Diameter: since it's a torus, in each axis you can only be half the total distance away.
32/2+32/2+64/2 = 64
Bisection: To cut the machine in half, you must cut 32x32x2 = 2048 links.
Average distance: 32. The "average" distance is a slightly funny concept, because it makes assumptions about the traffic pattern. However, in a regular mesh or torus, the average is half the diameter.
A common subroutine in very large simulations calls for matrix transposition. Assume you are using the 65,536 nodes of the 32x32x64 torus to hold 1,048,576x1,048,576 (2^20x2^20) matrix to transpose. Assume each entry is 128 bits (16 bytes).
1. First, assuming each node holds the same amount of data, how much is in each one?
  The total amount of data is 2^44 bytes, or 16TB (terabytes). We have 2^16 nodes, so each processor must hold 2^(44-16) = 2^28 = 256MB, a very reasonable number.
2. If each node holds several columns of the matrix, how much data must it transmit to other nodes in order to perform the matrix transpose? (Hint: almost all...)
3. How many messages must each node send?
  Each node must send one message to each other node, so 65,536-1 = 65,535 messages.
4. Assume each link is 1Gbps, and there is no overhead for a message. How long will the transpose take?
  Our 256MB is 2.14*10^9 bits:
```
octave:4> 256*1024*1024*8
ans =  2.1475e+09
```
  To send that much data along one link would require 2.1475 seconds. Fortunately, since we are sending a message to every node in the system, each message will be routed separately, and we can use all six links from each node.
```
octave:6> 256*1024*1024*8/1000000000/6
ans =  0.35791
```
  Unfortunately, we are not talking to our nearest neighbors. Each message must cross an average of 32 links, as noted above. Thus, our actual final time is
```
octave:7> 256*1024*1024*8/1000000000/6*32
ans =  11.453
```
  about 11.5 seconds.
5. Assume that each link is 1Gbps and that each message has an overhead of 10 microseconds. How long will the transpose take?
  Via the same logic as above, we have 65,535 messages. Each message requires an extra 10usec, so our extra total transmit time is 65535x10usec = 655.35 milliseconds, for a total of 2.1475+0.65535 = 2.8 seconds if we were sending that data on a single link.
```
octave:8> (256*1024*1024*8/1000000000+0.65535)/6*32
ans =  14.948
```

Lecture 9

No homework was assigned.

Lecture 10

Intel says their 80-core processor runs at 1TFLOPS when the clock speed is 3.13GHz. Determine:

FLOPS per processor, assuming all 80 cores are working
1TFLOPS / 80 processors = 12.5GFLOPS per processor.
Average floating point ops/clock cycle

12.5GFLOPS / 3.13 GHz = 4 floating point ops per clock cycle! How can that be? Can it really be more than one operation per clock cycle? Looking at the diagram of the chip, we see

The polygon with the "X" in it is a floating point multiplier, and the one with a "+" in it is a floating point adder. Note that there are two sets of those. Thus, to reach four FP ops/second, all four of those must produce a result every clock cycle!
If each pipeline stall costs the full 8 cycles, what percentage of instructions can stall and still meet that performance?
Based on the above, the pipeline can never stall, or only a very small fraction of the time. Basically, the system must keep all four floating point units (FPUs) running at full speed.
If each pipeline stall costs only one clock cycle, what percentage of instructions can stall and still meet that performance?
Based on the above, the pipeline can never stall, or only a very small fraction of the time. Basically, the system must keep all four floating point units (FPUs) running at full speed.

Lecture 11

This table contains a lot of information on many disk drives from 1975 to 1997, but none since. Pick any recent disk drive and fill in the information.

Answer depends on the disk drive you examine.

For the disk drive you have picked, calculate:
1. How long it will take to read the entire disk sequentially?
  Answer depends on the disk drive you examine. My current laptop disk drive is 160GB and runs at about 24 MB/sec. 160,000/24 = 6,700 seconds, about two hours.
2. How long it will take to read the entire disk one 512 byte sector at a time, in random order?
  Answer depends on the disk drive you examine. My current laptop disk drive is 160GB (320M sectors) and runs at about 100 requests per second, 320,000,000 / 100 = 3,200,000 seconds = about a month!
Who controls the specification for each of these types of buses?
1. Frontside bus
  The CPU and chipset manufacturer, i.e., Intel or AMD.
2. Memory bus
  For example, for DDRAM, the specification is created by JEDEC, an electronics industry standardization body with a hundred or so member companies.
3. PCI
  PCI-SIG, founded in 1992. I mentioned that buses that must be standard have a longer life cycle; PCI was not superseded until 2004, by PCI Express, and even today most machines have a PCI bus.
4. SCSI
  T10

Lecture 12

No homework was assigned.

コンピューター・アーキテクチャ
Computer Architecture

Homework Solutions

By Lecture

Lecture 1

Lecture 2

Lecture 3

Lecture 4

Lecture 5

Lecture 6

Lecture 7

Lecture 8

Lecture 9

Lecture 10

Lecture 11

Lecture 12

Additional Information

その他

コンピューター・アーキテクチャ Computer Architecture

Homework Solutions

By Lecture

Lecture 9

Additional Information

その他

コンピューター・アーキテクチャ
Computer Architecture