Keio University
2007 Academic Year, Fall Semester

コンピューター・アーキテクチャ
Computer Architecture

Fall Semester 2007, Mondays, 3rd period
Course code: XXX / 2 credits
Category:
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Homework Solutions

Note: these are not totally complete yet, but will be shortly. Not all of the problems have exact solutions; many depend on your choices and on the computer you run them on.

The exams are written in both English and Japanese.

By Lecture

Lecture 1

  1. Demonstrate that you have a working compiler setup where you will be able to write programs for class. Capture the output of the compilation and execution of a simple program (such as "Hello, world") and email it to me. Include the amount of time it took you to complete this exercise.
    #include <stdio.h>
    
    int main(void)
    {
    	printf("Hello, world\n");
    	return 0;
    }
    
  2. Determine all of the information that you can about the hardware of your own PC:
    1. CPU manufacturer, type, and clock speed.
      System dependent, but as an example:

      Mine is a 2GHz Intel Core 2 Duo, with two processors in the chip.

      From my Linux machine:

      [rdv@localhost ~]$ cat /proc/cpuinfo 
      processor       : 0
      vendor_id       : GenuineIntel
      cpu family      : 6
      model           : 15
      model name      : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
      stepping        : 6
      cpu MHz         : 1000.000
      cache size      : 4096 KB
      physical id     : 0
      siblings        : 2
      core id         : 0
      cpu cores       : 2
      fdiv_bug        : no
      hlt_bug         : no
      f00f_bug        : no
      coma_bug        : no
      fpu             : yes
      fpu_exception   : yes
      cpuid level     : 10
      wp              : yes
      flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
      bogomips        : 3994.36
      clflush size    : 64
      
      processor       : 1
      vendor_id       : GenuineIntel
      cpu family      : 6
      model           : 15
      model name      : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
      stepping        : 6
      cpu MHz         : 1000.000
      cache size      : 4096 KB
      physical id     : 0
      siblings        : 2
      core id         : 1
      cpu cores       : 2
      fdiv_bug        : no
      hlt_bug         : no
      f00f_bug        : no
      coma_bug        : no
      fpu             : yes
      fpu_exception   : yes
      cpuid level     : 10
      wp              : yes
      flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
      bogomips        : 3990.09
      clflush size    : 64
      
    2. Type, amount, and speed of memory.
      My system has 1GB of DRAM (will have to check the type...)
    3. Size of cache memories.
      (from dmesg on a Linux machine, at startup)
      CPU: L1 I cache: 32K, L1 D cache: 32K
      CPU: L2 cache: 4096K
      CPU: Physical Processor ID: 0
      CPU: Processor Core ID: 0
      
      and
      
      CPU: L1 I cache: 32K, L1 D cache: 32K
      CPU: L2 cache: 4096K
      CPU: Physical Processor ID: 0
      CPU: Processor Core ID: 1
      
      I believe that the L2 cache is shared.
    4. Type, size, and speed of disk (and its I/O interface).
      Mine is a Hitachi 160GB serial ATA (SATA) drive.
      ata1: SATA max UDMA/133 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x000118b0 irq 14
      ata2: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x000118b8 irq 15
      usb 5-1: configuration #1 chosen from 1 choice
      ata1.00: ATA-7: Hitachi HTS541616J9SA00, SB4OC70P, max UDMA/100
      ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (not used)
      ata1.00: configured for UDMA/100
      scsi 0:0:0:0: Direct-Access     ATA      Hitachi HTS54161 SB4O PQ: 0 ANSI: 5
      sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
      sd 0:0:0:0: [sda] Write Protect is off
      sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
      sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or 
      FUA
      sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
      sd 0:0:0:0: [sda] Write Protect is off
      sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
      sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or 
      FUA
      
    5. Can you find out the chip set type?
  3. Determine the exact version of the operating system and compiler you are using.
    [rdv@localhost Desktop]$ uname -a
    Linux localhost.localdomain 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:59 EST 2007 i686 i686 i386 GNU/Linux
    [rdv@localhost Desktop]$ gcc --version
    gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
    Copyright (C) 2006 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    
  4. If you have additional thoughts about the above discussion of programming experience, please email me.
  5. Read the text for next week.
All problems are system-dependent; there is no single "correct" answer.

Lecture 2

  1. Take your "hello, world" program from last time and compile to assembly code.
    System dependent, but as an example:
    	.file	"hello.c"
    	.section	.rodata
    .LC0:
    	.string	"Hello, world"
    	.text
    .globl main
    	.type	main, @function
    main:
    	leal	4(%esp), %ecx
    	andl	$-16, %esp
    	pushl	-4(%ecx)
    	pushl	%ebp
    	movl	%esp, %ebp
    	pushl	%ecx
    	subl	$4, %esp
    	movl	$.LC0, (%esp)
    	call	puts
    	addl	$4, %esp
    	popl	%ecx
    	popl	%ebp
    	leal	-4(%ecx), %esp
    	ret
    	.size	main, .-main
    	.ident	"GCC: (GNU) 4.1.2 20070925 (Red Hat 4.1.2-27)"
    	.section	.note.GNU-stack,"",@progbits
    
    1. What type of processor is your system?
      System dependent, but as an example:

      Mine is a 2GHz Intel Core 2 Duo, with two processors in the chip.

      See above.

    2. Find a copy of the Instruction Set Reference Manual for your processor (it should be available online somewhere).
      Intel's processor manuals are available for download from Intel's web site.
    3. What class of instruction set is your processor, load-store, register-memory, or accumulator? (It's unlikely to be a pure stack machine.)

      x86-based (also called IA-32) architectures, as discussed on page J-45 of Appendix J of the textbook, are extended accumulator instruction set machines. The original 8080 was a straight accumulator instruction set, but over the years many registers have been added. Unfortunately, they are not completely interchangeable; some instructions can use operands in only some of the registers.

    4. Determine if your stack grows up or down. (You may need to use a debugger for this.)
      Stacks on most modern machines, including x86 architectures, grow down.
      int b(int x)
      {
      	return x+2;
      };
      
      void a(int x)
      {
      	b(x);
      }
      
      
      main()
      {
      	a(14);
      }
      
      
      [rdv@localhost architecture]$ gdb stack
      GNU gdb Red Hat Linux (6.6-40.fc8rh)
      Copyright (C) 2006 Free Software Foundation, Inc.
      GDB is free software, covered by the GNU General Public License, and you are
      welcome to change it and/or distribute copies of it under certain conditions.
      Type "show copying" to see the conditions.
      There is absolutely no warranty for GDB.  Type "show warranty" for details.
      This GDB was configured as "i386-redhat-linux-gnu"...
      Using host libthread_db library "/lib/libthread_db.so.1".
      (gdb) start
      Breakpoint 1 at 0x80483c3: file stack.c, line 14.
      Starting program: /home/rdv/keio/sfc/teaching/architecture/stack 
      
      warning: Missing the separate debug info file: /usr/lib/debug/.build-id/ac/2eeb206486bb7315d6ac4cd64de0cb50838ff6.debug
      
      warning: Missing the separate debug info file: /usr/lib/debug/.build-id/ba/4ea1118691c826426e9410cafb798f25cefad5.debug
      main () at stack.c:14
      14              a(14);
      (gdb) p $sp
      $1 = (void *) 0xbf9c02a0
      (gdb) step
      a (x=14) at stack.c:8
      8               b(x);
      (gdb) p x
      $2 = 14
      (gdb) p &x
      $3 = (int *) 0xbf9c02a0
      (gdb) p $sp
      $4 = (void *) 0xbf9c0294
      (gdb) step
      b (x=14) at stack.c:3
      3               return x+2;
      (gdb) p x
      $5 = 14
      (gdb) p &x
      $6 = (int *) 0xbf9c0294
      (gdb) 
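
      A quicker check that avoids the debugger is to compare the addresses of local variables in nested calls. This is my own sketch, not part of the original solution; strictly speaking, comparing addresses of different objects is not guaranteed by the C standard, but it works on common platforms:
      #include <stdio.h>
      
      void callee(char *caller_local)
      {
        char callee_local;
      
        /* On a downward-growing stack, the callee's frame sits at lower addresses. */
        printf("caller local at %p, callee local at %p\n",
               (void *)caller_local, (void *)&callee_local);
        printf("the stack grows %s\n",
               &callee_local < caller_local ? "down" : "up");
      }
      
      int main(void)
      {
        char caller_local;
      
        callee(&caller_local);
        return 0;
      }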
      
    5. How many different opcodes are there for your processor?
      Depending on the exact form, "hundreds" for the x86.
      How long is the opcode field in an instruction?
      Again, in x86, variable. Some are only 8 bits, others are wider.
    6. How many different addressing modes does your processor have?
      Again from Appendix J,
      • absolute
      • register indirect
      • based
      • indexed
      • based indexed with displacement
      • based with scaled indexed
      • based with scaled indexed and displacement
      You can see, from the large number of opcodes, restrictions on use of registers, numerous addressing modes, and many, many variants of the basic architecture, that writing a good compiler for an IA-32 machine is very hard!
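      As a small illustration (my own sketch, not from the original solution), an indexed array access in C typically compiles to one of these modes:
      /* With gcc -O on IA-32, the load below typically becomes something like
         movl (%edx,%eax,4), %eax -- "based with scaled index" addressing. */
      int element(int *a, int i)
      {
        return a[i];
      }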
  2. Read the text for next week.

Lecture 3

This week's homework (submit via email):

  1. Tell me what decimal values the following floating point numbers represent:
    Remember this image from the lecture:
    e as a floating point number: 01000000001011011111100001010100
    These values can be played with directly inside a debugger, such as gdb:
    (gdb) list
    3	main()
    4	{
    5	  float a = 1.1, b = 1.2;
    6	  int i;
    7	
    8	  for ( i = 0 ; i < 100000000 ; i++ ) {
    9	    a = expf(1.0);
    10	    a*=b;
    11	    b*=a;
    12	    a*=b;
    (gdb) ptype a
    type = float
    (gdb) whatis a
    type = float
    (gdb) x/wt &a
    0xbf98f228:	01000000010010010000111111010000
    (gdb) set var a=3.1415926
    (gdb) x/wt &a
    0xbf98f228:	01000000010010010000111111011010
    (gdb) p a
    $3 = 3.1415925
    (gdb) 
    
    1. 00000000000000000000000000000000
      0.0
    2. 00111111100000000000000000000000
      1.0
      sign (符号部) (one bit): 0
      significand (仮数部) (twenty-three bits): 0b00000000000000000000000 = 1.0....(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
      exponent (指数部) (eight bits): 0b01111111 (バイアスは127で覚えて)(remember the bias of 127)
    3. 10111111100000000000000000000000
      -1.0
      sign (符号部) (one bit): 1
      significand (仮数部) (twenty-three bits): 0b00000000000000000000000 = 1.0....(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
      exponent (指数部) (eight bits): 0b01111111 (バイアスは127で覚えて)(remember the bias of 127)
    4. 00111111000000000000000000000000
      0.5
      sign (符号部) (one bit): 0
      significand (仮数部) (twenty-three bits): 0b00000000000000000000000 = 1.0....(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
      exponent (指数部) (eight bits): 0b01111110 (バイアスは 127で覚えて)(remember the bias of 127)
    5. 01000000000000000000000000000000
      2.0
      sign (符号部) (one bit): 0
      significand (仮数部) (twenty-three bits): 0b00000000000000000000000=1.0...(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
      exponent (指数部) (eight bits): 0b10000000 (バイアスは127で覚えて)(remember the bias of 127)
    6. 01000000010000000000000000000000
      3.0
      sign (符号部) (one bit): 0
      significand (仮数部) (twenty-three bits): 0b10000000000000000000000=1.1...(binary) (ひとつのビットはかくれていると覚えて)(remember the hidden bit)
      exponent (指数部) (eight bits): 0b10000000 (バイアスは127で覚えて)(remember the bias of 127)
    7. 01000000010000000000000000000001
      3.0+2^(-22) (the significand's last bit is worth 2^(-23), scaled by the exponent of 2^1)
      (gdb) set var b = 3.0000002
      (gdb) x/wt &b
      0xbf98f22c:	01000000010000000000000000000001
      
    8. 01000000010010010000111111011010
      3.1415926
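    One way to check these decodings by hand (a small sketch of my own, not part of the original assignment) is to print the three fields of a float directly:
    #include <stdio.h>
    #include <string.h>
    
    int main(void)
    {
      float f = 3.1415926f;            /* try any of the values above */
      unsigned int bits;
    
      memcpy(&bits, &f, sizeof bits);  /* reinterpret the float's 32 bits */
      printf("sign=%u exponent=0x%02x significand=0x%06x\n",
             bits >> 31, (bits >> 23) & 0xff, bits & 0x7fffff);
      return 0;
    }
    For 3.1415926f this prints sign=0 exponent=0x80 significand=0x490fda, which matches the last bit pattern above.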
  2. Write a program to test the speed of your processor on integer arithmetic. Test the four basic functions: add, subtract, multiply, divide. If you are using a Unix-like machine (or Cygwin), you may use the "time" command.
    Here is an example for integer addition:
    [rdv@dhcp-143-221 architecture]$ cat int.c
    main()
    {
      int a = 7, b = 2, c = 5, i;
    
      for ( i = 0 ; i < 100000000 ; i++ ) {
        c += a;
        a += b;
        b += c;
        a += a;
        a += a;
        b += b;
        c += a;
        a += b;
        b += a;
        c += a;
        c += a;
        a += b;
        b += c;
        a += a;
        a += a;
        b += b;
        c += a;
        a += b;
        b += a;
        c += a;
        a = 7;
        b = 2;
        c = 5;
      }
    }
    
        
    [rdv@dhcp-143-221 architecture]$ gcc -o int int.c
    [rdv@dhcp-143-221 architecture]$ time ./int
    
    real    0m3.839s
    user    0m3.828s
    sys     0m0.010s
    [rdv@dhcp-143-221 architecture]$ time ./int
    
    real    0m3.859s
    user    0m3.846s
    sys     0m0.011s
    [rdv@dhcp-143-221 architecture]$ time ./int
    
    real    0m3.815s
    user    0m3.809s
    sys     0m0.003s
    

    The inner loop contains twenty integer additions, and the loop is executed 100,000,000 times, so the program performs 2 billion integer additions. 2E9 / 3.8 seconds = 525M integer additions/second. On a 2GHz processor, that is one result every 3.8 clock cycles.

    Similar programs are easily written for the other operations.
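    As one example of the other operations, here is a division version. This is my own sketch along the same lines; compile it without optimization (plain "gcc", as above), or the compiler may remove the loop entirely, and note that the divisors are kept variable so constant divides are not strength-reduced away:
    int main(void)
    {
      int a = 1000000000, b = 7, c = 0, i;
    
      for ( i = 0 ; i < 100000000 ; i++ ) {
        c = a / b;
        c = a / (b + 1);
        c = a / (b + 2);
        c = a / (b + 3);
        c = a / (b + 4);
        c = a / (b + 5);
        c = a / (b + 6);
        c = a / (b + 7);
        c = a / (b + 8);
        c = a / (b + 9);
      }
      return c;   /* use the result so it cannot be discarded */
    }
    Time it with the same "time ./..." command as above; ten divides per iteration times 100,000,000 iterations is one billion divisions, so divide that count by the elapsed time to get divisions per second.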

  3. Write a program to test the speed of your processor on floating-point arithmetic. Test the four basic functions: add, subtract, multiply, divide.
    [rdv@dhcp-143-221 architecture]$ cat fp.c
    #include <math.h>
    
    main()
    {
      float a = 1.1, b = 1.2;
      int i;
    
      for ( i = 0 ; i < 100000000 ; i++ ) {
        a*=b;
        b*=a;
        a*=b;
        b*=a;
        a*=b;
        b*=a;
        a*=b;
        b*=a;
        a*=b;
        b*=a;
        a = 1.1; b = 1.2;
      }
    }
    [rdv@dhcp-143-221 architecture]$ gcc -o fp fp.c
    [rdv@dhcp-143-221 architecture]$ time ./fp
    
    real    0m3.026s
    user    0m3.016s
    sys     0m0.009s
    [rdv@dhcp-143-221 architecture]$ time ./fp
    
    real    0m3.070s
    user    0m3.059s
    sys     0m0.011s
    [rdv@dhcp-143-221 architecture]$ time ./fp
    
    real    0m2.938s
    user    0m2.936s
    sys     0m0.002s
    [rdv@dhcp-143-221 architecture]$ time ./fp
    
    real    0m2.917s
    user    0m2.907s
    sys     0m0.010s
    

    The inner loop contains ten floating-point multiplies, and the loop is executed 100,000,000 times, so the program performs one billion floating point multiplies. One billion FP multiplies in 3 seconds is 3.3E8 FP multiplies/second, or one every 6 clock cycles on a 2.0GHz processor.

    However, there are a bunch of data dependencies in that code: if we double the number of FP multiplies, the amount of time less than doubles.

    [rdv@dhcp-143-221 architecture]$ cat fp2.c
    #include <math.h>
    
    main()
    {
            float a = 1.1, b = 1.2, c = 1.3, d = 1.4;
      int i;
    
      for ( i = 0 ; i < 100000000 ; i++ ) {
        a*=b;
        c*=d;
        b*=a;
        d*=c;
        a*=b;
        c*=d;
        b*=a;
        d*=c;
        a*=b;
        c*=d;
        b*=a;
        d*=c;
        a*=b;
        c*=d;
        b*=a;
        d*=c;
        a*=b;
        c*=d;
        b*=a;
        d*=c;
        a = 1.1; b = 1.2; c = 1.3; d = 1.4;
      }
    }
    [rdv@dhcp-143-221 architecture]$ gcc -o fp2 fp2.c
    [rdv@dhcp-143-221 architecture]$ time ./fp2
    
    real    0m5.073s
    user    0m5.068s
    sys     0m0.002s
    [rdv@dhcp-143-221 architecture]$ time ./fp2
    
    real    0m5.142s
    user    0m5.139s
    sys     0m0.001s
    [rdv@dhcp-143-221 architecture]$ time ./fp2
    
    real    0m5.085s
    user    0m5.079s
    sys     0m0.002s
    

    In that example, 2 billion FP multiplies are executed in about 5 seconds, for 4E8 FP multiplies/second, one every 5 clock cycles.

  4. Test the accuracy and precision of floating point arithmetic on your processor.
    Notice the line from the example above:
    (gdb) set var a=3.1415926
    (gdb) x/wt &a
    0xbf98f228:	01000000010010010000111111011010
    (gdb) p a
    $3 = 3.1415925
    

    The last digit is different. That suggests there is a precision problem, though at this point its source is not identified; it could even be in the routine that converts floating point binary to human-readable ASCII.
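
    In fact, a float carries only about seven significant decimal digits, so 3.1415926 cannot be stored exactly; the nearest float is about 3.14159250, and gdb prints that stored value correctly rounded. A quick check (my own sketch, assuming IEEE-754 single precision):
    #include <stdio.h>
    #include <float.h>
    
    int main(void)
    {
      float a = 3.1415926f;
    
      printf("FLT_DIG = %d\n", FLT_DIG);   /* reliable decimal digits in a float: 6 */
      printf("a = %.9f\n", a);             /* prints 3.141592503 with IEEE-754 floats */
      return 0;
    }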

    You can see from the code and output below that it is possible for floating point operations to become inaccurate after only a few dozen divisions, or additions of small numbers to large ones.

    [rdv@dhcp-143-221 architecture]$ cat double-double-double.cpp 
    // first, the standard libraries
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <set>
    #include <map>
    #include <cmath>
    #include <iterator>
    #include <vector>
    
    #define FALSE 0
    #define TRUE 1
    
    int main()
    {
      float f = 1.0;
      double a = 1.0;
      long double b = 1.0;
      long double c = 1.0;
    
      int i;
      int reported, reported2;
    
      std::cout.precision(50);
      std::cout << "float (" << sizeof(f)*8 << "bits):" << std::endl;
      i = 0; reported = FALSE; reported2 = FALSE;
      while (1) {
        float fprime = f/2;
        // n.b.: where this reports will depend on the architecture, the compiler,
        // and the compiler optimization switches!  Might optimize to f - f^2,
        // in which case the question is, when does f^2 == 0
        // but an expression similar to this likely occurs when manipulating
        // fidelities in a quantum simulation, since the fidelities sum to 1.0.
        if (!reported && (1.0-f)*f == f) {
          std::cout << "After " << i << " iterations, ((1.0-f)*f == f) at f = " << f << std::endl;
          reported = TRUE;
        }
        // try f^2 directly.
        // Note that on many machines this will *never* trip because
        // the CPU internal floating format is double, and we'll run out of precision
        // on the divide by two first
        if (!reported2 && f*f == 0) {
          std::cout << "f^2 becomes zero at " << i << " iterations" << std::endl;
          reported2 = TRUE;
        }
        if (f != 2*fprime) {
          std::cout << "After " << i << " iterations: " << f << std::endl;
          break;
        }
        f = fprime;
        i++;
      }
      std::cout << "double (" << sizeof(a)*8 << "bits):" << std::endl;
      i = 0; reported = FALSE; reported2 = FALSE;
      while (1) {
        double aprime = a/2;
        if (!reported && (1.0-a)*a == a) {
          std::cout << "After " << i << " iterations, ((1.0-a)*a == a) at a = " << a << std::endl;
          reported = TRUE;
        }
        // try a^2 directly
        if (!reported2 && a*a == 0) {
          std::cout << "a^2 becomes zero at " << i << " iterations" << std::endl;
          reported2 = TRUE;
        }
        if (a != 2*aprime) {
          std::cout << "After " << i << " iterations: " << a << std::endl;
          break;
        }
        a = aprime;
        i++;
      }
      std::cout << "long double (" << sizeof(b)*8 << "bits):" << std::endl;
      i = 0; reported = FALSE; reported2 = FALSE;
      while (1) {
        long double bprime = b/2;
        if (!reported && (1.0-b)*b == b) {
          std::cout << "After " << i << " iterations, ((1.0-b)*b == b) at b = " << b << std::endl;
          reported = TRUE;
        }
        // try b^2 directly
        if (!reported2 && b*b == 0) {
          std::cout << "b^2 becomes zero at " << i << " iterations" << std::endl;
          reported2 = TRUE;
        }
        if (b != 2*bprime) {
          std::cout << "After " << i << " iterations: " << b << std::endl;
          std::cout << b << std::endl;
          break;
        }
        b = bprime;
        i++;
      }
      std::cout << "long double (" << sizeof(c)*8 << "bits):" << std::endl;
      i = 0; reported = FALSE; reported2 = FALSE;
      while (1) {
        long double cprime = c/2;
        if (!reported && (1.0-c)*c == c) {
          std::cout << "After " << i << " iterations, ((1.0-c)*c == c) at c = " << c << std::endl;
          reported = TRUE;
        }
        // try c^2 directly
        if (!reported2 && c*c == 0) {
          std::cout << "c^2 becomes zero at " << i << " iterations" << std::endl;
          reported2 = TRUE;
        }
        if (c != 2*cprime) {
          std::cout << "After " << i << " iterations: " << c << std::endl;
          std::cout << c << std::endl;
          break;
        }
        c = cprime;
        i++;
      }
    }
    [rdv@dhcp-143-221 architecture]$ g++ -o double-double-double double-double-double.cpp 
    [rdv@dhcp-143-221 architecture]$ ./double-double-double 
    float (32bits):
    After 65 iterations, ((1.0-f)*f == f) at f = 2.710505431213761085018632002174854278564453125e-20
    After 149 iterations: 1.4012984643248170709237295832899161312802619418765e-45
    double (64bits):
    After 65 iterations, ((1.0-a)*a == a) at a = 2.710505431213761085018632002174854278564453125e-20
    After 1074 iterations: 4.9406564584124654417656879286822137236505980261432e-324
    long double (96bits):
    After 65 iterations, ((1.0-b)*b == b) at b = 2.710505431213761085018632002174854278564453125e-20
    b^2 becomes zero at 8223 iterations
    After 16445 iterations: 3.6451995318824746025284059336194198163990508156936e-4951
    3.6451995318824746025284059336194198163990508156936e-4951
    long double (96bits):
    After 65 iterations, ((1.0-c)*c == c) at c = 2.710505431213761085018632002174854278564453125e-20
    c^2 becomes zero at 8223 iterations
    After 16445 iterations: 3.6451995318824746025284059336194198163990508156936e-4951
    3.6451995318824746025284059336194198163990508156936e-4951
    

Lecture 4

  1. Find and describe a real-world pipeline. Include:
    Choice dependent
    1. The number of stages
    2. Functionality of each stage
    3. Interlocking between stages
    4. Any hazards
    5. How balance in execution time is maintained
  2. We noted right at the end of class that pipeline hazards equate to arrows flowing right to left on the figure above. Identify the arrows on the diagram above by type and indicate the maximum delay that the hazard can cause.
    The upper reverse arrow is a control hazard, and the lower one is a data hazard. Both delay the pipeline four clock cycles.
  3. The three pipeline programs we "executed" during class today are linked to below. Calculate the following for each:
    1. The number of instructions that must be executed. Don't forget to account for the loop in program 3. (n.b.: the #-28 in the branch is decimal!)
    2. The number of clock cycles the entire program takes, accounting for data and control hazards.
    3. The average clock cycles per instruction (CPI) for the three programs.
    First program:
    10 instructions
    no hazards
    time = 10 clock cycles + 4 for pipeline delay = 14 clock cycles
    CPI = 14/10 = 1.4
    

    Second program:

    The hazards have to be worked out carefully, but the normal delay caused by the data hazard is two clock cycles.

    10 instructions
    one data hazard, instruction 0x814, +2 clock cycles
    time = 14+2 = 16 clock cycles
    CPI = 16/10 = 1.6
    

    Third program:

    As with data hazards, the normal delay caused by the control hazard is two clock cycles.

    n.b.: There is a bug in my program! The loop is seven instructions long (from the MUL at 0x814 to the BEQZ at 0x82C). It was supposed to be executed three times (with R1=2, 1, 0) but the instruction should have been a BNEZ instead of BEQZ. If you used either, you will get full credit.

    Taking the branch:

    5+3*7+2 = 28 instructions
    one control hazard, instruction 0x82C, +2 clock cycles
    per taken loop (twice!)
    time = 28 instructions issued + 2 stalls * 2 cycles + 4 for pipeline
          latency = 36 clock cycles
    CPI = 36/28 = 1.29
    

    Not taking the branch:

    5+1*7+2 = 14 instructions
    no control hazards
    time = 14 instructions issued + 0 stalls * 2 cycles + 4 for pipeline
          latency = 18 clock cycles
    CPI = 18/14 = 1.29
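
    The bookkeeping is the same in every case: total cycles = instructions issued + stall cycles + 4 cycles for the last instruction to drain the five-stage pipeline. A small checker (my own sketch, not part of the assignment):
    #include <stdio.h>
    
    static void cpi(const char *name, int instrs, int stall_cycles)
    {
      int cycles = instrs + stall_cycles + 4;   /* +4 to drain the pipeline */
    
      printf("%-22s %2d instructions, %2d cycles, CPI = %.2f\n",
             name, instrs, cycles, (double)cycles / instrs);
    }
    
    int main(void)
    {
      cpi("program 1", 10, 0);
      cpi("program 2", 10, 2);            /* one data hazard, 2 cycles */
      cpi("program 3, taken", 28, 4);     /* two taken branches, 2 cycles each */
      cpi("program 3, not taken", 14, 0);
      return 0;
    }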
    

Lecture 5

  1. Clearly we need more practice with hexadecimal notation and bit fields. Assuming the above example of the AMD Opteron processor, with 40-bit physical addresses, identify the tag, index, and block offset of the following addresses:
    1. 0x1000
      0x1000 = 0b0000 0000 0000 0000 0000 0000 0001 0000 0000 0000
      tag (25 bits): 0b0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __00 0000 = 0x0
    2. 0x1004
      0x1004 = 0b0000 0000 0000 0000 0000 0000 0001 0000 0000 0100
      tag (25 bits): 0b0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __00 0100 = 0x04
    3. 0x1044
      0x1044 = 0b0000 0000 0000 0000 0000 0000 0001 0000 0100 0100
      tag (25 bits): 0b0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 0000 01__ ____ = 0x41
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __00 0100 = 0x04
    4. 0x1049
      0x1049 = 0b0000 0000 0000 0000 0000 0000 0001 0000 0100 1001
      tag (25 bits): 0b0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 0000 01__ ____ = 0x41
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __00 1001 = 0x09
    5. 0x81231000
      0x81231000 = 0b0000 0000 1000 0001 0010 0011 0001 0000 0000 0000
      tag (25 bits): 0b0000 0000 1000 0001 0010 0011 0___ ____ ____ ____ = 0x10246
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __00 0000 = 0x0
    6. 0xA1239004
      0xA1239004 = 0b0000 0000 1010 0001 0010 0011 1001 0000 0000 0100
      tag (25 bits): 0b0000 0000 1010 0001 0010 0011 1___ ____ ____ ____ = 0x14247
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __00 0100 = 0x04
    7. 0xA1239913
      0xA1239913 = 0b0000 0000 1010 0001 0010 0011 1001 1001 0001 0011
      tag (25 bits): 0b0000 0000 1010 0001 0010 0011 1___ ____ ____ ____ = 0x14247
      index (9 bits): 0b____ ____ ____ ____ ____ ____ _001 1001 00__ ____ = 0x64
      block offset (6 bits): 0b____ ____ ____ ____ ____ ____ ____ ____ __01 0011 = 0x13
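    The same decomposition can be automated; here is a small sketch (mine, assuming the 6-bit block offset and 9-bit index of the Opteron example) for checking the answers above:
    #include <stdio.h>
    
    int main(void)
    {
      unsigned long long addrs[] = { 0x1000, 0x1004, 0x1044, 0x1049,
                                     0x81231000, 0xA1239004, 0xA1239913 };
      int i;
    
      for ( i = 0 ; i < 7 ; i++ ) {
        unsigned long long a = addrs[i];
    
        /* offset = low 6 bits, index = next 9 bits, tag = remaining high bits */
        printf("0x%010llx: tag 0x%llx, index 0x%03llx, offset 0x%02llx\n",
               a, a >> 15, (a >> 6) & 0x1ff, a & 0x3f);
      }
      return 0;
    }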
  2. Assume a system with a 1GHz system clock, only one level of cache, and that L1 access is 1 clock cycle and main memory is 50 clock cycles.
    1. Plot the average memory access time as a function of hit rate.
      Average Memory Access Time plot
      gnuplot> load "avg-mem-access-time.gp"
      gnuplot> 
      [1]+  Stopped                 gnuplot
      [rdv@new-host architecture]$ cat avg-mem-access-time.gp
      reset
      clear
      set xlabel "Cache Hit Rate"
      set ylabel "nsec"
      set title "Average Memory Access Time"
      set term post eps mono "Helvetica" 24
      set out "avg-mem-access-time.eps"
      plot [0:1.0] 1+(1-x)*49 notitle
      
    2. Plot the expected completion time for one billion instructions as a function of cache hit rate.
      Same graph, but multiplied by one billion (assuming one memory access per instruction).

Lecture 6

  1. Write down the equation for calculating the average memory access time for a two-level cache.
    If the equation for one level of cache is
    Avg. memory access time = Hit time +
    Miss rate × Miss penalty
    then the equation for two levels must be something like
    Avg. memory access time = (L1 hit rate) × (L1 hit time) +
    (L2 hit rate) × (L2 hit time) +
    (Miss rate) × (main memory latency)
    where the L2 hit rate and the miss rate are global fractions of all accesses, and the L2 hit time and main memory latency are measured from the start of the access. Written with local miss rates, as the textbook does, the equivalent form is
    Avg. memory access time = L1 hit time +
    L1 miss rate × (L2 hit time + L2 miss rate × main memory miss penalty)
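    For instance, using the latencies from problem 2 below (L1 hit in 1 cycle, L2 in 7 cycles, main memory in 84 cycles) and hit rates that are purely my own assumption (95% for L1, 80% of the remainder for L2), the first form works out as follows:
    #include <stdio.h>
    
    int main(void)
    {
      double l1_hit = 0.95;                    /* assumed global L1 hit rate */
      double l2_hit = (1.0 - l1_hit) * 0.80;   /* assumed global L2 hit rate */
      double miss   = 1.0 - l1_hit - l2_hit;   /* fraction that goes to memory */
      double amat   = l1_hit * 1 + l2_hit * 7 + miss * 84;
    
      printf("average memory access time = %.2f cycles\n", amat);
      return 0;
    }
    With those assumed rates the result is 0.95 + 0.28 + 0.84 = 2.07 cycles.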
  2. Following the example from our lecture on pipelining, suggest a modification to the five-stage pipeline to accommodate the memory architecture above.
    I'm afraid this question is not very well written. My apologies. It's not even clear which "above" memory architecture I meant.
    1. How many stages must the pipeline have to support the L1 cache?
      If the L1 cache responds in 1 clock cycle, the five-stage pipeline is fine. However, if L1 takes 2 clock cycles, we must delay the write back stage by one cycle.
    2. When data must be retrieved from L2 cache, how many cycles will the processor stall?
      If the L1 cache responds in 1 clock cycle and the L2 cache takes 7 cycles, then the five-stage pipeline will stall for 6 clock cycles before the write back stage completes.
    3. Using one of the examples from Fig. 5.28 from the text, how many cycles will the processor stall to retrieve data from main memory?
      Assume that DRAM latency is 30nsec, and that the processor is e.g. the 2.8GHz AMD Opteron in the first column of Fig. 5.28. At 2.8GHz, each clock cycle is about 360psec. 30nsec/360psec = 84 clock cycles. The pipeline depth is not specified, but if we assume the pipeline absorbs the first 2 clock cycles of the access without stalling, the maximum delay waiting for memory would be 84-2 = 82 clock cycles.
  3. Suggest an alternative to waiting for main memory. Can the processor achieve useful work while waiting?
    This is the area of much computer architecture research. Ideas include multi-threaded architectures, out-of-order execution, etc.

Lecture 7

  1. Plot the expected speedup for a parallel system as the number of processors increases. Assume the workload is 1 unit of unparallelizable work, followed by 1000 units of parallelizable work.
    Note that this example is not very realistic; it's rare to get an application that scales this simply.
    1. First, plot the absolute speedup as 1 to 1000 processors are used. I recommend a log-log plot.

      Although I recommended a log-log plot, on second thought, it looks better on a linear plot.

      If there is only one processor, 1001 units of time will be required. With more than one, 1+1000/x time units are required.
      gnuplot> plot [1:1000] 1001/(1+1000/x)
      
      Speedup
    2. Next, plot the efficiency, as a percentage: the speedup divided by the number of processors used.
      Speedup Efficiency
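      The numbers behind both plots can also be tabulated directly (a small sketch of my own of the same arithmetic):
      #include <stdio.h>
      
      int main(void)
      {
        int procs[] = { 1, 2, 10, 100, 1000 };
        int i;
      
        for ( i = 0 ; i < 5 ; i++ ) {
          double t = 1.0 + 1000.0 / procs[i];   /* 1 serial unit + parallel part */
          double speedup = 1001.0 / t;
      
          printf("p = %4d  speedup = %7.2f  efficiency = %5.1f%%\n",
                 procs[i], speedup, 100.0 * speedup / procs[i]);
        }
        return 0;
      }
      At 1000 processors the speedup is 500.5, so the efficiency has already fallen to about 50%.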
  2. Do problem 4.1 in the textbook. Note: the body of Chapter 4 refers to Exclusive, Shared, and Invalid states, but this problem refers to Modified, Shared, and Invalid. Assume that Modified means the same thing as the Exclusive state we discussed in class. (If you do not have the book, come to my desk and I will give you a copy of the pages.)
    Coming soon.
  3. Do problem 4.5 in the textbook: draw the state diagram that is the equivalent of Figure 4.7 in the textbook, but with both Modified and Exclusive states. In this problem, Exclusive means that only one processor has the data in its cache, but that copy is "clean". (Note that Fig. 4.7 is essentially the combined left and right diagrams in Fig. 4.6.)
    Coming soon.

Lecture 8

  1. Consider the 32x32x64 torus of Blue Gene described above. Calculate the degree, diameter, bisection, and average distance of the network.
    Degree: 6
    Diameter: since it's a torus, in each axis you can only be half the total distance away.
    32/2+32/2+64/2 = 64
    Bisection: To cut the machine in half, you must cut 32x32x2 = 2048 links.
    Average distance: 32. The "average" distance is a slightly funny concept, because it makes assumptions about the traffic pattern. However, in a regular mesh or torus, the average is half the diameter.
  2. A common subroutine in very large simulations calls for matrix transposition. Assume you are using the 65,536 nodes of the 32x32x64 torus to hold a 1,048,576x1,048,576 (2^20x2^20) matrix to transpose. Assume each entry is 128 bits (16 bytes).
    1. First, assuming each node holds the same amount of data, how much is in each one?
      The total amount of data is 2^44 bytes, or 16TB (terabytes). We have 2^16 nodes, so each processor must hold 2^(44-16) = 2^28 = 256MB, a very reasonable number.
    2. If each node holds several columns of the matrix, how much data must it transmit to other nodes in order to perform the matrix transpose? (Hint: almost all...)
      There are 2^20 columns in the matrix, and 2^16 nodes, so each node holds 2^(20-16) = 2^4 = 16 columns, or a 16x1,048,576 chunk of the matrix.

      To transpose the matrix, each node must send all of the data in those columns except the part that remains in the node itself. One 16x16 chunk of the matrix (256 entries x 16 bytes = 4096 bytes) stays put, but the rest of the data, 256MB - 4KB, must be sent to other nodes.

    3. How many messages must each node send?
      Each node must send one message to each other node, so 65,536-1 = 65,535 messages.
    4. Assume each link is 1Gbps, and there is no overhead for a message. How long will the transpose take?
      Our 256MB is 2.14*10^9 bits:
      octave:4> 256*1024*1024*8
      ans =  2.1475e+09
      
      To send that much data along one link would require 2.1475 seconds. Fortunately, since we are sending a message to every node in the system, each message will be routed separately, and we can use all six links from each node.
      octave:6> 256*1024*1024*8/1000000000/6
      ans =  0.35791
      
      Unfortunately, we are not talking to our nearest neighbors. Each message must cross an average of 32 links, as noted above. Thus, our actual final time is
      octave:7> 256*1024*1024*8/1000000000/6*32
      ans =  11.453
      
      about 11.5 seconds.
    5. Assume that each link is 1Gbps and that each message has an overhead of 10 microseconds. How long will the transpose take?
      Via the same logic as above, we have 65,535 messages. Each message requires an extra 10usec, so our extra total transmit time is 65535x10usec = 655.35 milliseconds, for a total of 2.1475+0.65535 = 2.8 seconds if we were sending that data on a single link.
      octave:8> (256*1024*1024*8/1000000000+0.65535)/6*32
      ans =  14.948
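      The two octave calculations can be combined into one small program (my own sketch of the same arithmetic):
      #include <stdio.h>
      
      int main(void)
      {
        double bytes = 256.0 * 1024 * 1024;   /* data each node must send */
        double link_bps = 1e9, links = 6, avg_hops = 32;
        double msgs = 65535, overhead = 10e-6;
      
        printf("no per-message overhead: %.1f seconds\n",
               bytes * 8 / link_bps / links * avg_hops);
        printf("10 usec per message:     %.1f seconds\n",
               (bytes * 8 / link_bps + msgs * overhead) / links * avg_hops);
        return 0;
      }
      This prints about 11.5 and 14.9 seconds, matching the octave results above.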
      

Lecture 9

No homework was assigned.

Lecture 10

Intel says their 80-core processor runs at 1TFLOPS when the clock speed is 3.13GHz. Determine:
  1. FLOPS per processor, assuming all 80 cores are working
    1TFLOPS / 80 processors = 12.5GFLOPS per processor.
  2. Average floating point ops/clock cycle

    12.5GFLOPS / 3.13 GHz = 4 floating point ops per clock cycle! How can that be? Can it really be more than one operation per clock cycle? Looking at the diagram of the chip, we see

    (Figure: portion of the PC Perspective diagram of the Intel 80-core chip showing the FPUs)

    The polygon with the "X" in it is a floating point multiplier, and the one with a "+" in it is a floating point adder. Note that there are two sets of those. Thus, to reach four FP ops per clock cycle, all four of those units must produce a result every clock cycle!

  3. If each pipeline stall costs the full 8 cycles, what percentage of instructions can stall and still meet that performance?
    Based on the above, the pipeline can never stall, or only a very small fraction of the time. Basically, the system must keep all four floating point units (FPUs) running at full speed.
  4. If each pipeline stall costs only one clock cycle, what percentage of instructions can stall and still meet that performance?
    Based on the above, the pipeline can never stall, or only a very small fraction of the time. Basically, the system must keep all four floating point units (FPUs) running at full speed.

Lecture 11

  1. This table contains a lot of information on many disk drives from 1975 to 1997, but none since. Pick any recent disk drive and fill in the information.
    Answer depends on the disk drive you examine.
  2. For the disk drive you have picked, calculate:
    1. How long will it take to read the entire disk sequentially?
      Answer depends on the disk drive you examine. My current laptop disk drive is 160GB and runs at about 24 MB/sec. 160,000/24 = 6,700 seconds, about two hours.
    2. How long will it take to read the entire disk one 512-byte sector at a time, in random order?
      Answer depends on the disk drive you examine. My current laptop disk drive is 160GB (320M sectors) and runs at about 100 requests per second, 320,000,000 / 100 = 3,200,000 seconds = about a month!
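      Both estimates for my example drive, combined into one program (the 24 MB/sec and 100 requests/sec figures are rough numbers for my own drive; substitute your drive's values):
      #include <stdio.h>
      
      int main(void)
      {
        double size_bytes = 160e9;    /* 160 GB drive */
        double seq_bps = 24e6;        /* ~24 MB/sec sequential transfer */
        double iops = 100;            /* ~100 random requests/sec */
        double sectors = size_bytes / 512;
      
        printf("sequential read:        %.0f sec (about %.1f hours)\n",
               size_bytes / seq_bps, size_bytes / seq_bps / 3600);
        printf("random 512-byte reads:  %.0f sec (about %.0f days)\n",
               sectors / iops, sectors / iops / 86400);
        return 0;
      }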
  3. Who controls the specification for each of these types of buses?
    1. Frontside bus
      The CPU and chipset manufacturer, i.e., Intel or AMD.
    2. Memory bus
      For example, for DDR SDRAM, the specification is created by JEDEC, an electronics industry standardization body with a hundred or so member companies.
    3. PCI
      The PCI-SIG, founded in 1992. I mentioned in class that buses that must be industry standards have a longer life cycle; PCI was not superseded until 2004, by PCI Express, and even today most machines still have a PCI bus.
    4. SCSI
      T10

Lecture 12

No homework was assigned.
