Note: these are not totally complete yet, but will be shortly. Not all of the problems have exact solutions; they will depend on your choices and on the computer you try to run them on.
The exam is written in both English and Japanese.
#include <stdio.h>

main()
{
  printf("Hello, world\n");
}
Mine is a 2GHz Intel Core 2 Duo, with two processors in the chip.
From my Linux machine:
[rdv@localhost ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping        : 6
cpu MHz         : 1000.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3994.36
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping        : 6
cpu MHz         : 1000.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3990.09
clflush size    : 64
(from dmesg on a Linux machine, at startup)

CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0

and

CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1

I believe that the L2 cache is shared.
ata1: SATA max UDMA/133 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x000118b0 irq 14
ata2: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x000118b8 irq 15
usb 5-1: configuration #1 chosen from 1 choice
ata1.00: ATA-7: Hitachi HTS541616J9SA00, SB4OC70P, max UDMA/100
ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (not used)
ata1.00: configured for UDMA/100
scsi 0:0:0:0: Direct-Access     ATA      Hitachi HTS54161 SB4O PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[rdv@localhost Desktop]$ uname -a
Linux localhost.localdomain 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:59 EST 2007 i686 i686 i386 GNU/Linux
[rdv@localhost Desktop]$ gcc --version
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
.file "hello.c" .section .rodata .LC0: .string "Hello, world" .text .globl main .type main, @function main: leal 4(%esp), %ecx andl $-16, %esp pushl -4(%ecx) pushl %ebp movl %esp, %ebp pushl %ecx subl $4, %esp movl $.LC0, (%esp) call puts addl $4, %esp popl %ecx popl %ebp leal -4(%ecx), %esp ret .size main, .-main .ident "GCC: (GNU) 4.1.2 20070925 (Red Hat 4.1.2-27)" .section .note.GNU-stack,"",@progbits
Mine is a 2GHz Intel Core 2 Duo, with two processors in the chip.
See above.
x86-based (also called IA-32) architectures, as discussed on page J-45 of Appendix J of the textbook, are extended accumulator instruction set machines. The original 8080 was a straight accumulator instruction set, but over the years many registers have been added. Unfortunately, they are not completely interchangeable; some instructions can use operands in only some of the registers.
int b(int x)
{
  return x+2;
}

void a(int x)
{
  b(x);
}

main()
{

  a(14);
}
[rdv@localhost architecture]$ gdb stack
GNU gdb Red Hat Linux (6.6-40.fc8rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) start
Breakpoint 1 at 0x80483c3: file stack.c, line 14.
Starting program: /home/rdv/keio/sfc/teaching/architecture/stack
warning: Missing the separate debug info file: /usr/lib/debug/.build-id/ac/2eeb206486bb7315d6ac4cd64de0cb50838ff6.debug
warning: Missing the separate debug info file: /usr/lib/debug/.build-id/ba/4ea1118691c826426e9410cafb798f25cefad5.debug
main () at stack.c:14
14        a(14);
(gdb) p $sp
$1 = (void *) 0xbf9c02a0
(gdb) step
a (x=14) at stack.c:8
8         b(x);
(gdb) p x
$2 = 14
(gdb) p &x
$3 = (int *) 0xbf9c02a0
(gdb) p $sp
$4 = (void *) 0xbf9c0294
(gdb) step
b (x=14) at stack.c:3
3         return x+2;
(gdb) p x
$5 = 14
(gdb) p &x
$6 = (int *) 0xbf9c0294
(gdb)
(gdb) list
3       main()
4       {
5         float a = 1.1, b = 1.2;
6         int i;
7
8         for ( i = 0 ; i < 100000000 ; i++ ) {
9           a = expf(1.0);
10          a*=b;
11          b*=a;
12          a*=b;
(gdb) ptype a
type = float
(gdb) whatis a
type = float
(gdb) x/wt &a
0xbf98f228:     01000000010010010000111111010000
(gdb) set var a=3.1415926
(gdb) x/wt &a
0xbf98f228:     01000000010010010000111111011010
(gdb) p a
$3 = 3.1415925
(gdb)
(gdb) set var b = 3.0000002
(gdb) x/wt &b
0xbf98f22c:     01000000010000000000000000000001
[rdv@dhcp-143-221 architecture]$ cat int.c
main()
{
  int a = 7, b = 2, c = 5, i;

  for ( i = 0 ; i < 100000000 ; i++ ) {
    c += a; a += b; b += c; a += a; a += a;
    b += b; c += a; a += b; b += a; c += a;
    c += a; a += b; b += c; a += a; a += a;
    b += b; c += a; a += b; b += a; c += a;
    a = 7; b = 2; c = 5;
  }
}
[rdv@dhcp-143-221 architecture]$ gcc -o int int.c
[rdv@dhcp-143-221 architecture]$ time ./int

real    0m3.839s
user    0m3.828s
sys     0m0.010s
[rdv@dhcp-143-221 architecture]$ time ./int

real    0m3.859s
user    0m3.846s
sys     0m0.011s
[rdv@dhcp-143-221 architecture]$ time ./int

real    0m3.815s
user    0m3.809s
sys     0m0.003s
The inner loop contains twenty integer additions, and the loop is executed 100,000,000 times, so it runs 2 billion integer additions. 2E9 / 3.8 seconds = roughly 525M integer additions/second. On a 2GHz processor, that is one result every 3.8 clock cycles.
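Here's a quick sketch I added to redo that arithmetic (not part of the original measurement; it just assumes the 2.0 GHz clock and the ~3.8-second runtime measured above):

/* Recompute the integer-add throughput figures above.
   Assumes a 2.0 GHz clock and the ~3.8 s runtime measured with "time ./int". */
#include <stdio.h>

int main(void)
{
    double adds = 20.0 * 100000000.0;   /* 20 additions per iteration, 1e8 iterations */
    double seconds = 3.8;               /* measured runtime */
    double rate = adds / seconds;       /* additions per second */
    printf("%.3g additions/second, %.2f clock cycles per addition\n",
           rate, 2.0e9 / rate);
    return 0;
}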
Similar programs are easily written for the other operations.
[rdv@dhcp-143-221 architecture]$ cat fp.c
#include <math.h>

main()
{
  float a = 1.1, b = 1.2;
  int i;

  for ( i = 0 ; i < 100000000 ; i++ ) {
    a*=b; b*=a; a*=b; b*=a; a*=b;
    b*=a; a*=b; b*=a; a*=b; b*=a;
    a = 1.1; b = 1.2;
  }
}
[rdv@dhcp-143-221 architecture]$ gcc -o fp fp.c
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m3.026s
user    0m3.016s
sys     0m0.009s
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m3.070s
user    0m3.059s
sys     0m0.011s
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m2.938s
user    0m2.936s
sys     0m0.002s
[rdv@dhcp-143-221 architecture]$ time ./fp

real    0m2.917s
user    0m2.907s
sys     0m0.010s
The inner loop is ten floating point multiplies, and the loop is executed 100,000,000 times, so it runs one billion floating point multiplies. One billion FP multiplies in 3 seconds is 3.3E8 FP multiplies/second, or one every 6 clock cycles with a 2.0GHz processor.
However, there are a bunch of data dependencies in that code: if we double the number of FP multiplies while interleaving two independent chains of them, the amount of time less than doubles.
[rdv@dhcp-143-221 architecture]$ cat fp2.c
#include <math.h>

main()
{
  float a = 1.1, b = 1.2, c = 1.3, d = 1.4;
  int i;

  for ( i = 0 ; i < 100000000 ; i++ ) {
    a*=b; c*=d; b*=a; d*=c;
    a*=b; c*=d; b*=a; d*=c;
    a*=b; c*=d; b*=a; d*=c;
    a*=b; c*=d; b*=a; d*=c;
    a*=b; c*=d; b*=a; d*=c;
    a = 1.1; b = 1.2; c = 1.3; d = 1.4;
  }
}
[rdv@dhcp-143-221 architecture]$ gcc -o fp2 fp2.c
[rdv@dhcp-143-221 architecture]$ time ./fp2

real    0m5.073s
user    0m5.068s
sys     0m0.002s
[rdv@dhcp-143-221 architecture]$ time ./fp2

real    0m5.142s
user    0m5.139s
sys     0m0.001s
[rdv@dhcp-143-221 architecture]$ time ./fp2

real    0m5.085s
user    0m5.079s
sys     0m0.002s
In that example, 2 billion FP multiplies are executed in 5 seconds, for 4E8 FP multiplies/second, one every 5 clock cycles.
(gdb) set var a=3.1415926
(gdb) x/wt &a
0xbf98f228:     01000000010010010000111111011010
(gdb) p a
$3 = 3.1415925
The last digit is different. That suggests that there is a problem, but at this point the source of the problem is not identified. It might be in the routine that converts floating point binary to human-readable ASCII.
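One way I like to look at this more closely (a small sketch I added, not part of the original gdb session) is to print the stored float with more digits and dump its bit pattern; on an IEEE 754 machine the closest single-precision value to 3.1415926 is slightly smaller than the literal, which is consistent with what gdb printed:

/* Print the value and bit pattern of the float closest to 3.1415926.
   Illustrative sketch; assumes 32-bit IEEE 754 floats as on this machine. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    float a = 3.1415926f;
    unsigned int bits;
    int i;

    memcpy(&bits, &a, sizeof(bits));   /* the stored bit pattern (what gdb's x/wt &a shows) */
    printf("a    = %.9f\n", a);        /* more digits than the literal supplies */
    printf("bits = ");
    for (i = 31; i >= 0; i--)
        putchar((bits >> i) & 1 ? '1' : '0');
    putchar('\n');
    return 0;
}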
You can see from the code and output below that it is possible for floating point operations to become inaccurate after only a few dozen divisions, or additions of small numbers to large ones.
[rdv@dhcp-143-221 architecture]$ cat double-double-double.cpp
// first, the standard libraries
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <set>
#include <map>
#include <cmath>
#include <iterator>
#include <vector>

#define FALSE 0
#define TRUE 1

main()
{
  float f = 1.0;
  double a = 1.0;
  long double b = 1.0;
  long double c = 1.0;
  int i;
  int reported, reported2;

  std::cout.precision(50);

  std::cout << "float (" << sizeof(f)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    float fprime = f/2;
    // n.b.: where this reports will depend on the architecture, the compiler,
    // and the compiler optimization switches!  Might optimize to 1.0-f^2,
    // in which case the question is, when does f^2 == 0
    // but an expression similar to this likely occurs when manipulating
    // fidelities in a quantum simulation, since the fidelities sum to 1.0.
    if (!reported && (1.0-f)*f == f) {
      std::cout << "After " << i << " iterations, ((1.0-f)*f == f) at f = "
                << f << std::endl;
      reported = TRUE;
    }
    // try f^2 directly.
    // Note that on many machines this will *never* trip because
    // the CPU internal floating format is double, and we'll run out of precision
    // on the divide by two first
    if (!reported2 && f*f == 0) {
      std::cout << "f^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (f != 2*fprime) {
      std::cout << "After " << i << " iterations: " << f << std::endl;
      break;
    }
    f = fprime;
    i++;
  }

  std::cout << "double (" << sizeof(a)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    double aprime = a/2;
    if (!reported && (1.0-a)*a == a) {
      std::cout << "After " << i << " iterations, ((1.0-a)*a == a) at a = "
                << a << std::endl;
      reported = TRUE;
    }
    // try a^2 directly
    if (!reported2 && a*a == 0) {
      std::cout << "a^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (a != 2*aprime) {
      std::cout << "After " << i << " iterations: " << a << std::endl;
      break;
    }
    a = aprime;
    i++;
  }

  std::cout << "long double (" << sizeof(b)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    long double bprime = b/2;
    if (!reported && (1.0-b)*b == b) {
      std::cout << "After " << i << " iterations, ((1.0-b)*b == b) at b = "
                << b << std::endl;
      reported = TRUE;
    }
    // try b^2 directly
    if (!reported2 && b*b == 0) {
      std::cout << "b^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (b != 2*bprime) {
      std::cout << "After " << i << " iterations: " << b << std::endl;
      std::cout << b << std::endl;
      break;
    }
    b = bprime;
    i++;
  }

  std::cout << "long double (" << sizeof(c)*8 << "bits):" << std::endl;
  i = 0; reported = FALSE; reported2 = FALSE;
  while (1) {
    long double cprime = c/2;
    if (!reported && (1.0-c)*c == c) {
      std::cout << "After " << i << " iterations, ((1.0-c)*c == c) at c = "
                << c << std::endl;
      reported = TRUE;
    }
    // try c^2 directly
    if (!reported2 && c*c == 0) {
      std::cout << "c^2 becomes zero at " << i << " iterations" << std::endl;
      reported2 = TRUE;
    }
    if (c != 2*cprime) {
      std::cout << "After " << i << " iterations: " << c << std::endl;
      std::cout << c << std::endl;
      break;
    }
    c = cprime;
    i++;
  }
}
[rdv@dhcp-143-221 architecture]$ g++ -o double-double-double double-double-double.cpp
[rdv@dhcp-143-221 architecture]$ ./double-double-double
float (32bits):
After 65 iterations, ((1.0-f)*f == f) at f = 2.710505431213761085018632002174854278564453125e-20
After 149 iterations: 1.4012984643248170709237295832899161312802619418765e-45
double (64bits):
After 65 iterations, ((1.0-a)*a == a) at a = 2.710505431213761085018632002174854278564453125e-20
After 1074 iterations: 4.9406564584124654417656879286822137236505980261432e-324
long double (96bits):
After 65 iterations, ((1.0-b)*b == b) at b = 2.710505431213761085018632002174854278564453125e-20
b^2 becomes zero at 8223 iterations
After 16445 iterations: 3.6451995318824746025284059336194198163990508156936e-4951
3.6451995318824746025284059336194198163990508156936e-4951
long double (96bits):
After 65 iterations, ((1.0-c)*c == c) at c = 2.710505431213761085018632002174854278564453125e-20
c^2 becomes zero at 8223 iterations
After 16445 iterations: 3.6451995318824746025284059336194198163990508156936e-4951
3.6451995318824746025284059336194198163990508156936e-4951
First program:

10 instructions
no hazards
time = 10 clock cycles + 4 for pipeline delay = 14 clock cycles
CPI = 14/10 = 1.4
Second program:
The hazards have to be worked out carefully, but the normal delay caused by the data hazard is two clock cycles.
10 instructions
one data hazard, instruction 0x814, +2 clock cycles
time = 14 + 2 = 16 clock cycles
CPI = 16/10 = 1.6
Third program:
As with data hazards, the normal delay caused by the control hazard is two clock cycles.
n.b.: There is a bug in my program! The loop is seven instructions long (from the MUL at 0x814 to the BEQZ at 0x82C). It was supposed to be executed three times (with R1=2, 1, 0) but the instruction should have been a BNEZ instead of BEQZ. If you used either, you will get full credit.
Taking the branch:
5 + 3*7 + 2 = 28 instructions
one control hazard, instruction 0x82C, +2 clock cycles per taken loop (twice!)
time = 28 instructions issued + 2 stalls * 2 cycles + 4 for pipeline latency = 36 clock cycles
CPI = 36/28 = 1.29
Not taking the branch:
5 + 1*7 + 2 = 14 instructions
no control hazards
time = 14 instructions issued + 0 stalls * 2 cycles + 4 for pipeline latency = 18 clock cycles
CPI = 18/14 = 1.29
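The same bookkeeping can be written as a tiny calculation. This is just a sketch I added of the formula used above (instructions issued + stall cycles + 4 cycles to fill the five-stage pipeline), not anything from the original exam:

/* CPI bookkeeping for a 5-stage pipeline:
   cycles = instructions issued + stall cycles + 4 cycles of pipeline latency. */
#include <stdio.h>

static void report(const char *name, int insns, int stall_cycles)
{
    int cycles = insns + stall_cycles + 4;
    printf("%s: %d instructions, %d cycles, CPI = %.2f\n",
           name, insns, cycles, (double)cycles / insns);
}

int main(void)
{
    report("first program",            10, 0);  /* no hazards */
    report("second program",           10, 2);  /* one data hazard, +2 cycles */
    report("third (branch taken)",     28, 4);  /* control hazard twice, +2 cycles each */
    report("third (branch not taken)", 14, 0);  /* no control hazards */
    return 0;
}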
0x1000 = 0b 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000
  tag (25 bits):          0b 0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __00 0000 = 0x0

0x1004 = 0b 0000 0000 0000 0000 0000 0000 0001 0000 0000 0100
  tag (25 bits):          0b 0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __00 0100 = 0x04

0x1044 = 0b 0000 0000 0000 0000 0000 0000 0001 0000 0100 0100
  tag (25 bits):          0b 0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 0000 01__ ____ = 0x41
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __00 0100 = 0x04

0x1049 = 0b 0000 0000 0000 0000 0000 0000 0001 0000 0100 1001
  tag (25 bits):          0b 0000 0000 0000 0000 0000 0000 0___ ____ ____ ____ = 0x0
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 0000 01__ ____ = 0x41
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __00 1001 = 0x09

0x81231000 = 0b 0000 0000 1000 0001 0010 0011 0001 0000 0000 0000
  tag (25 bits):          0b 0000 0000 1000 0001 0010 0011 0___ ____ ____ ____ = 0x10246
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __00 0000 = 0x0

0xA1239004 = 0b 0000 0000 1010 0001 0010 0011 1001 0000 0000 0100
  tag (25 bits):          0b 0000 0000 1010 0001 0010 0011 1___ ____ ____ ____ = 0x14247
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 0000 00__ ____ = 0x40
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __00 0100 = 0x04
0xA1239913 = 0b 0000 0000 1010 0001 0010 0011 1001 1001 0001 0011
  tag (25 bits):          0b 0000 0000 1010 0001 0010 0011 1___ ____ ____ ____ = 0x14247
  index (9 bits):         0b ____ ____ ____ ____ ____ ____ _001 1001 00__ ____ = 0x64
  block offset (6 bits):  0b ____ ____ ____ ____ ____ ____ ____ ____ __01 0011 = 0x13
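The breakdown above can also be checked mechanically. Here is a small sketch I added (not part of the original answer) that splits each address using the 6-bit block offset and 9-bit index used above:

/* Split addresses into tag / index / block offset for a cache with a
   64-byte block (6 offset bits) and a 9-bit index. */
#include <stdio.h>

#define OFFSET_BITS 6
#define INDEX_BITS  9

int main(void)
{
    unsigned long long addrs[] = { 0x1000, 0x1004, 0x1044, 0x1049,
                                   0x81231000ULL, 0xA1239004ULL, 0xA1239913ULL };
    int i;

    for (i = 0; i < (int)(sizeof(addrs)/sizeof(addrs[0])); i++) {
        unsigned long long a = addrs[i];
        unsigned long long offset = a & ((1ULL << OFFSET_BITS) - 1);
        unsigned long long index  = (a >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
        unsigned long long tag    = a >> (OFFSET_BITS + INDEX_BITS);
        printf("0x%llx: tag 0x%llx, index 0x%llx, offset 0x%llx\n",
               a, tag, index, offset);
    }
    return 0;
}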
gnuplot> load "avg-mem-access-time.gp"
gnuplot>
[1]+  Stopped                 gnuplot
[rdv@new-host architecture]$ cat avg-mem-access-time.gp
reset
clear
set xlabel "Cache Hit Rate"
set ylabel "nsec"
set title "Average Memory Access Time"
set term post eps mono "Helvetica" 24
set out "avg-mem-access-time.eps"
plot [0:1.0] 1+(1-x)*49 notitle
Avg. memory access time = Hit time + Miss rate × Miss penalty

Avg. memory access time = (L1 hit rate) × (L1 hit time) + (L2 hit rate) × (L2 hit time) + (Miss rate) × (main memory latency)
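To put numbers on those formulas, here is a small sketch I added: the 1 ns hit time and 49 ns miss penalty come from the gnuplot expression above (1+(1-x)*49), while the L2 hit rate/time and memory latency in the two-level calculation are made-up illustrative values, not from the problem.

/* Average memory access time, one-level and two-level versions of the
   formulas above. The L2 and main memory numbers are assumed placeholders. */
#include <stdio.h>

int main(void)
{
    double hit_rate  = 0.95;                  /* example L1 hit rate */
    double one_level = 1.0 + (1.0 - hit_rate) * 49.0;

    double l1_rate = 0.95, l1_time = 1.0;     /* ns */
    double l2_rate = 0.04, l2_time = 10.0;    /* ns (assumed) */
    double miss_rate = 1.0 - l1_rate - l2_rate;
    double mem_latency = 50.0;                /* ns (assumed) */
    double two_level = l1_rate * l1_time + l2_rate * l2_time
                     + miss_rate * mem_latency;

    printf("one level:  %.2f ns\n", one_level);
    printf("two levels: %.2f ns\n", two_level);
    return 0;
}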
Although I recommended a log-log plot, on second thought, it looks better on a linear plot.
If there is only one processor, 1001 units of time will be required. With more than one, 1 + 1000/x time units are required.

gnuplot> plot [1:1000] 1001/(1+1000/x)
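To put a few numbers on that curve, here is a short sketch I added using only the 1001/(1+1000/x) expression above:

/* Speedup for the expression above: serial time 1001 units,
   parallel time 1 + 1000/x with x processors. */
#include <stdio.h>

int main(void)
{
    int procs[] = { 1, 10, 100, 1000 };
    int i;

    for (i = 0; i < 4; i++) {
        double x = procs[i];
        double time = 1.0 + 1000.0 / x;
        printf("x = %4d: time = %8.3f, speedup = %7.2f\n",
               procs[i], time, 1001.0 / time);
    }
    return 0;
}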
There are 2^20 columns in the matrix, and 2^16 nodes, so each node holds 2^(20-16) = 2^4 = 16 columns, or a 16x1,048,576 chunk of the matrix.
To transpose the matrix, each node must send all of the data in its columns except the block that stays on that node. One 16x16-element block (256 elements at 16 bytes each = 4096 bytes) of the matrix stays in the node, but the rest of the data, 256MB - 4KB, must be sent to other nodes.
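As a quick check of those numbers, here is a sketch I added; the 16-byte element size is my inference from the 16x16 block coming to 4096 bytes:

/* Data volumes for the transpose: each node holds 16 columns of 2^20
   elements; the 16-byte element size is inferred from 16x16 = 4096 bytes. */
#include <stdio.h>

int main(void)
{
    long long elem_bytes    = 16;              /* bytes per element (inferred) */
    long long rows          = 1LL << 20;       /* 2^20 rows */
    long long cols_per_node = 16;              /* 2^20 columns / 2^16 nodes */
    long long per_node = elem_bytes * rows * cols_per_node;
    long long stays    = 16 * 16 * elem_bytes; /* the 16x16 block that stays put */

    printf("per node: %lld bytes (%lld MB)\n", per_node, per_node >> 20);
    printf("stays:    %lld bytes, to send: %lld bytes\n", stays, per_node - stays);
    return 0;
}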
octave:4> 256*1024*1024*8
ans = 2.1475e+09

To send that much data along one link would require 2.1475 seconds. Fortunately, since we are sending a message to every node in the system, each message will be routed separately, and we can use all six links from each node.
octave:6> 256*1024*1024*8/1000000000/6
ans = 0.35791

Unfortunately, we are not talking to our nearest neighbors. Each message must cross an average of 32 links, as noted above. Thus, our actual final time is

octave:7> 256*1024*1024*8/1000000000/6*32
ans = 11.453

about 11.5 seconds.
octave:8> (256*1024*1024*8/1000000000+0.65535)/6*32 ans = 14.948
12.5GFLOPS / 3.13 GHz = 4 floating point ops per clock cycle! How can that be? Can it really be more than one operation per clock cycle? Looking at the diagram of the chip, we see
The polygon with the "X" in it is a floating point multiplier, and the one with a "+" in it is a floating point adder. Note that there are two sets of those. Thus, to reach four FP ops per clock cycle, all four of those units must produce a result every clock cycle!