Keio University
Academic Year 2015, Fall Semester

Computer Architecture

Spring 2016 Course code: 35010 / 2 credits
Category:
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 2, June 9:
Faster!

Outline of This Lecture

Performance Measurement

Last week, we discussed some performance graphs, plotting (a) wall-clock time, (b) speedup, and (c) efficiency versus the number of threads used on a particular problem. This week, each of you will see how to collect that data and how to recreate the graphs. You need R, an account on a Unix-like machine with more than one core and an OpenMP-capable compiler, and the code linked in the Homework section below.

Note on compilers: On the Mac, the compiler Apple supplies is invoked as gcc, but it is really Clang, not the true GNU gcc. The two are mostly compatible, but each has some features the other lacks. In particular, this exercise needs OpenMP, the parallel extensions for C, and Apple's Clang does not provide them. And, of course, on a Windows box, your compiler will likely be different. To get access to a real gcc, you can do one of four things: a) use ccx, the servers provided by the campus; b) use your own Linux or FreeBSD box, or one from your lab; c) install gcc5 (or gcc4) for Mac on your machine; or d) install a Vagrant VirtualBox machine. I have done all of the above. If you want to use gcc-5 on your Mac, remember to change the first line in both Makefiles to:

CC=gcc-5

That should allow you to compile the OpenMP program properly.
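A quick way to confirm that the compiler you ended up with really has OpenMP turned on is to look at the standard `_OPENMP` macro, which an OpenMP-enabled compiler defines as a yyyymm date. The sketch below is not part of the homework tarball; the function names are made up for illustration, and it assumes you compile with GNU gcc's `-fopenmp` flag (whether the course Makefiles already pass that flag is an assumption).

```c
#include <stdio.h>

/* Returns the value of the standard _OPENMP macro (a yyyymm date naming
 * the supported OpenMP specification), or 0 if the compiler was run
 * without OpenMP enabled (for GNU gcc, that means without -fopenmp). */
long openmp_version_macro(void)
{
#ifdef _OPENMP
    return (long)_OPENMP;
#else
    return 0L;
#endif
}

/* Prints a one-line report; returns 1 if OpenMP is enabled, else 0. */
int report_openmp(void)
{
    long v = openmp_version_macro();

    if (v > 0) {
        printf("OpenMP enabled, _OPENMP = %ld\n", v);
        return 1;
    }
    printf("OpenMP NOT enabled; omp pragmas will be ignored\n");
    return 0;
}
```

Call `report_openmp()` from a small `main` and build it with the same `CC` you set in the Makefiles. If it reports that OpenMP is not enabled, the homework program will still compile, but it will run on one thread no matter how many you ask for.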

  1. Get the tarball.
  2. Unpack it using tar xvfz.
  3. Confirm that two of the parameters are reasonable. Look at the top of main.c, where you should see
    #define LOOP 1000
    #define REGISTER_SIZE 20
    
    Edit the file and fix those parameters, if they are different.
  4. Build both the library and the program hw1 using make.
  5. Execute time ./hw1 and record the results in a file that looks like
    1 440 435 0.2
    2 229 451 0.36
    3 161 475 0.42
    4 122 482 0.4
    
  6. You will want to run this with various parameters, including the number of threads, possibly the number of loops, and the register size.
  7. Back on your laptop, in R, using the file fun.R or by hand, you can plot the results using commands like
    > ARMDATA1 <- matrix(scan("armstrong-one-run.dat"),ncol=4,byrow=T)
    Read 48 items
    > plot(ARMDATA1[,1],ARMDATA1[,2])
    > x <- seq(1,16)
    > y = 440/x
    > points(x,y,type="l")
    > help(plot)
    > plot(ARMDATA1[,1],ARMDATA1[,2],log="y")
    > points(x,y,type="l")
    > plot(ARMDATA1[,1],ARMDATA1[,2],log="xy")
    > points(x,y,type="l")
    

Here is my log from creating a Vagrant machine on my laptop on 2016/6/8. It took about ten minutes for the hashicorp/precise64 box to download and install properly, after I already had Vagrant installed and had previously used it.

Van-Meter-Rodneys-MacBook-Pro:computer-architecture-2016 rdv$ vagrant init hashicorp/precise64
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.
Van-Meter-Rodneys-MacBook-Pro:computer-architecture-2016 rdv$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'hashicorp/precise64' could not be found. Attempting to find and install...
    default: Box Provider: virtualbox
    default: Box Version: >= 0
==> default: Loading metadata for box 'hashicorp/precise64'
    default: URL: https://vagrantcloud.com/hashicorp/precise64
==> default: Adding box 'hashicorp/precise64' (v1.1.0) for provider: virtualbox
    default: Downloading: https://atlas.hashicorp.com/hashicorp/boxes/precise64/versions/1.1.0/providers/virtualbox.box
==> default: Successfully added box 'hashicorp/precise64' (v1.1.0) for 'virtualbox'!
==> default: Importing base box 'hashicorp/precise64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'hashicorp/precise64' is up to date...
==> default: Setting the name of the VM: computer-architecture-2016_default_1465374178790_32575
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
    default: The guest additions on this VM do not match the installed version of
    default: VirtualBox! In most cases this is fine, but in rare cases it can
    default: prevent things such as shared folders from working properly. If you see
    default: shared folder errors, please make sure the guest additions within the
    default: virtual machine match the version of VirtualBox you have installed on
    default: your host and reload your VM.
    default: 
    default: Guest Additions Version: 4.2.0
    default: VirtualBox Version: 4.3
==> default: Mounting shared folders...
    default: /vagrant => /Users/rdv/old/rdv/keio/sfc/teaching/rdvteachingweb/computer-architecture/computer-architecture-2016

Here is some data I took in fall 2015:

Van-Meter-Rodneys-MacBook-Pro:vagrant-files rdv$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Connection timeout. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Mounting shared folders...
    default: /vagrant => /Users/rdv/new/quantum/vagrant-files
==> default: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> default: to force provisioning. Provisioners marked to run always will still run.
Van-Meter-Rodneys-MacBook-Pro:vagrant-files rdv$ vagrant ssh
Last login: Mon Oct  5 00:17:27 2015 from 10.0.2.2
-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory
[vagrant@localhost ~]$ more /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 70
model name	: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
stepping	: 1
cpu MHz		: 1997.136
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips	: 3994.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 70
model name	: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
stepping	: 1
cpu MHz		: 1997.136
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips	: 3994.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 70
model name	: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
stepping	: 1
cpu MHz		: 1997.136
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips	: 3994.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 70
model name	: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
stepping	: 1
cpu MHz		: 1997.136
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips	: 3994.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

[vagrant@localhost ~]$ 
[vagrant@localhost ~]$ cd architecture/
[vagrant@localhost architecture]$ ls
architecture-hw1-111007.tgz  data  src
[vagrant@localhost architecture]$ cd src
[vagrant@localhost src]$ ls
AAAREADME.txt  hw1  qulib
[vagrant@localhost src]$ cd hw1/
[vagrant@localhost hw1]$ ls
Makefile  hw1  main.c  main.o
[vagrant@localhost hw1]$ time ./hw1

real	0m7.044s
user	0m27.129s
sys	0m1.013s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=1
[vagrant@localhost hw1]$ time ./hw1

real	0m10.428s
user	0m10.386s
sys	0m0.038s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=2
[vagrant@localhost hw1]$ time ./hw1

real	0m7.412s
user	0m14.370s
sys	0m0.447s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=3
[vagrant@localhost hw1]$ time ./hw1

real	0m6.995s
user	0m20.348s
sys	0m0.625s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=4
[vagrant@localhost hw1]$ time ./hw1

real	0m6.860s
user	0m26.468s
sys	0m0.943s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=5
[vagrant@localhost hw1]$ time ./hw1

real	0m7.361s
user	0m21.475s
sys	0m1.526s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=6
[vagrant@localhost hw1]$ time ./hw1

real	0m7.391s
user	0m21.657s
sys	0m1.494s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=8
[vagrant@localhost hw1]$ time ./hw1

real	0m6.954s
user	0m20.681s
sys	0m1.600s
[vagrant@localhost hw1]$ export OMP_NUM_THREADS=16
[vagrant@localhost hw1]$ time ./hw1

real	0m6.762s
user	0m20.034s
sys	0m2.429s
[vagrant@localhost hw1]$ 

I ran this on a Vagrant VirtualBox running CentOS on my Mac. You can see that the performance improvement was only modest. Why do you think that is?
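To turn the log above into the speedup and efficiency numbers we plotted last week, divide the one-thread wall-clock time by the n-thread time, and then divide that by n. Here is a minimal sketch in C; the function names are mine, and the table simply copies the "real" times from the runs above where OMP_NUM_THREADS was set explicitly.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup relative to the one-thread run: S(n) = T(1) / T(n). */
double speedup(double t1, double tn)
{
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Parallel efficiency: E(n) = S(n) / n. */
double efficiency(double t1, double tn, int nthreads)
{
    assert(nthreads > 0);
    return speedup(t1, tn) / nthreads;
}

/* Prints threads, wall-clock seconds, speedup, and efficiency for the
 * "real" times recorded in the log above. */
void print_table(void)
{
    static const int    threads[] = {1, 2, 3, 4, 5, 6, 8, 16};
    static const double real_s[]  = {10.428, 7.412, 6.995, 6.860,
                                     7.361, 7.391, 6.954, 6.762};
    const int n = (int)(sizeof threads / sizeof threads[0]);
    int i;

    for (i = 0; i < n; i++)
        printf("%2d %7.3f %6.3f %6.3f\n", threads[i], real_s[i],
               speedup(real_s[0], real_s[i]),
               efficiency(real_s[0], real_s[i], threads[i]));
}
```

With four threads, for example, the speedup is 10.428 / 6.860, or about 1.52, and the efficiency is about 0.38.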

Quantitative Principles of Design

Let's talk about Hennessy & Patterson's Five Principles:

  1. Take Advantage of Parallelism
  2. Principle of Locality
  3. Focus on the Common Case
  4. Amdahl's Law
  5. The Processor Performance Equation
I would add to this one imperative: Achieve Balance.

Take Advantage of Parallelism

Parallelism can be found by using multiple processors on different parts of the problem, or multiple functional units (floating point units, disk drives, etc.), or by pipelining: dividing an individual computer instruction into several stages and executing the stages of different instructions at the same time in different parts of the CPU.
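As a small illustration of the first kind of parallelism (multiple processors on different parts of the problem), here is a sketch of an OpenMP loop that sums an array. It is not taken from the homework code, and the function name is made up. Each thread adds up its own share of the iterations, and the `reduction` clause combines the partial sums at the end.

```c
/* Sums a[0..n-1]. With OpenMP enabled, the iterations are divided among
 * the threads and each thread's partial sum is combined by the reduction
 * clause. Without OpenMP the pragma is ignored and the loop runs
 * serially, giving the same result. */
long parallel_sum(const int *a, long n)
{
    long sum = 0;
    long i;   /* the loop variable is private to each thread automatically */

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];

    return sum;
}
```

Because the additions are on integers, the answer does not depend on how many threads run the loop; only the time does.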

Principle of Locality

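Programs tend to reuse data and instructions that they have used recently. There are two forms of locality: temporal (an item used recently is likely to be used again soon) and spatial (items stored near a recently used item are likely to be used soon). Locality is what allows a cache memory to work.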
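Spatial locality is easy to see with a two-dimensional array. The sketch below (names and the grid size are chosen only for illustration) adds up the same array twice. The row-major version walks through memory in order, so consecutive accesses touch adjacent addresses; the column-major version jumps a whole row ahead on every access.

```c
#define GRID_N 512

static int grid[GRID_N][GRID_N];

/* Fills the grid so that both traversals have something to add up. */
void fill_grid(void)
{
    int i, j;

    for (i = 0; i < GRID_N; i++)
        for (j = 0; j < GRID_N; j++)
            grid[i][j] = i + j;
}

/* Row-major traversal: the inner loop moves along a row, so consecutive
 * accesses are adjacent in memory (good spatial locality). */
long sum_row_major(void)
{
    long sum = 0;
    int i, j;

    for (i = 0; i < GRID_N; i++)
        for (j = 0; j < GRID_N; j++)
            sum += grid[i][j];
    return sum;
}

/* Column-major traversal: the inner loop moves down a column, so
 * consecutive accesses are GRID_N * sizeof(int) bytes apart. */
long sum_col_major(void)
{
    long sum = 0;
    int i, j;

    for (j = 0; j < GRID_N; j++)
        for (i = 0; i < GRID_N; i++)
            sum += grid[i][j];
    return sum;
}
```

Both functions return the same sum. Any difference in run time comes from how well each access pattern uses the cache, and how large that difference is depends on your machine, so time them yourself rather than trusting a number from me.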

Focus on the Common Case

The things that are done a lot should be fast; the things that are rare may be slow.

Amdahl's Law

Amdahl's Law tells us how much improvement is possible by making the common case fast, or by parallelizing part of the algorithm. In the example below, 3/5 of the algorithm can be parallelized, meaning that three times as much hardware applied to the problem gains us only a reduction from five time units to three.

Figure: Example of Amdahl's Law, parallel and serial portions.
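In formula form, if a fraction f of the run time is sped up by a factor s and the rest is unchanged, the overall speedup is 1 / ((1 - f) + f / s). A minimal sketch in C, with function names of my own choosing:

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction f of the run time is
 * sped up by a factor s and the remaining (1 - f) is unchanged:
 *     speedup = 1 / ((1 - f) + f / s)                              */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the speedup as s grows without limit: 1 / (1 - f). */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

For the example above, f = 3/5 and s = 3 give a speedup of 5/3, about 1.67, which is the reduction from five time units to three. No matter how much hardware is added, the speedup for f = 3/5 can never exceed 2.5.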

Some problems, most famously graphics, are known as "embarrassingly parallel" problems, in which extracting parallelism is trivial, and performance is primarily determined by input/output bandwidth and the number of processing elements available. More generally, the parallelism achievable is determined by the dependency graph. Creating that graph and scheduling operations to maximize the parallelism and enforce correctness is generally the shared responsibility of the hardware architecture and the compiler.

Figure: Dependency graph for the above figure.

The Processor Performance Equation

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
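The equation above is just a product of three terms, which makes it easy to play with. In this sketch the clock is given as a frequency in Hz, so seconds per clock cycle is 1 / clock_hz. The function name is mine, and the numbers in the usage note are invented for illustration, not measurements.

```c
#include <assert.h>

/* CPU time in seconds =
 *     (instructions / program)
 *   x (clock cycles / instruction)
 *   x (seconds / clock cycle)
 * The clock is passed as a frequency in Hz, so the last term
 * is 1 / clock_hz. */
double cpu_time_seconds(double instructions, double cpi, double clock_hz)
{
    assert(instructions >= 0.0 && cpi > 0.0 && clock_hz > 0.0);
    return instructions * cpi / clock_hz;
}
```

For instance, a hypothetical program of 2 billion instructions averaging 1.5 cycles per instruction on a 2 GHz clock would take `cpu_time_seconds(2e9, 1.5, 2e9)`, which is 1.5 seconds. Halving any one of the three terms halves the CPU time.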

What's in a Computer?

(Here's the fun part...)

Figure: Wikipedia motherboard block diagram.

Homework

The source you need, including script files, is available in a tar file here.

The specification for OpenMP, and a "summary card" for C and C++, are available here. The latest version is 4.5 (released Nov. 2015), but there is a Japanese version of the 3.0 spec available.

This week's homework (submit via SFS, due 6/16):

Note that you should explain your results, not just plot the data and submit! Why do you get the values you get? What have you learned?

Also note that most, but not all, of this is what we did in class.

  1. Replot the data from our simple parallel program. How much faster does it get? Is the speedup the same for all problem sizes?
  2. In architecture/src/qulib/sim.c, you will find the functions cnot() and Hadamard(). In the statement
    /* XXX parallelizing this loop is tricky, but it's a "big" loop, so worth doing... */
    #pragma omp parallel for schedule(static,8192) private(j,k,z)
    

    the number 8192 is the chunk size: the number of consecutive loop iterations over the large array that are handed to a thread at a time. Make that number both smaller and larger in both functions to see the effects. Save these as separate data sets, and plot them all together on one plot.
    1. First, eliminate the "schedule" altogether; try it with
      #pragma omp parallel for private(j,k,z)
      
    2. Next, try it with schedule(static,16).
    3. schedule(static,256).
    4. schedule(static,1024).
    5. schedule(static,4096).
    6. schedule(static,16384).
    7. By now, you should have some idea of what values will work well. Choose the optimal value for the schedule size for this application and machine.
  3. Plot the results with different numbers of threads in a set of graphs like the ones from last week. You can put all of the data on one graph (giving you three graphs, total) or on separate graphs.
  4. Describe what you learned from doing this. What is the meaning of the plots? Do the compiler and runtime system do a better job than you can do by hand? What level of parallelism seems to be achievable for this particular problem? Why? How many threads would you use if you were trying to optimize (minimize) wall clock time? You should fix the labels on the axes on the plots to reflect what you now know about the meaning of the data. Also, don't forget to document your hardware configuration. How many cores do you have, how much memory, how big are the caches?
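For item 2, it may help to see which thread ends up with which iterations. Under schedule(static,chunk), OpenMP cuts the loop into chunks of that many consecutive iterations and deals them out to the threads round-robin in thread-number order. The sketch below only models that assignment; it does not run any threads, and the function names are made up for illustration.

```c
#include <assert.h>

/* Thread number that executes iteration i of a loop under
 * schedule(static, chunk) with nthreads threads: chunks are dealt out
 * round-robin in thread-number order. */
int static_owner(long i, long chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (int)((i / chunk) % nthreads);
}

/* Number of chunks an n-iteration loop is cut into
 * (the last chunk may be shorter than the rest). */
long chunk_count(long n, long chunk)
{
    assert(n >= 0 && chunk > 0);
    return (n + chunk - 1) / chunk;
}
```

As an example only, assume a loop of 2^20 iterations (I am not claiming that is the loop size in sim.c; check the code). With a chunk of 8192 that is 128 chunks, so four threads get 32 chunks each. With a chunk of 16 it is 65536 chunks, and each thread switches to a new region of the array far more often. Think about what that does to locality and to scheduling overhead when you explain your plots.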

Next Lecture

Lecture 3: Processors: Basics of Instruction Sets

Below, P-H is Computer Organization and Design: The Hardware/Software Interface (Japanese edition: コンピュータの構成と設計~ハードウエアとソフトウエアのインタフェース, 3rd edition), and H-P is Computer Architecture: A Quantitative Approach (Japanese edition: コンピュータアーキテクチャ 定量的アプローチ, 4th edition).

Readings for next time:

Additional Information