A dependency graph captures the dependency relationships among instructions.

(Figure: a dependency graph of instructions)

By analyzing the dependencies of the instructions, the processing unit may be able to re-order or parallelize the instructions to achieve higher performance.
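As a minimal sketch (not tied to any real ISA), a dependency graph can be built by comparing each instruction's read and write register sets; the representation below is an assumption made for illustration:

```python
# Sketch: each instruction is a (writes, reads) pair of register names.
# An edge (i, j) means instruction j depends on instruction i.
def dependency_edges(instructions):
    edges = set()
    for i, (w_i, r_i) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            w_j, r_j = instructions[j]
            # j depends on i if it reads or overwrites something i wrote,
            # or overwrites something i still needs to read.
            if (set(w_i) & set(r_j)) or (set(w_i) & set(w_j)) \
                    or (set(r_i) & set(w_j)):
                edges.add((i, j))
    return edges

# LD r1,[a]; ADD r2,r1,r5; SUB r6,r5,r4
prog = [(("r1",), ()), (("r2",), ("r1", "r5")), (("r6",), ("r5", "r4"))]
print(dependency_edges(prog))  # only the ADD depends on the LD
```

With this graph in hand, any instruction with no incoming edge from an unfinished instruction is free to run.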

(Figure: an example generated by clang 8.0 (x86-64) with -O3 optimization, which compiles an arithmetic calculation into simpler operations)

Out-of-order Execution (OOE)

Earlier models like the Intel 4004 or Motorola 6800 ran instructions one by one. As CPUs developed, it became common to have multiple units in charge of different operations inside the CPU. While one component was running, all the other components simply waited. Instead of sequential execution, pipelining was introduced: the second instruction can start in the next cycle, before the first has finished.

For example, we have the following assembly code.

Inst | Command | Parameters | Note
0    | LD      | r1, [a]    | Load the value at memory [a] into register r1.
1    | ADD     | r2, r1, r5 | Add r1 and r5; store the result in r2.
2    | SUB     | r6, r5, r4 | Subtract r4 from r5; store the result in r6.

Here we assume that the memory load misses the cache, which costs two extra cycles waiting for the memory reply, due to the latency between the CPU and memory. In modern CPUs, loading data from memory can take hundreds of cycles more than a pure register calculation. If we force the CPU to run one instruction per cycle, like:

Order | Command | Parameters | Note
0     | LD      | r1, [a]    | Load the value at memory [a] into register r1.
1     | ADD     | r2, r1, r5 | Add r1 and r5; store the result in r2.
2     | SUB     | r6, r5, r4 | Subtract r4 from r5; store the result in r6.
(Figure: pipeline timing of the forced in-order schedule)

This would cause a serious problem, since there are dependencies between the instructions: when the ADD command runs, the value of r1 has not yet been read correctly. The second instruction depends on the result of the first. To solve the problem, we have to insert a NOP command, as in the following table; this is called pipeline bubbling.

Order | Command | Parameters | Note
0     | LD      | r1, [a]    | Load the value at memory [a] into register r1.
*     | NOP     |            | Wait for the memory read.
1     | ADD     | r2, r1, r5 | Add r1 and r5; store the result in r2.
2     | SUB     | r6, r5, r4 | Subtract r4 from r5; store the result in r6.

Instead of stalling the pipeline, there is a more efficient approach: the third instruction has no dependencies on the other instructions, so its pipeline need not be stalled, and it can run before the ADD command.

Order | Command | Parameters | Note
0     | LD      | r1, [a]    | Load the value at memory [a] into register r1.
1     | SUB     | r6, r5, r4 | Subtract r4 from r5; store the result in r6.
2     | ADD     | r2, r1, r5 | Add r1 and r5; store the result in r2.
(Figure: pipeline timing of the reordered schedule)

In this case, while waiting for the memory load, we execute the SUB instruction first, saving one cycle of CPU time. The order of execution changed, but the result did not. This is called out-of-order execution (OOE).

Out-of-order execution is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.
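The scheduling idea above can be sketched as a toy single-issue loop. The 3-cycle load latency and the tuple encoding of instructions are assumptions chosen for illustration, not a model of any real CPU:

```python
# Toy out-of-order issue: each instruction is (name, dest, sources, latency).
# An instruction may issue once every register it reads has been produced.
def ooe_schedule(instructions):
    ready_at = {}                       # register -> cycle value is available
    issued = [False] * len(instructions)
    order = []
    cycle = 0
    while not all(issued):
        for i, (name, dst, srcs, lat) in enumerate(instructions):
            if issued[i]:
                continue
            if all(ready_at.get(r, 0) <= cycle for r in srcs):
                issued[i] = True
                ready_at[dst] = cycle + lat
                order.append((cycle, name))
                break                   # single-issue: one instruction per cycle
        cycle += 1
    return order

prog = [("LD r1,[a]", "r1", (), 3),          # assumed 3-cycle load latency
        ("ADD r2,r1,r5", "r2", ("r1", "r5"), 1),
        ("SUB r6,r5,r4", "r6", ("r5", "r4"), 1)]
print(ooe_schedule(prog))
# → [(0, 'LD r1,[a]'), (1, 'SUB r6,r5,r4'), (3, 'ADD r2,r1,r5')]
```

The SUB issues in cycle 1 while the load is still outstanding, exactly the reordering shown in the tables above.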

Dependency and Hazard

Hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results.

Hazard

There are three types of hazard.

  • read after write (RAW), a true dependency
  • write after read (WAR), an anti-dependency
  • write after write (WAW), an output dependency
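The three hazard types above can be detected mechanically by comparing the write/read register sets of an earlier and a later instruction. This is a sketch; real pipelines also have to track memory locations and flags:

```python
# earlier and later are (writes, reads) pairs of register-name tuples.
def hazards(earlier, later):
    (w1, r1), (w2, r2) = earlier, later
    found = []
    if set(w1) & set(r2):
        found.append("RAW")   # later reads what earlier writes (true dependency)
    if set(r1) & set(w2):
        found.append("WAR")   # later overwrites what earlier reads (anti-dependency)
    if set(w1) & set(w2):
        found.append("WAW")   # both write the same register (output dependency)
    return found

# ADD r2,r1,r5  then  SUB r1,r3,r4  -> WAR on r1
print(hazards((("r2",), ("r1", "r5")), (("r1",), ("r3", "r4"))))  # ['WAR']
```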

These problems are very common in OOE. The CDC 6600, the first computer with OOE, introduced the scoreboarding algorithm to handle them. In a scoreboard, the data dependencies of every instruction are logged. Instructions are released only when the scoreboard determines that there are no conflicts with previously issued and incomplete instructions. If an instruction is stalled because it is unsafe to continue, the scoreboard monitors the flow of executing instructions until all dependencies have been resolved before the stalled instruction is issued.

In general, only read after write (RAW) requires a stall in the CPU; the WAR and WAW dependencies can be eliminated by register renaming.
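Register renaming can be sketched in a few lines: give every write a fresh physical register, and WAR/WAW conflicts on architectural names disappear, leaving only true dependencies. The `p0, p1, ...` physical-register names are an assumption for illustration:

```python
# instructions: (opcode, dest, sources); returns renamed instructions.
def rename(instructions):
    mapping = {}                          # architectural -> physical name
    fresh = (f"p{k}" for k in range(64))  # assumed pool of physical registers
    out = []
    for op, dst, srcs in instructions:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        mapping[dst] = next(fresh)        # fresh physical register per write
        out.append((op, mapping[dst], new_srcs))
    return out

prog = [("LD", "r1", ()),
        ("ADD", "r2", ("r1", "r5")),
        ("SUB", "r1", ("r3", "r4"))]      # WAR/WAW on r1 vanish after renaming
for inst in rename(prog):
    print(inst)
```

After renaming, the LD writes p0 and the SUB writes p2, so the SUB no longer conflicts with the load and can issue immediately.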

Example

For example, we have the following assembly code.

Inst | Command | Parameters | Note
0    | LD      | r1, [a]    | Load the value at memory [a] into register r1.
1    | ADD     | r2, r1, r5 | Add r1 and r5; store the result in r2.
2    | SUB     | r1, r3, r4 | Subtract r4 from r3; store the result in r1.

Inst 1 depends on the result of Inst 0 (RAW, a true dependency), and Inst 2 writes to r1, which Inst 1 still needs to read (WAR, an anti-dependency). We could solve the problem by adding bubbles before both Inst 1 and Inst 2.

(Figure: pipeline with bubbles inserted for both dependencies)

On the other hand, we could rename the r1 in Inst 0 and Inst 1 to r6, then reorder as before, saving the extra cycle.

Inst | Command | Parameters | Note
0    | LD      | r6, [a]    | Load the value at memory [a] into register r6.
1    | SUB     | r1, r3, r4 | Subtract r4 from r3; store the result in r1.
2    | ADD     | r2, r6, r5 | Add r6 and r5; store the result in r2.
(Figure: pipeline timing after renaming and reordering)

As a result, the code can run in three cycles.

Parallelism

Besides OOE, dependency-graph analysis can also be used for parallelism.

Vectorization

In 1996, Intel introduced the MMX instruction set, the first-generation SIMD (single instruction, multiple data) instruction set for the Intel x86 family. Running the same mathematical instruction on a set of data implies that there are no dependencies between the elements, so the CPU can run the operations in parallel to achieve better speed. Automatic vectorization, which converts loops into SIMD instructions, is a very active area of computer science; in modern compilers like gcc or clang, vectorization is an important optimization feature.
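The reason such loops vectorize can be sketched by simulating a fixed-width SIMD unit in plain code. The 4-lane width is an arbitrary assumption; real MMX/SSE/AVX widths differ:

```python
WIDTH = 4  # assumed number of SIMD lanes

# Element-wise addition has no dependency between iterations, so each
# chunk of WIDTH elements can be handled by one simulated SIMD instruction.
def simd_add(a, b):
    out = []
    for i in range(0, len(a), WIDTH):
        # one simulated SIMD instruction: up to 4 independent adds at once
        out.extend(x + y for x, y in zip(a[i:i + WIDTH], b[i:i + WIDTH]))
    return out

print(simd_add([1, 2, 3, 4, 5, 6, 7, 8],
               [10, 20, 30, 40, 50, 60, 70, 80]))
# → [11, 22, 33, 44, 55, 66, 77, 88]
```

If an iteration instead read a value written by the previous iteration (a loop-carried RAW dependency), this chunking would no longer be valid, which is exactly what an auto-vectorizer has to prove before transforming a loop.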

(Figure: SIMD, one instruction applied to multiple data elements)

Beyond CPUs, a GPU is even more naturally a vector machine. General-purpose computing on graphics processing units (GPGPU) runs such tasks on parallel GPU cores; frameworks like NVIDIA CUDA or OpenCL are designed to help deploy such tasks to GPUs.

Vectorization, including vector machines, GPGPU, and many-core architectures, has been highly popular in supercomputers over the past decade. With proper optimization that decouples instruction dependencies, such systems can handle huge amounts of data far more easily.

VLIW/EPIC

Very long instruction word (VLIW), or explicitly parallel instruction computing (EPIC), is a CPU architecture that parallelizes CPU instructions through compile-time dependency-graph analysis. The most representative models are the Intel Itanium family.

(Figure: Itanium L3)

Unlike OOE, VLIW/EPIC requires static dependency analysis, which is very hard to do well, since the cache behavior of the CPU is unpredictable; serialized instructions also run very slowly on these models. And unlike vectorization instruction sets, VLIW/EPIC tried to solve all computing problems rather than problems in a specific area.

In an interview, Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."

In 2017, Intel finally announced that it was giving up development of Itanium CPUs. This ended VLIW/EPIC development and marked Itanium's end of life, with additional orders accepted until January 2020 and last shipments no later than July 2021.

References

  1. Kukunas, Jim (2015). Power and Performance: Software Analysis and Optimization. Morgan Kaufman. p. 37. ISBN 9780128008140.
  2. Donald E. Knuth, Andrew Binstock (2008) Interview with Donald Knuth https://www.informit.com/articles/article.aspx?p=1193856
  3. "Select Intel Itanium Processors and Intel Scalable Memory Buffer, PCN 116733-00, Product Discontinuance, End of Life" (PDF). Intel. January 30, 2019. https://qdms.intel.com/dm/i.aspx/F65EEA26-13FB-4580-972B-46B75E0AB322/PCN116733-00.pdf
  4. LLVM Auto-Vectorization in LLVM https://llvm.org/docs/Vectorizers.html

Last-modified: 2019-06-13 (Thu) 14:36:32