https://en.wikipedia.org/wiki/Fugaku_(supercomputer)
https://www.r-ccs.riken.jp/jp/fugaku
https://www.r-ccs.riken.jp/en/fugaku/project
https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/

Taking a quick look at Fugaku... K was an impressive engineering accomplishment, but the tech seemed conservative: 45nm process, a modest number of cores per node, even for the day. Fugaku is bolder: 7nm process, 48 cores/chip, SIMD vector instructions but no GPU. K was SPARC; Fugaku is ARM.

The processor is the Fujitsu A64FX: https://en.wikipedia.org/wiki/Fujitsu_A64FX
A 2018 talk on the processor: http://www.hotchips.org/hc30/2conf/2.13_Fujitsu_HC30.Fujitsu.Yoshida.rev1.2.pdf (video: https://www.youtube.com/watch?v=Dh5qqL7JDdI)
Looks like it's a seven-stage pipeline? (From the chart at 9:27 in that talk.)

Incredible memory bandwidth: 1024 GB/s v. 64 GB/s. Denser packaging: 384 nodes/rack v. 96. A whole petaFLOPS per rack! Must be 414 racks so far, based on 158,976 / 384?

Interesting chart (16:05 in that talk): the end of Moore's Law seems not to have affected supercomputing as much as one might think; # cores and # nodes go up to compensate for the slowing increase in node speed. *INCREDIBLE* work on interconnects and parallelism in software to stay on this pace.

20:00: Benchmarks so far:

* Top500 LINPACK. https://www.top500.org/ https://www.top500.org/project/linpack/ A carefully constrained benchmark; you can't use shortcuts that you might in an actual application, such as Strassen's method or iterative refinement. Reference source for HPL: http://www.netlib.org/benchmark/hpl/ (HPL uses MPI). Paper describing the benchmark: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.728 It uses LU decomposition with partial pivoting to solve a dense system of linear equations. The algorithm is O(n^3); specifically, 2/3 n^3 + 2 n^2 + O(n). https://en.wikipedia.org/wiki/LU_decomposition (similar to Gaussian elimination) (See the toy sketch below, after this list.)
* Green 500 (first system to top the Green 500 without GPUs?)
* HPCG (High Performance Conjugate Gradient). Much closer to a real application than an artificial benchmark; highly dependent on memory bandwidth.
* Graph 500 (a title K held for a long time); 3 sub-benchmarks. In this benchmark, the graph with 64 mega vertices is the "toy" size! The "huge" one is 4 tera vertices. https://en.wikipedia.org/wiki/Graph500 https://graph500.org/ Fugaku is #1 in the BFS kernel (sub-benchmark): http://graph500.org/?page_id=834 Info on the benchmark: http://graph500.org/?page_id=12 Reference implementation: http://graph500.org/?page_id=47 or https://github.com/graph500/graph500
* HPL-AI: mixed-precision math, but still heavy on FP and matrices. https://icl.bitbucket.io/hpl-ai/ Reference source: https://bitbucket.org/icl/hpl-ai/src/master/

23:50: too cute. Top500 is the 100-meter dash, Graph500 is figure skating, HPCG is the marathon, HPL-AI is chess.

At 24:20, he talks about "new rankings next week", but there is no *subject* in his sentence, so I'm not sure which ones he was talking about. They should have come out last week.

Fugaku uses the Tofu D interconnect. https://en.wikipedia.org/wiki/Torus_fusion https://www.fujitsu.com/global/Images/the-tofu-interconnect-d-for-supercomputer-fugaku.pdf Technically a 6-D torus, but to MPI software it looks like 3-D. One rack is a 2x4x4x2x3x2 or 2x2x8x2x3x2 torus. The max config is 32x32x32x2x3x2 = 393,216 nodes. (Arithmetic check below.)

The system is water cooled, according to the Fujitsu website.
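The toy sketch promised above: since the HPL kernel is essentially "factor and solve," here is a small serial Python version of what the benchmark measures, with scipy's lu_factor/lu_solve standing in for HPL's blocked, MPI-distributed factorization. The scaled-residual formula is quoted from memory of HPL's output, so treat the details as approximate.

    # Toy serial version of the HPL computation: solve A x = b by LU with
    # partial pivoting, then check a scaled residual. scipy stands in for
    # HPL's blocked, MPI-distributed factorization.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    n = 2000                               # toy size; Fugaku's Nmax is ~20 million
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    lu, piv = lu_factor(A)                 # ~(2/3) n^3 flops for the factorization
    x = lu_solve((lu, piv), b)             # ~2 n^2 flops for the triangular solves

    # Scaled residual along the lines of what HPL reports (from memory):
    # ||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * n)
    eps = np.finfo(float).eps
    r = np.linalg.norm(A @ x - b, np.inf)
    scaled = r / (eps * (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
                         + np.linalg.norm(b, np.inf)) * n)
    print(scaled)                          # HPL treats a small O(10) value as a pass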
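And the arithmetic check on the node counts, just confirming the quoted torus dimensions multiply out to the rack and system sizes:

    # Sanity-check the node-count arithmetic from the Tofu D notes above.
    from math import prod

    rack_torus = (2, 4, 4, 2, 3, 2)        # one rack (alternate layout: 2x2x8x2x3x2)
    max_torus  = (32, 32, 32, 2, 3, 2)     # maximum Tofu D configuration
    fugaku_nodes = 158_976                 # current node count

    print(prod(rack_torus))                # 384 nodes per rack
    print(prod(max_torus))                 # 393,216 nodes max
    print(fugaku_nodes / prod(rack_torus)) # 414.0 racks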
https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/
https://www.fujitsu.com/global/products/computing/servers/supercomputer/documents/
This doc is decent: https://www.fujitsu.com/downloads/SUPER/primehpc-fx1000-hard-en.pdf

=====

Scroll back in time to June 2011: https://www.top500.org/lists/top500/2011/06/

K was the top supercomputer then, with more than 2x the number of cores of any other system on the list; rather power-hungry. On the LINPACK benchmark it achieved over 90% of its theoretical maximum, showing how well balanced the overall system was. The #2 system, Tianhe-1A, achieved less than 60% of its theoretical max. The interconnect no doubt made a huge difference. Software may also have figured in, but given the number of excellent software people in China, I would _guess_ that Tianhe-1A was tuned about as well as possible.

K's best run solved a system of 11,870,208 equations (Nmax, in the LINPACK lingo). Compare that to Nmax for Fugaku: 20,459,520.

Fugaku uses up to 28 megawatts of power. The system currently sits at 7,299,072 cores and 4,866,048 GB of HBM (high-bandwidth memory). It is currently achieving only about 80% of its theoretical max. Not clear to me whether further tuning will help, or whether the system balance is what it is. As noted above, I think the node tech is more aggressive in Fugaku than in K, so the interconnect may be the current bottleneck.
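A rough cross-check of the Fugaku numbers above: HPL stores a dense n x n double-precision matrix, i.e. 8*Nmax^2 bytes, so Nmax is essentially bounded by how much of the HBM you are willing to fill. A quick sketch (assuming GB here means 10^9 bytes):

    # Rough check that Fugaku's Nmax is consistent with its HBM capacity.
    nmax = 20_459_520                      # Fugaku's Nmax from the Top500 entry
    hbm_gb = 4_866_048                     # total HBM in GB (assuming 10^9 bytes/GB)

    matrix_gb = 8 * nmax**2 / 1e9          # dense double-precision n x n matrix
    print(matrix_gb)                       # ~3.35 million GB, i.e. about 3.35 PB
    print(matrix_gb / hbm_gb)              # ~0.69: the HPL matrix fills ~69% of memory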