https://en.wikipedia.org/wiki/Fugaku_(supercomputer)
https://www.r-ccs.riken.jp/jp/fugaku
https://www.r-ccs.riken.jp/en/fugaku/project
https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/

Taking a quick look at Fugaku... K was an impressive engineering accomplishment, but the tech seemed conservative: 45nm process, a modest number of cores per node, even for the day. Fugaku is bolder: 7nm process, 48 cores/chip, SIMD vector instructions but no GPU. K was SPARC; Fugaku is ARM.

The processor is the Fujitsu A64FX: https://en.wikipedia.org/wiki/Fujitsu_A64FX
A 2018 talk on the processor: http://www.hotchips.org/hc30/2conf/2.13_Fujitsu_HC30.Fujitsu.Yoshida.rev1.2.pdf (video: https://www.youtube.com/watch?v=Dh5qqL7JDdI)
Looks like it's a seven-stage pipeline? (From the chart at 9:27 in that talk.)

Incredible memory bandwidth: 1024 GB/s v. 64 GB/s. Denser packaging: 384 nodes/rack v. 96. A whole petaFLOPS per rack! Must be 414 racks so far, based on 158,976 / 384?

Interesting chart (16:05 in that talk): the end of Moore's Law seems not to have affected supercomputing as much as one might think; # cores and # nodes go up to compensate for the slowing increase in node speed. *INCREDIBLE* work on interconnects and parallelism in software to stay on this pace.

20:00: Benchmarks so far:

* Top500 LINPACK. https://www.top500.org/ https://www.top500.org/project/linpack/ A carefully constrained benchmark; you can't use shortcuts that you might in an actual application, such as Strassen's method or iterative refinement. Reference source for HPL: http://www.netlib.org/benchmark/hpl/ (HPL uses MPI). Paper describing the benchmark: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.728 It uses LU decomposition with partial pivoting to solve a dense system of linear equations. The algorithm is O(n^3); specifically, 2/3 n^3 + 2 n^2 + O(n). https://en.wikipedia.org/wiki/LU_decomposition (similar to Gaussian elimination) (See the toy sketch below, after this list.)
* Green 500 (first system to top the Green 500 without GPUs?)
* HPCG (High Performance Conjugate Gradient). Much closer to a real application than an artificial benchmark; highly dependent on memory bandwidth.
* Graph 500 (a title K held for a long time); 3 sub-benchmarks. In this benchmark, the graph with 64 mega vertices is the "toy" size! The "huge" one is 4 tera vertices. https://en.wikipedia.org/wiki/Graph500 https://graph500.org/ Fugaku is #1 in the BFS kernel (sub-benchmark): http://graph500.org/?page_id=834 Info on the benchmark: http://graph500.org/?page_id=12 Reference implementation: http://graph500.org/?page_id=47 or https://github.com/graph500/graph500
* HPL-AI: mixed-precision math, but still heavy on FP and matrices. https://icl.bitbucket.io/hpl-ai/ Reference source: https://bitbucket.org/icl/hpl-ai/src/master/

23:50: too cute. Top500 is the 100-meter dash, Graph500 is figure skating, HPCG is the marathon, HPL-AI is chess.

At 24:20, he talks about "new rankings next week", but there is no *subject* in his sentence, so I'm not sure which ones he was talking about. They should have come out last week.

Fugaku uses the Tofu D interconnect. https://en.wikipedia.org/wiki/Torus_fusion https://www.fujitsu.com/global/Images/the-tofu-interconnect-d-for-supercomputer-fugaku.pdf Technically a 6-D torus, but to MPI software it looks like 3-D. One rack is a 2x4x4x2x3x2 or 2x2x8x2x3x2 torus. The max config is 32x32x32x2x3x2 = 393,216 nodes. (Arithmetic check below.)

The system is water cooled, according to the Fujitsu website.
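The toy sketch promised above: since the HPL kernel is essentially "factor and solve," here is a small serial Python version of what the benchmark measures, with scipy's lu_factor/lu_solve standing in for HPL's blocked, MPI-distributed factorization. The scaled-residual formula is quoted from memory of HPL's output, so treat the details as approximate.

    # Toy serial version of the HPL computation: solve A x = b by LU with
    # partial pivoting, then check a scaled residual. scipy stands in for
    # HPL's blocked, MPI-distributed factorization.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    n = 2000                               # toy size; Fugaku's Nmax is ~20 million
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    lu, piv = lu_factor(A)                 # ~(2/3) n^3 flops for the factorization
    x = lu_solve((lu, piv), b)             # ~2 n^2 flops for the triangular solves

    # Scaled residual along the lines of what HPL reports (from memory):
    # ||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * n)
    eps = np.finfo(float).eps
    r = np.linalg.norm(A @ x - b, np.inf)
    scaled = r / (eps * (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
                         + np.linalg.norm(b, np.inf)) * n)
    print(scaled)                          # HPL treats a small O(10) value as a pass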
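And the arithmetic check on the node counts, just confirming the quoted torus dimensions multiply out to the rack and system sizes:

    # Sanity-check the node-count arithmetic from the Tofu D notes above.
    from math import prod

    rack_torus = (2, 4, 4, 2, 3, 2)        # one rack (alternate layout: 2x2x8x2x3x2)
    max_torus  = (32, 32, 32, 2, 3, 2)     # maximum Tofu D configuration
    fugaku_nodes = 158_976                 # current node count

    print(prod(rack_torus))                # 384 nodes per rack
    print(prod(max_torus))                 # 393,216 nodes max
    print(fugaku_nodes / prod(rack_torus)) # 414.0 racks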
https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/
https://www.fujitsu.com/global/products/computing/servers/supercomputer/documents/
This doc is decent: https://www.fujitsu.com/downloads/SUPER/primehpc-fx1000-hard-en.pdf

=====

Scroll back in time to June 2011: https://www.top500.org/lists/top500/2011/06/

K was the top supercomputer then, with more than 2x the number of cores of any other system on the list; rather power-hungry. On the LINPACK benchmark it achieved over 90% of its theoretical maximum, showing how well balanced the overall system was. The #2 system, Tianhe-1A, achieved less than 60% of its theoretical max. The interconnect no doubt made a huge difference. Software may also have figured in, but given the number of excellent software people in China, I would _guess_ that Tianhe-1A was tuned about as well as possible.

K's best run solved a system of 11,870,208 equations (Nmax, in the LINPACK lingo). Compare that to Nmax for Fugaku: 20,459,520.

Fugaku uses up to 28 megawatts of power. The system currently sits at 7,299,072 cores and 4,866,048 GB of HBM (high-bandwidth memory). It is currently achieving only about 80% of its theoretical max. Not clear to me whether further tuning will help, or whether the system balance is what it is. As noted above, I think the node tech is more aggressive in Fugaku than in K, so the interconnect may be the current bottleneck.
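A rough cross-check of the Fugaku numbers above: HPL stores a dense n x n double-precision matrix, i.e. 8*Nmax^2 bytes, so Nmax is essentially bounded by how much of the HBM you are willing to fill. A quick sketch (assuming GB here means 10^9 bytes):

    # Rough check that Fugaku's Nmax is consistent with its HBM capacity.
    nmax = 20_459_520                      # Fugaku's Nmax from the Top500 entry
    hbm_gb = 4_866_048                     # total HBM in GB (assuming 10^9 bytes/GB)

    matrix_gb = 8 * nmax**2 / 1e9          # dense double-precision n x n matrix
    print(matrix_gb)                       # ~3.35 million GB, i.e. about 3.35 PB
    print(matrix_gb / hbm_gb)              # ~0.69: the HPL matrix fills ~69% of memory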