The Nerfed FPU in PS5’s Zen 2 Cores
Fritzchens Fritz’s awesome die photos revealed that the PlayStation 5’s Zen 2 cores spend less die area on their FPUs than their desktop cousins. Thanks to Brutus, I got to take a closer look at just how AMD got Zen 2’s FPU to drop from 0.91 to 0.59 mm². Brutus specifically acquired and gave me access to AMD’s BC-250. The BC-250 uses a harvested PS5 chip with six enabled Zen 2 cores and a very cut down GPU, and is supposed to be used for crypto mining. That means I get access to the same PS5 style Zen 2 cores, and can target them with microbenchmarks and performance counters.
For this article, I’ll be referring to the BC-250’s Zen 2 cores as PS5 Zen 2 cores. It’s a lot easier to say PS5.
Overview
The FPU on the PS5’s Zen 2 cores occupies the same width along the short side of the core, but looks very compressed on the other axis. There’s a lot less blank area, and less area looks allocated to the execution units on either side too.
Images from Fritzchens Fritz
But AMD didn’t get this area reduction for free. Instead, they cut down the FP pipes and eliminated some duplicate FP/vector execution units. Zen 2 nominally has a quad port FPU, with ports named FP0, FP1, FP2, and FP3. On the PS5, FP3 has been deleted. FP2 is relegated to only handling FP/vector stores, with all of its math execution units removed or moved to FP0/FP1. Long story short, AMD did this:
Figured out through microbenchmarking and performance monitoring events
With those changes, the PS5’s Zen 2 cores effectively have a dual port FPU with the ability to co-issue stores. But the execution units weren’t the only things to shrink in Fritzchens Fritz’s images. The register file in the center is visibly changed as well. While it’s divided into the same number of blocks, the blocks are both closer together and smaller.
Image created by Fritzchens Fritz
However, microbenchmarking speculative FP/vector register file capacity indicates the PS5’s FPU continues to have Zen 2’s full 160 entry register file. Using 256-bit FP adds as filler instructions produces a similar result, so each entry is still 256 bits wide.
I suspect AMD was able to make the register file smaller because they could make do with less register bandwidth after cutting down the execution pipes. AMD noted register file area was more dependent on port count and width than capacity.
While they made that observation for Zen 4’s AVX-512 implementation, it’s likely equally valid for the PS5’s cut down Zen 2 FPU. Many of AMD’s FPU modifications mean the register file can feed the execution pipes with fewer ports.
“And even the physical register file, which can now hold 512-bit registers, had only minimal growth since register file area is mostly limited by the width and number of access ports, and not so much the number of storage cells per entry.”
Kai Troester, on AMD’s Zen 4 core at Hot Chips 2023
Feeding Zen 2’s four execution pipes could require up to 10 inputs, but AMD’s optimization manual says FP3’s source buses are reused to provide a third input for the FMA units on FP0 and FP1. I interpret this to mean Zen 2’s register file only had eight ports to help keep register file area under control. Zen 2 PS5 edition would only need six register read ports to feed its FP pipes, further cutting down register file area.
Pipe | Modification | Result
FP1 | No longer has a FMA (fused multiply add) unit | Can be fed with two register file reads instead of three
FP2 | Only handles FP/vector stores. All math units deleted or moved to other pipes | Can be fed with a single register file read. Doesn’t need a write port because stores don’t generate a result
FP3 | No longer exists | Doesn’t need any read or write ports
My interpretation of Zen 2’s register file port count. Zen 2 shared FP3’s source buses with the two FMA pipes
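As a rough check on that port arithmetic, here is a minimal Python sketch. The per-pipe operand counts are assumptions read off the text above (three sources for an FMA pipe, two for other math pipes, one for the store-only pipe), and the names `ZEN2_PIPES`, `PS5_PIPES`, and `BORROWED_FMA_INPUTS` are made up for illustration. It models the interpretation given here, not anything AMD has documented in this form.

```python
# Source operands each pipe can consume per cycle.
# These counts are assumptions taken from the surrounding text.
ZEN2_PIPES = {"FP0": 3, "FP1": 3, "FP2": 2, "FP3": 2}  # FP0/FP1 carry FMA units
PS5_PIPES = {"FP0": 3, "FP1": 2, "FP2": 1}             # FP1 lost FMA, FP2 is store-only

# On desktop Zen 2, the third FMA input on FP0 and FP1 reuses FP3's two
# source buses, so those two inputs need no read ports of their own.
BORROWED_FMA_INPUTS = 2

def nominal_inputs(pipes):
    """Worst-case number of register inputs if every pipe issued at once."""
    return sum(pipes.values())

zen2_read_ports = nominal_inputs(ZEN2_PIPES) - BORROWED_FMA_INPUTS
ps5_read_ports = nominal_inputs(PS5_PIPES)  # no FP3 left to borrow from

print("Zen 2 nominal inputs:", nominal_inputs(ZEN2_PIPES))
print("Zen 2 read ports:", zen2_read_ports)
print("PS5 Zen 2 read ports:", ps5_read_ports)
```

Under these assumptions the model gives 10 nominal inputs and 8 read ports for desktop Zen 2, and 6 read ports for the PS5 variant.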
The PS5’s cut down FPU would have 192 and 128 bytes per cycle of read and write bandwidth respectively, or 672 GB/s read and 448 GB/s write at 3.5 GHz. For comparison, Zen 2 has 256 and 192 bytes per cycle of read and write bandwidth. At 3.5 GHz, that would be 896 GB/s of read bandwidth, and 672 GB/s of write bandwidth.
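The bandwidth figures are simple products, and a short Python sketch makes the arithmetic explicit. The bytes-per-cycle values and the 3.5 GHz clock come from the paragraph above. The port counts it prints are only what those values imply when divided by a 32-byte (256-bit) entry, and the write port counts in particular are an inference rather than something stated in the text.

```python
CLOCK_GHZ = 3.5          # clock speed used in the text
ENTRY_BYTES = 256 // 8   # one 256-bit register file entry is 32 bytes

# (read bytes/cycle, write bytes/cycle), as quoted in the text.
CONFIGS = {
    "Zen 2": (256, 192),
    "PS5 Zen 2": (192, 128),
}

def gb_per_s(bytes_per_cycle, clock_ghz=CLOCK_GHZ):
    """Bytes per cycle times billions of cycles per second gives GB/s."""
    return bytes_per_cycle * clock_ghz

for name, (read_bpc, write_bpc) in CONFIGS.items():
    # Implied port counts: an inference, assuming every port is 256 bits wide.
    read_ports = read_bpc // ENTRY_BYTES
    write_ports = write_bpc // ENTRY_BYTES
    print(f"{name}: {read_ports} read / {write_ports} write ports implied, "
          f"{gb_per_s(read_bpc):.0f} GB/s read, {gb_per_s(write_bpc):.0f} GB/s write")
```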
AMD also left the scheduler and non-scheduling queue (NSQ) intact. Therefore, the FPU’s all-important latency hiding capabilities remain unchanged. Performance counters (count mask = 4 on the FP pipe assignment event) indicate the scheduler or NSQ can still accept four micro-ops per cycle from the renamer. So, the FP renamer hasn’t been cut down either.
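For readers unfamiliar with count masks, the sketch below shows the general idea in Python. With a nonzero count mask, a performance counter stops adding up raw events and instead counts cycles in which at least that many events occurred. The function name `counter_value` and the per-cycle trace are hypothetical and synthetic, made up purely for illustration. They are not measurements from the BC-250.

```python
def counter_value(events_per_cycle, cmask=0):
    """Model a performance counter with an optional count mask.

    cmask == 0: add up the raw number of events in every cycle.
    cmask > 0:  count one for each cycle with at least `cmask` events.
    """
    if cmask == 0:
        return sum(events_per_cycle)
    return sum(1 for n in events_per_cycle if n >= cmask)

# Synthetic trace: micro-ops accepted by the FPU in each of eight cycles.
# These numbers are invented for the example.
trace = [4, 2, 0, 4, 3, 4, 1, 4]

print("total micro-ops:", counter_value(trace))
print("cycles with >= 4 micro-ops:", counter_value(trace, cmask=4))
```

A nonzero count with the mask set to 4 is what shows that four micro-ops can be accepted in a single cycle. If that path had been narrowed to three, the masked count would stay at zero no matter how busy the FPU was.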
Results from Daniel Lemire’s integer to string conversion benchmark, which takes advantage of new AVX-512 instructions
Zen 4 uses a similar strategy, pairing less execution throughput with full-fat out-of-order structures that hide latency and keep the execution units fed. We saw that strategy play out quite nicely, with Zen 4 getting a very decent performance uplift from AVX-512 despite having similar vector execution throughput to Zen 3.
But how that works out for Zen 2 is an interesting question. Cutting down execution hardware past a certain point can really hurt performance, just as going for a bargain basement GPU can drop you off a performance cliff.
More at the link
Final Words
PS5’s Zen 2 cores represent an early AMD effort to reduce core area. They show that AMD is very capable of customizing their cores to meet customer demands, even if they don’t publicly advertise configuration options as Arm Ltd does. The cut down FPU in Zen 2 reminds me of Cortex A510’s ability to be configured with different FP pipe counts, letting customers make the performance and area tradeoff they want.
From Arm’s Cortex A510 Optimization Guide. The VPU 128-bit pipes are optional
I find myself liking the tradeoff AMD made for the PS5. They cut execution units that were unlikely to help for the PS5’s workloads. At the same time, they maintained the same number of FP register file, scheduler, and non-scheduling queue entries. Execution latencies were also unchanged. A game like CoD Cold War still needs to execute a few billion FPU operations per second. The cut down FPU is more than capable of handling that while its out of order structures absorb any temporary spikes in demand.
I suspect the PS5’s FPU configuration would be adequate even for a lot of consumers. A lot of applications don’t heavily exercise the FPU, and some that do (like SSIM calculation) can get by with minimal performance loss. Some heavier applications like Y-Cruncher do see a larger performance loss, but a 16.4% difference might not always be noticeable.
Images from Fritzchens Fritz. Scaling, labeling, and pixel counting done by Clam
Even though AMD has made millions of chips with nerfed FPUs for Sony, I haven’t heard of them repeating the strategy. I’m guessing that’s because cutting down the FPU doesn’t make enough of a difference to hit extra market segments. A 35% reduction in FPU area by itself is impressive. But the size of a quad core cluster only goes down by 5.8%. That’s not enough to enable a dramatic core count increase or a much smaller and cheaper die.
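The 35% figure can be reproduced from the two FPU areas quoted at the start of the article. The Python sketch below does that, and also totals the saving across a four-core cluster. It assumes all four cores save the same amount, and the constant and function names are made up for illustration.

```python
ZEN2_FPU_MM2 = 0.91    # desktop Zen 2 FPU area, from the article
PS5_FPU_MM2 = 0.59     # PS5 Zen 2 FPU area, from the article
CORES_PER_CLUSTER = 4  # quad core cluster, as in the text

def reduction_percent(before, after):
    """Relative area reduction, in percent."""
    return (before - after) / before * 100

saved_per_core = ZEN2_FPU_MM2 - PS5_FPU_MM2
saved_per_cluster = saved_per_core * CORES_PER_CLUSTER

print(f"FPU area reduction: {reduction_percent(ZEN2_FPU_MM2, PS5_FPU_MM2):.1f}%")
print(f"Saved per core: {saved_per_core:.2f} mm^2")
print(f"Saved per cluster: {saved_per_cluster:.2f} mm^2")
```

Roughly 1.3 mm² across four cores is a small slice of a cluster that also holds the rest of each core plus the L2 and L3 caches, which fits with the modest cluster-level shrink quoted above.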
For Zen 4, AMD settled on a different area reduction strategy. Zen 4c leaves the architecture unchanged and targets lower clock speeds to reduce area. Remarkably, even the FPU and its full width 512-bit register file remain unchanged. Limiting Zen 4c to around 3.6 GHz let AMD use denser 6T SRAM for L1 caches, branch prediction storage, and translation caches. A smaller clock mesh and other optimizations let Zen 4c achieve a 35% area reduction for the entire core, not just the FPU. AMD continued by cutting L3 capacity in half, which made a lot of sense because the L3 takes up more area than the cores themselves on server and desktop Zen 2 compute dies (CCDs).
As a result, AMD was able to pack 16 cores into a die that’s just slightly larger than the standard 8 core Zen 4 CCD. That kind of change can open up new market segments, unlike shrinking Zen 2’s FPU in isolation.