[Chips and Cheese] The nerfed FPU in PS5's Zen 2 cores

anonpuffs

Veteran
29 Nov 2022
8,329
9,561
article link

The Nerfed FPU in PS5’s Zen 2 Cores

Fritzchens Fritz’s awesome die photos revealed that the PlayStation 5’s Zen 2 cores spend less die area on their FPUs than their desktop cousins. Thanks to Brutus, I got to take a closer look at just how AMD got Zen 2’s FPU to drop from 0.91 to 0.59 mm². Brutus specifically acquired and gave me access to AMD’s BC-250. The BC-250 uses a harvested PS5 chip with six enabled Zen 2 cores and a very cut-down GPU, and is supposed to be used for crypto mining. That means I get access to the same PS5-style Zen 2 cores, and can target them with microbenchmarks and performance counters.
For this article, I’ll be referring to the BC-250’s Zen 2 cores as PS5 Zen 2 cores. It’s a lot easier to say PS5.

Overview

The FPU on the PS5’s Zen 2 cores occupies the same width along the short side of the core, but looks very compressed on the other axis. There’s a lot less blank area, and less area looks allocated to the execution units on either side too.
Images from Fritzchens Fritz at and
But AMD didn’t get this area reduction for free. Instead, they cut down the FP pipes and eliminated some duplicate FP/vector execution units. Zen 2 nominally has a quad port FPU, with ports named FP0, FP1, FP2, and FP3. On the PS5, FP3 has been deleted. FP2 is relegated to only handling FP/vector stores, with all of its math execution units removed or moved to FP0/FP1. Long story short, AMD did this:
Figured out through microbenchmarking and performance monitoring events
With those changes, the PS5’s Zen 2 cores effectively have a dual port FPU with the ability to co-issue stores. But the execution units weren’t the only things to shrink in Fritzchens Fritz’s images. The register file in the center is visibly changed as well. While it’s divided into the same number of blocks, the blocks are both closer together and smaller.
Image created by Fritzchens Fritz at
However, microbenchmarking speculative FP/vector register file capacity indicates the PS5’s FPU continues to have Zen 2’s full 160 entry register file. Using 256-bit FP adds as filler instructions produces a similar result, so each entry is still 256 bits wide.

I suspect AMD was able to make the register file smaller because they could make do with less register bandwidth after cutting down the execution pipes. AMD noted register file area was more dependent on port count and width than capacity.
“And even the physical register file, which can now hold 512-bit registers, had only minimal growth since register file area is mostly limited by the width and number of access ports, and not so much the number of storage cells per entry.”
Kai Troester, on AMD’s Zen 4 core at Hot Chips 2023
While they made that observation for Zen 4’s AVX-512 implementation, it’s likely equally valid for the PS5’s cut down Zen 2 FPU. Many of AMD’s FPU modifications mean the register file can feed the execution pipes with fewer ports.
| Pipe | Modification | Result |
|------|--------------|--------|
| FP1 | No longer has a FMA (fused multiply add) unit | Can be fed with two register file reads instead of three |
| FP2 | Only handles FP/vector stores. All math units deleted or moved to other pipes | Can be fed with a single register file read. Doesn’t need a write port because stores don’t generate a result |
| FP3 | No longer exists | Doesn’t need any read or write ports |
Feeding Zen 2’s four execution pipes could require up to 10 inputs, but AMD’s optimization manual says FP3’s source buses are reused to provide a third input for the FMA units on FP0 and FP1. I interpret this to mean Zen 2’s register file only had eight ports to help keep register file area under control. Zen 2 PS5 edition would only need six register read ports to feed its FP pipes, further cutting down register file area.
My interpretation of Zen 2’s register file port count. Zen 2 shared FP3’s source busses with the two FMA pipes
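The port counting above can be checked with a small Python sketch. The per-pipe input counts are an assumption picked to match the article's totals (three source operands for an FMA pipe, two for any other math pipe, one for a store-only pipe); the dictionary and function names are made up for illustration and do not come from AMD documentation.

```python
# Assumed source operands per pipe (hypothetical model, chosen to match
# the article's totals of 10 inputs for desktop Zen 2 and 6 for the PS5).
DESKTOP_ZEN2_INPUTS = {"FP0": 3, "FP1": 3, "FP2": 2, "FP3": 2}
PS5_ZEN2_INPUTS = {"FP0": 3, "FP1": 2, "FP2": 1}


def worst_case_inputs(pipes):
    """Register reads needed if every pipe had all of its own inputs."""
    return sum(pipes.values())


def read_ports_with_borrowing(pipes, donor="FP3"):
    """Read ports needed when third (FMA) inputs reuse the donor pipe's buses.

    Models the optimization manual's statement that FP3's source buses
    provide the third input for the FMA units, so those inputs add no ports.
    """
    third_inputs = sum(max(n - 2, 0) for n in pipes.values())
    if third_inputs > pipes.get(donor, 0):
        raise ValueError("donor pipe has too few source buses to lend")
    return worst_case_inputs(pipes) - third_inputs


print("Desktop Zen 2, no sharing:", worst_case_inputs(DESKTOP_ZEN2_INPUTS))
print("Desktop Zen 2, FP3 buses shared:", read_ports_with_borrowing(DESKTOP_ZEN2_INPUTS))
print("PS5 Zen 2:", worst_case_inputs(PS5_ZEN2_INPUTS))
```

Under those assumptions the desktop core needs 10 inputs without sharing and 8 ports with it, while the PS5 core needs 6 with nothing left to borrow from.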
The PS5’s cut down FPU would have 192 and 128 bytes per cycle of read and write bandwidth respectively, or 672 GB/s read and 448 GB/s write at 3.5 GHz. For comparison, Zen 2 has 256 and 192 bytes per cycle of read and write bandwidth. At 3.5 GHz, that would be 896 GB/s of read bandwidth, and 672 GB/s of write bandwidth.
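For anyone who wants to redo the bandwidth arithmetic, here is a minimal Python sketch. It only multiplies the bytes-per-cycle figures quoted above by the 3.5 GHz clock; the function name is hypothetical, and the 32-byte port width is simply the 256-bit register width mentioned earlier.

```python
def regfile_bandwidth_gb_s(bytes_per_cycle, clock_ghz):
    """Bytes per cycle times cycles per nanosecond gives GB/s."""
    return bytes_per_cycle * clock_ghz


CLOCK_GHZ = 3.5
PORT_BYTES = 256 // 8  # one 256-bit register per port access

# Bytes per cycle as quoted in the article: (read, write)
configs = {
    "PS5 Zen 2": (192, 128),
    "Desktop Zen 2": (256, 192),
}

for name, (read_bpc, write_bpc) in configs.items():
    read_gb = regfile_bandwidth_gb_s(read_bpc, CLOCK_GHZ)
    write_gb = regfile_bandwidth_gb_s(write_bpc, CLOCK_GHZ)
    print(f"{name}: {read_gb:.0f} GB/s read, {write_gb:.0f} GB/s write "
          f"({read_bpc // PORT_BYTES} read accesses per cycle)")
```

The read figures work out to six and eight 32-byte accesses per cycle, which lines up with the six and eight read ports estimated above.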
AMD also left the scheduler and non-scheduling queue (NSQ) intact. Therefore, the FPU’s all-important latency hiding capabilities remain unchanged. Performance counters (count mask = 4 on the FP pipe assignment event) indicate the scheduler or NSQ can still accept four micro-ops per cycle from the renamer. So, the FP renamer hasn’t been cut down either.
Results from Daniel Lemire’s integer to string conversion benchmark, which takes advantage of new AVX-512 instructions
Zen 4 uses a similar strategy, pairing lower execution throughput with full-fat out-of-order structures that hide latency and keep the execution units fed. We saw that strategy play out quite nicely, with Zen 4 getting a very decent performance uplift from AVX-512 despite having similar vector execution throughput to Zen 3.
But how that works out for Zen 2 is an interesting question. Cutting down execution hardware past a certain point can really hurt performance, just as going for a bargain-basement GPU can drop you off a performance cliff.

Final Words

PS5’s Zen 2 cores represent an early AMD effort to reduce core area. They show that AMD is very capable of customizing their cores to meet customer demands, even if they don’t publicly advertise configuration options as Arm Ltd does. The cut down FPU in Zen 2 reminds me of Cortex A510’s ability to be configured with different FP pipe counts, letting customers make the performance and area tradeoff they want.

From Arm’s Cortex A510 Optimization Guide. The VPU 128-bit pipes are optional
I find myself liking the tradeoff AMD made for the PS5. They cut execution units that were unlikely to help for the PS5’s workloads. At the same time, they maintained the same number of FP register file, scheduler, and non-scheduling queue entries. Execution latencies were also unchanged. A game like CoD Cold War still needs to execute a few billion FPU operations per second. The cut down FPU is more than capable of handling that while its out of order structures absorb any temporary spikes in demand.


I suspect the PS5’s FPU configuration would be adequate even for a lot of consumers. A lot of applications don’t heavily exercise the FPU, and some that do (like SSIM calculation) can get by with minimal performance loss. Some heavier applications like Y-Cruncher do see a larger performance loss, but a 16.4% difference might not always be noticeable.

Images from Fritzchens Fritz. Scaling, labeling, and pixel counting done by Clam
Even though AMD has made millions of chips with nerfed FPUs for Sony, I haven’t heard of them repeating the strategy. I’m guessing that’s because cutting down the FPU doesn’t make enough of a difference to hit extra market segments. A 35% reduction in FPU area by itself is impressive. But the size of a quad core cluster only goes down by 5.8%. That’s not enough to enable a dramatic core count increase or a much smaller and cheaper die.
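The area numbers are easy to sanity-check. The Python sketch below recomputes the FPU reduction from the 0.91 and 0.59 mm² figures given earlier. The last step is only a back-of-envelope inference: it assumes the FPU shrink is the sole source of the 5.8% cluster reduction, which the article does not state, so treat the implied cluster area as a rough estimate. All names are illustrative.

```python
def percent_reduction(before, after):
    """Relative shrink from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


DESKTOP_FPU_MM2 = 0.91       # from the article
PS5_FPU_MM2 = 0.59           # from the article
CLUSTER_REDUCTION_PCT = 5.8  # from the article (quad core cluster)
CORES_PER_CLUSTER = 4

fpu_cut_pct = percent_reduction(DESKTOP_FPU_MM2, PS5_FPU_MM2)
saved_per_cluster_mm2 = CORES_PER_CLUSTER * (DESKTOP_FPU_MM2 - PS5_FPU_MM2)

# ASSUMPTION: every bit of the cluster shrink comes from the four FPUs.
implied_cluster_mm2 = saved_per_cluster_mm2 / (CLUSTER_REDUCTION_PCT / 100.0)

print(f"FPU area reduction: {fpu_cut_pct:.1f}%")
print(f"Area saved across four cores: {saved_per_cluster_mm2:.2f} mm^2")
print(f"Implied desktop cluster area: {implied_cluster_mm2:.1f} mm^2")
```

Four FPUs save a bit over 1 mm² in total, which shows why the cluster-level change is so much smaller than the headline 35%.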


For Zen 4, AMD settled on a different area reduction strategy. Zen 4c leaves the architecture unchanged and targets lower clock speeds to reduce area. Remarkably, even the FPU and its full width 512-bit register file remain unchanged. Limiting Zen 4c to around 3.6 GHz let AMD use denser 6T SRAM for L1 caches, branch prediction storage, and translation caches. A smaller clock mesh and other optimizations let Zen 4c achieve a 35% area reduction for the entire core, not just the FPU. AMD continued by cutting L3 capacity in half, which made a lot of sense because the L3 takes up more area than the cores themselves on server and desktop Zen 2 compute dies (CCDs).

As a result, AMD was able to pack 16 cores into a die that’s just slightly larger than the standard 8 core Zen 4 CCD. That kind of change can open up new market segments, unlike shrinking Zen 2’s FPU in isolation.
More at the link
 

historia

Veteran
29 Jun 2023
2,818
2,719
It's kinda impressive since the cuts don't seem to affect gaming performance much if at all
It is actually more beneficial, because more cache is required to feed the larger FPU.

I guess the "experts" were wrong again when they say PS5's CPU was starved on cache. Well those retards using notebook APU that have "similar" config to install Win and test CPU performance with full-ledged CPU. Professionally retarded.

Also, the PS5 doesn't have actual background tasks and processes like a full-fledged PC OS does.
 

lynux3

Newbie
21 Apr 2023
19
21
“Professionally retarded” is the perfect way to say it. It’s like saying you’re out of your element because all you do is visualize it without architecting it.

Polyh3dron

Well-known member
31 Jan 2024
463
364
Yup. I don’t understand how so many people don’t realize that a PC, especially one running the jankfest that is Windows, has a metric shit ton of processing overhead and inefficiency along with all game processing going through their DirectX API layer that makes CPU comparisons between a console and a PC for gaming specific workloads not exactly apples-to-apples.