Why would devs be opposed to the PS4's x86 architecture?

Started by
18 comments, last by _the_phantom_ 10 years, 2 months ago



I'm guessing that the awesome PS3 CPU model (traditional CPU core paired with SPUs) was abandoned because we now have compute-shaders, which are almost the same thing.

Except for all the things you lose and the extra 'baggage': you must have 64 units of work in flight or you are wasting ALU, you must put thousands of work units in flight to hide latency (which you can't explicitly cover with ALU ops), and all manner of other things which makes me sad :(
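To make the "64 units of work or you're wasting ALU" point concrete, here's a toy illustration (the 64-wide wavefront matches GCN; the job count is made up, not from the post):

```cpp
// Toy occupancy math: with 64-wide wavefronts, any dispatch that isn't a
// multiple of 64 leaves idle lanes, and you need many wavefronts in flight
// per CU before memory latency can hide behind other wavefronts' ALU work.
#include <cstdio>

int main()
{
    const unsigned wavefront = 64;   // lanes that execute in lockstep (GCN-style)
    const unsigned jobs      = 100;  // work items we actually have

    unsigned wavefronts = (jobs + wavefront - 1) / wavefront;  // 2 wavefronts
    unsigned lanes_used = jobs;                                // 100 useful lanes
    unsigned lanes_paid = wavefronts * wavefront;              // 128 lanes issued

    std::printf("ALU utilisation: %u/%u lanes (%.0f%%)\n",
                lanes_used, lanes_paid, 100.0 * lanes_used / lanes_paid);
    return 0;
}
```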

I didn't say it was the right decision, but I can kind of understand that hypothetical reasoning.

SPUs do well at "compute" workloads, but yes, SPUs also do well at other types of workloads that "compute" hardware doesn't handle too well.
When I read the Cell documentation, I was pretty convinced that it was going to be the future of CPUs... maybe next, next-gen, again?

Anyone care to explain "FLOPS"? Never heard that terminology before.

Basically, math performance: http://en.wikipedia.org/wiki/FLOPS

Floating point operations per second. FLOPS.
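As a rough back-of-the-envelope (my wording, not from the thread), the peak figure vendors quote is usually just:

```latex
\text{peak FLOPS} = \text{cores} \times \text{clock rate} \times \text{FLOPs per core per cycle}
```

So a single 3.2 GHz core with a 4-wide FMA unit gets quoted at 3.2 GHz × 4 × 2 = 25.6 GFLOPS. These are theoretical peaks that real code rarely sustains.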

EDIT: Ninja'd.

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator


X360's x86 architecture could perform 77 GFLOPS, the PS3 could perform 230 GFLOPS.


Just to nitpick, but the Xbox 360 didn't have an x86 processor; it was a PowerPC just like the PS3, but with a different internal architecture. Perhaps you meant that the design of the Xenon CPU is more similar to a PowerPC version of common x86 chips than it is to the distributed design of the Cell CPU?

Well this generation of games consoles isn't about the hardware. It is all about locking players down into the "cloud" and other online consumer services.

Any hardware can be used for this quite frankly.

I wonder if the use of slightly less exotic hardware will encourage or discourage the homebrew scene. After all, I believe the reason we still don't have a playable emulator for the first Xbox is because "it's boring".

I find it most interesting that both heavyweights decided to choose this type of hardware at the same time. It is almost like as soon as they realized it would not be "embarrassing" to do so, they jumped on board the x86 bandwagon immediately.

http://tinyurl.com/shewonyay - Thanks so much for those who voted on my GF's Competition Cosplay Entry for Cosplayzine. She won! I owe you all beers :)

Mutiny - Open-source C++ Unity re-implementation.
Defile of Eden 2 - FreeBSD and OpenBSD binaries of our latest game.

I find it most interesting that both heavyweights decided to choose this type of hardware at the same time. It is almost like as soon as they realized it would not be "embarrassing" to do so, they jumped on board the x86 bandwagon immediately.

Power has lagged in R&D dollars and also lacks an SoC with a good GPU, so both vendors had to rule it out. MIPS also lacked a good GPU solution and doesn't have as large a developer pool. ARM and x86 were the remaining choices, and rumor has it both got to prototype silicon. ARM only released ARMv8-A in 2011, so that might have even been too risky, since the consoles obviously wanted 64-bit chips and that's a pretty tight schedule. But x86 apparently had the better performance on the synthetic tests anyway, so it probably wasn't a hard choice.


Except for all the things you lose and the extra 'baggage': you must have 64 units of work in flight or you are wasting ALU, you must put thousands of work units in flight to hide latency (which you can't explicitly cover with ALU ops), and all manner of other things which makes me sad

AFAIK, based on public comment, both new-generation consoles support the equivalent of AMD's HSA -- that is, a single physical address space for CPU and GPU processes. This eliminates a lot of the overhead associated with shuffling less memory- or compute-intensive work over to the GPU from CPU-land.

In GPGPU on the PC, before HSA, the process goes something like this:
- You have some input data you want to process and a kernel to process it.
- You make an API call that does some CPU-side security/validity checks and then forwards the request to the driver.
- The driver makes a shadow copy of the data into a block of memory assigned to the driver, changes the data alignment to suit the GPU's DMA unit, then queues the DMA request.
- The GPU comes along and processes the DMA request, which copies the data from the shadow buffer into the GPU's physical memory over the (slow) PCIe bus, along with the compute kernel.
- After the results are computed, if the data is needed back in CPU-land, the GPU has to DMA it back over PCIe into the aligned shadow buffer.
- And finally, before the CPU can access the results, the driver has to copy them from the shadow copy in the driver's address space back into the process's address space.
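In OpenCL terms that copy-heavy path looks roughly like this (a minimal sketch: the calls are standard OpenCL 1.x, but the context/queue/kernel setup and all error handling are omitted and assumed to exist):

```cpp
#include <CL/cl.h>
#include <vector>

// Pre-HSA style: stage the data, DMA it across PCIe, run, DMA it back.
void run_kernel_with_copies(cl_context ctx, cl_command_queue queue,
                            cl_kernel kernel, std::vector<float>& data)
{
    cl_int err = CL_SUCCESS;
    const size_t bytes = data.size() * sizeof(float);

    // 1) Allocate a buffer that lives in the GPU's memory.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);

    // 2) The driver stages a copy of the data; the GPU DMAs it over PCIe.
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, data.data(),
                         0, nullptr, nullptr);

    // 3) Launch the compute kernel on the copied data.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = data.size();
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                           0, nullptr, nullptr);

    // 4) DMA the results back across PCIe so the CPU can see them.
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, data.data(),
                        0, nullptr, nullptr);

    clReleaseMemObject(buf);
}
```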

HSA, and the new-generation consoles, are able to skip all the copying, shadow buffers, DMA and PCIe bus stuff entirely. Moving the data is practically free; IIRC it's about equivalent to remapping a memory page in the worst-case scenario.
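And the unified-memory style of doing the same job, again only as a sketch (same assumed setup; whether the map is genuinely zero-copy depends on the platform, but on an HSA-style APU it can amount to little more than a page-table update):

```cpp
#include <CL/cl.h>

// Unified-memory style: the CPU writes the same pages the GPU will read.
void run_kernel_zero_copy(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel, size_t count)
{
    cl_int err = CL_SUCCESS;
    const size_t bytes = count * sizeof(float);

    // Ask the driver for host-visible memory the GPU can also address.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, nullptr, &err);

    // Map it into the CPU's address space and fill it in place (no staging copy).
    float* p = static_cast<float*>(clEnqueueMapBuffer(
        queue, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, nullptr, nullptr, &err));
    for (size_t i = 0; i < count; ++i)
        p[i] = static_cast<float>(i);
    clEnqueueUnmapMemObject(queue, buf, p, 0, nullptr, nullptr);

    // The kernel reads the pages the CPU just wrote.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &count, nullptr,
                           0, nullptr, nullptr);
    clFinish(queue);

    clReleaseMemObject(buf);
}
```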

Having wide execution units running over code that diverges is still a problem; it's always more difficult to keep many threads in sync than fewer, but traditional SIMD ought to be a reasonable substitute for that.
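For what it's worth, the "traditional SIMD" version of a divergence-free loop is just this sort of thing (SSE intrinsics; the scale-and-accumulate loop is only an illustration, not anything from the post):

```cpp
#include <immintrin.h>
#include <cstddef>

// dst[i] += src[i] * k, four floats per instruction, scalar tail at the end.
void scale_accumulate(float* dst, const float* src, float k, std::size_t n)
{
    const __m128 kk = _mm_set1_ps(k);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), kk);
        _mm_storeu_ps(dst + i, _mm_add_ps(_mm_loadu_ps(dst + i), v));
    }
    for (; i < n; ++i)
        dst[i] += src[i] * k;
}
```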

I will say, though, that the trend of 8 "light" Jaguar cores was surprising to me too; I very much expected to see 4 "fat" cores this generation instead. I worry that there will be scenarios that bottleneck on single-threaded performance and will be a challenge to retool in a thread-friendly manner.

throw table_exception("(╯°□°)╯︵ ┻━┻");

AFAIK, based on public comment, both new-generation consoles support the equivalent of AMD's HSA -- that is, a single physical address space for CPU and GPU processes. This eliminates a lot of the overhead associated with shuffling less memory- or compute-intensive work over to the GPU from CPU-land.


Which isn't anything new in the console space; on the X360 you could fiddle with GPU-visible memory from the CPU, and on the PS3 the SPUs could touch both system and VRAM with ease (and the GPU could pull from system memory, although that was a bit slow). HSA might be Big News in the desktop world, but aside from a bit of fiddling on startup with memory pages/addresses it's pretty old hat on consoles.

None of which sidesteps the issues of going from SPUs to compute that I mentioned:
- a single SPU could chew through plenty of work on its own, but launch a work group on the GPU with fewer than 64 threads and, whoops, there goes a ton of ALU time on that CU; and unless you launch groups in multiples of 64 work items (or have enough work groups in flight on the CU) you can't do latency hiding for memory access...
- which brings me to point 2: SPUs let you issue DMA requests from main to 'local' memory and then work there. The nice thing was you could kick off a number of these requests up front, do some ALU work, and then wait until the data turned up, getting effectively free ALU cycles. I've used this to great effect doing SH on the PS3, where I'd issue a DMA load for the data as soon as possible and then do non-dependent ops, so that by the time I needed the data it was loaded (or nearly loaded). A rough sketch of the pattern follows below.
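The shape of that pattern, in rough code (dma_begin/dma_wait here are hypothetical stand-ins for the real MFC tag calls, not an actual API; the point is the ordering, not the names):

```cpp
#include <cstddef>
#include <cstdint>

struct DmaTag { int tag; };

// Hypothetical placeholders for the real DMA primitives (e.g. the MFC get and
// tag-status wait on SPUs). Declared only so the sketch is complete.
DmaTag dma_begin(void* local_dst, std::uint64_t main_mem_src, std::size_t bytes);
void   dma_wait(DmaTag t);
void   do_independent_alu_work();   // work that does not touch the incoming data

void process_chunk(void* local_scratch, std::uint64_t src_ea, std::size_t bytes)
{
    // 1) Kick off the transfer as early as possible...
    DmaTag t = dma_begin(local_scratch, src_ea, bytes);

    // 2) ...then spend the transfer latency on work that doesn't depend on it.
    do_independent_alu_work();

    // 3) By the time we block here, the data is usually already in local store,
    //    so the ALU work above was effectively free.
    dma_wait(t);

    // ... now operate on local_scratch ...
}
```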

Which brings up a fun aspect of them: you knew you had X amount of memory to play with and could manage it how you wanted. You could take advantage of memory-space wrapping, aliasing buffers over each other (I need this 8K now, but once I'm done I'll need 4K for something else, and I've some time in between, so I'll just reuse that chunk) and generally knowing what it is you want to do.

SPUs, while seeming complicated, were great things, and as much as I'm a GPU compute fan for large parallel workloads, the fact is there are workloads that "just throwing it at the GPU" doesn't suit, and something which is a halfway house between a CPU and a GPU is perfectly suited to them.

Imagine you have just spent eight years and many millions of dollars developing a library of code centered around the Cell architecture. Would you be very happy to hear you need to throw it away?

Why would you throw it away? Sure, some parts of it may be SPU-specific, but parallelizable code that works on small chunks of contiguous data is great on almost any architecture. Most studios aren't going to be doing it Naughty Dog style with piles of hand-written, pipelined SPU assembly. Which of course explains why Mark said that it was first parties that were skeptical of x86.


Imagine you have spent eight years hiring people, and your focus has been to include people who deeply understand the "supercomputer on a chip" design that Cell offered, which is most powerful when developers focus on the chip as a master processor with a collection of slave processors, and you now find that all those employees must go back to the x86 model. Would you be happy to hear that those employees will no longer be necessary?



So what, these mighty geniuses of programming are suddenly useless when given 8 x86 cores and a GPU? GPU compute in games is a largely unexplored field, and it's going to require smart people to figure out the best way to make use of it. And engines always need people that can take a pile of spaghetti gameplay code and turn it into something that can run across multiple cores without a billion cache misses.


The parallel architecture is a little trickier to develop for, but considering all the supercomputers built out of old PS3s, it should make you think. Building parallel algorithms where the work is partitioned across processors takes a lot more of the science side of programming, but the result is that you are trading serial time for parallel time and can potentially do significantly more work in the same wall-clock time. While many companies who focused on cross-platform designs took minimal advantage of the hardware, the companies who focused specifically on that system could do far more. Consider that in terms of raw floating point operations per second, X360's x86 architecture could perform 77 GFLOPS, the PS3 could perform 230 GFLOPS. It takes a bit more computer-science application and system-specific coding to take advantage of it, but offering roughly three times the raw processing power is a notable thing.
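For context, both headline figures are theoretical peaks, and they're usually reconstructed roughly as follows (my back-of-the-envelope, assuming the commonly quoted 3.2 GHz clocks and 4-wide FMA units):

```latex
\text{Xenon: } 3\ \text{cores} \times 3.2\,\text{GHz} \times 8\ \text{FLOPs/cycle} \approx 76.8\ \text{GFLOPS}
\text{Cell: } (8\ \text{SPEs} + 1\ \text{PPE}) \times 3.2\,\text{GHz} \times 8\ \text{FLOPs/cycle} \approx 230.4\ \text{GFLOPS}
```

Note that the Cell figure counts all 8 SPEs, even though retail PS3s have one disabled for yield and one reserved for the OS, so the gap available to games was smaller in practice.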

And yet for all of that processing power X360 games regularly outperformed their PS3 versions. What good are oodles of FLOPS if they're not accessible to the average dev team? I for one am glad that Sony decided to take their head out of the sand on this issue, and instead doubled down on making a system that made its power available to developers instead of hiding it away from them.


People who have spent their whole programming lives on the x86 platform don't really notice, and those who stick to single-threaded high-level languages without relying on chip-specific functionality don't really notice, but the x86 family of processors really is awful compared to what else is out there. Yes, they are general purpose and can do a lot, but other designs have a lot to offer.

Consider how x86 does memory access: you request a single byte. That byte might or might not be in the cache, and might require a long time to load. There is no good way to request that a block be fetched or kept resident for frequent access. On many other processors you can map a block of memory for fast access and continuous cache use, then swap it out and replace it when you want to. The x86 family originated in the days of much slower memory, when other systems were not as powerful. On x86, if you want to copy memory you might have a way to do a DMA transfer (tell the memory system to copy directly from one location to another), but in practice that rarely happens; everything goes through the CPU. Compare this with many other systems, where you can copy and transfer memory blocks in the background without the data travelling across the entire motherboard.

The very small number of CPU registers on x86 was often derided: a paltry 8 general-purpose registers (plus 8 128-bit SIMD registers once SSE arrived), until the 64-bit extensions brought it up to a barely respectable 16 64-bit general-purpose registers and 16 128-bit SIMD registers. Competitors in the 32-bit era typically offered 32 32-bit registers, and by the time the 64-bit extensions appeared they offered 32 64-bit and 32 128-bit registers or more, in some cases 128+ 128-bit registers for your processing enjoyment. The 64-bit extensions helped ease a lot of stresses, but at the assembly level the x86 family absolutely shows its age; the core instructions are still built around hardware concepts from the 1970s rather than the physical realities of today's hardware. And on and on and on.
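The closest x86 gets is an advisory prefetch hint, which rather proves the point: it's a suggestion the hardware is free to ignore, not a way to pin a block next to the core (this uses the real _mm_prefetch intrinsic; the 64-byte stride assumes a typical cache-line size):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Hint the caches to pull a block in ahead of use. Nothing guarantees the
// lines arrive or stay resident; it is advice, not a local-store transfer.
void prefetch_block(const char* p, std::size_t bytes)
{
    for (std::size_t off = 0; off < bytes; off += 64)
        _mm_prefetch(p + off, _MM_HINT_T0);
}
```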


Sure, x86 is old and crusty, but that doesn't mean that AMD's x86 chips are necessarily bad because of it. In the context of the PS4's design constraints I can hardly see how it was a bad choice. Their box would be more expensive and would use more power if they'd stuck another Cell-like monster on a separate die instead of going with their SoC solution.

And yet for all of that processing power X360 games regularly outperformed their PS3 versions.


To be fair, the SPUs were often used to make up for the utter garbage that was the PS3's GPU... god, that thing sucked.

This topic is closed to new replies.
