AFAIK, based on public comment, both new-generation consoles support the equivalent of AMD's HSA -- that is, a single physical address space for CPU and GPU processes. This eliminates a lot of the overhead associated with shuffling less memory- or compute-intensive work over to the GPU from CPU-land.
Which isn't anything new in the console space; on the X360 you could fiddle with GPU-visible memory from the CPU, and on the PS3 the SPUs could touch both system memory and VRAM with ease (and the GPU could pull from system memory, although that was a bit slow). HSA might be Big News in the desktop world, but aside from a bit of fiddling on startup with memory pages/addresses it's pretty old hat on the consoles.
None of which sidesteps the issues I mentioned about going from SPUs to compute:
- a single SPU could chew through plenty of work on its own, but launch a work group on the GPU with fewer than 64 threads and whoops, there goes a ton of ALU time on that CU; and unless your work groups come in multiples of 64 work items (or you have enough work groups in flight on the CU) you can't hide memory access latency...
- which brings me to point 2: SPUs let you issue DMA requests from main to 'local' memory and then work there. The nice thing was that you could kick off a number of these requests up front, do some ALU work, and then wait for the data to turn up before continuing, getting effectively free ALU cycles. I used this to great effect doing SH on the PS3: as soon as possible I'd issue a DMA load for data before doing non-dependent ops, so that by the time I needed the data it was loaded (or nearly loaded).
Which brings up a fun aspect of them: you knew you had X amount of memory to play with and could manage it however you wanted. You could take advantage of memory space wrapping, aliasing buffers over each other (I need this 8K now, but once I'm done I'll need 4K for something else; there's some time in between, so I'll just reuse that chunk) and generally knowing exactly what it is you want to do.
SPUs, while seemingly complicated, were great things, and as much as I'm a GPU compute fan for large parallel workloads, the fact is there are workloads that 'just throwing it at the GPU' doesn't suit, and something which is a halfway house between a CPU and a GPU is perfectly suited to them.