It's pretty obvious that that one means "there's a C++ compiler for it".
WTF is a "C++ programmable processor"? That's just 3 buzzwords in a nonsense sentence. It gives people who don't know any better a big boner though.
For example, your GeForce 3, your Wii's GPU, or your Ageia PhysX accelerator board aren't C++ programmable... They're too limited for general purpose work.
Even modern GPUs are pretty bad at running C++, because its "abstract machine" doesn't map well to the hardware.
To be fair, if you just take your C++ game and run it on one of these, the performance will be horrible at first, because ideally you want as many of your memory accesses as possible to be local to the current node, and your code at the moment probably doesn't fit in 2MiB of space. It will work, because you can still address off-node memory, it will just be slow.
These kinds of designs are usually used to run "jobs", where your engine puts work into a queue and checks its status later. Each "job" of work should know in advance which areas of memory it will need to access, so that at the beginning of the job those regions can all be DMA'ed to the target node, and upon completion the results can be DMA'ed back to host memory (and while the job is running, all memory accesses are on the node, equivalent to never getting an L1 cache miss -- insanely good performance!!).
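To make that concrete, a job in such a system might look something like this. All the names here are mine, not from any real SDK, and the memcpy calls just stand in for the actual DMA transfers:

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical job descriptor: it declares up-front which host-memory
// regions it reads and writes, so the runtime can DMA them to/from
// node-local memory around the kernel invocation.
struct Job {
    const void* input;       // host-side region the job reads
    std::size_t inputSize;
    void*       output;      // host-side region the results go back to
    std::size_t outputSize;
    void (*kernel)(void* local, std::size_t size);  // runs on local data only
};

// Stand-in for one node's local store (e.g. 32KiB on the smaller parts).
alignas(16) static unsigned char localStore[32 * 1024];

void runJob(const Job& job) {
    std::memcpy(localStore, job.input, job.inputSize);    // "DMA in"
    job.kernel(localStore, job.inputSize);                // all accesses local
    std::memcpy(job.output, localStore, job.outputSize);  // "DMA out"
}
```

The point is that the kernel never touches host memory directly -- everything it reads or writes is in the local store by the time it runs.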
The performance of this design also depends on whether each node has its own DMA controller. Ideally, you want to split a node's memory in half and restrict jobs to that limit. Then, while a job is running, the DMA controller can be bringing the data for the next job into the other half, so there's no downtime at all between jobs, and the CPU never, ever, has to stall on memory. If there isn't a DMA controller that can be used like this, then you'll end up stalling between jobs as the node's data is prepared. Ideally, their "multicore framework" would handle all the job/DMA implementation details for you, but also give you the ability to bypass it and do the raw work yourself if desired.
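A sketch of that double-buffered job loop, again with made-up names, and with the DMA simulated synchronously on the host (so nothing actually overlaps here, but the structure is the same as it would be with a real async DMA engine):

```cpp
#include <cstring>
#include <cstddef>

constexpr std::size_t kLocalSize = 32 * 1024;
constexpr std::size_t kHalf = kLocalSize / 2;  // jobs limited to half the store
alignas(16) static unsigned char localStore[kLocalSize];

struct JobData { const void* src; void* dst; std::size_t size; };  // hypothetical

// On real hardware these would kick off / wait on the node's DMA engine;
// here they're synchronous stand-ins.
void startDmaIn(const JobData& j, unsigned char* half) { std::memcpy(half, j.src, j.size); }
void waitDma() {}

// Example kernel: increment every byte of the job's data in place.
void process(unsigned char* data, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) data[i] += 1;
}

void runQueue(const JobData* jobs, std::size_t count) {
    if (count == 0) return;
    unsigned char* halves[2] = { localStore, localStore + kHalf };
    startDmaIn(jobs[0], halves[0]);                        // prefetch first job
    for (std::size_t i = 0; i < count; ++i) {
        waitDma();                                         // job i's data has arrived
        if (i + 1 < count)                                 // overlap next transfer...
            startDmaIn(jobs[i + 1], halves[(i + 1) % 2]);
        process(halves[i % 2], jobs[i].size);              // ...with this job's compute
        std::memcpy(jobs[i].dst, halves[i % 2], jobs[i].size);  // "DMA out" results
    }
}
```

With a real async DMA controller, the startDmaIn for job i+1 runs in the background while process() chews on job i, which is where the "no stalls between jobs" claim comes from.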
If you've been writing an engine for the PS3, this won't be a problem for you... but if you've been writing for x86, then transitioning to this model will be a lot of work.
n.b. even though on x86 you don't have to use this kind of "job" design where you're explicit about memory regions, if you do design your engine this way you can really cut down on your L2 cache-miss counts and boost your performance a lot. It's simply a good way of programming.
It's actually got more per-node memory -- the SPUs only have 256KiB, and if you're running a job management system then you've necessarily got less than that.
Hodgman has already compared it to the PS3's SPUs, and it's basically that -- except it's a scalar unit rather than a vector one, and has less memory.
Wait, I just re-checked their site, and the comparison grid shows 2MiB, but this page says up to 128KiB, and this page says 1MiB? Are these numbers from different products?
Yeah, their literature is less than clear sometimes unless you wade through it. The 2MiB number refers to the sum of all core memories (32KiB/core) in the 64-core model. The 16-core model on the Kickstarter boards also has 32KiB per core. They plan for a 64-core/128KiB model, followed by a 1024-core/128KiB model, and a 1024-core/1MiB model after that.
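If it helps, the seemingly contradictory numbers all line up once you multiply core count by per-core memory:

```cpp
#include <cstddef>

// Total on-chip memory = core count x per-core local store,
// for each of the configurations mentioned above.
constexpr std::size_t KiB = 1024, MiB = 1024 * KiB;
static_assert(64   * 32  * KiB == 2   * MiB, "64 cores x 32KiB    = 2MiB total");
static_assert(16   * 32  * KiB == 512 * KiB, "16 cores x 32KiB    = 512KiB total");
static_assert(64   * 128 * KiB == 8   * MiB, "64 cores x 128KiB   = 8MiB total");
static_assert(1024 * 128 * KiB == 128 * MiB, "1024 cores x 128KiB = 128MiB total");
```

So the comparison grid's 2MiB is the chip total, while the 128KiB and 1MiB figures are per-core numbers for the planned parts.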
IIRC, the architecture reserves address space for up to 1MiB per core, and it's meant to scale up to 4096 cores ultimately. Each core can also access the memory of other nodes, since they all share one address space; it'll just be slower. I'm not sure whether the additional latency of doing so is constant or per-hop, but I believe the topology is that each node connects to its 4 cardinal neighbors.
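Assuming it really is a plain 2D mesh (my reading of it, not something I've verified), the hop count between two nodes is just the Manhattan distance between their grid coordinates, so if latency is per-hop, the worst case grows with the edge length of the grid:

```cpp
#include <cstdlib>

// Hops between two nodes in a 2D mesh where each node links only to its
// 4 cardinal neighbors: the Manhattan distance between grid coordinates.
int meshHops(int x1, int y1, int x2, int y2) {
    return std::abs(x1 - x2) + std::abs(y1 - y2);
}
// e.g. on an 8x8 (64-core) grid, the worst case is corner to corner:
// meshHops(0, 0, 7, 7) == 14 hops.
```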
Another interesting aside is that the floating-point ISA is fairly complete, but the integer ISA is fairly minimalist (no divide, or even multiply, I think). Still, the aggregate integer benchmark scores are competitive with higher-end x86 CPUs.
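If there's really no integer multiply instruction, then the compiler (or you) has to synthesize one out of shifts and adds, something like:

```cpp
#include <cstdint>

// Software multiply via shift-and-add: for each set bit i of b,
// add (a << i) into the result. This is roughly what a compiler's
// runtime library emits on cores without an integer multiplier.
uint32_t softMul(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1) result += a;  // low bit set: add the shifted multiplicand
        a <<= 1;                 // advance to the next bit position
        b >>= 1;
    }
    return result;
}
```

That's a loop of at most 32 iterations, which is why integer-multiply-heavy code would suffer but general integer code can still benchmark fine.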