Adapteva IC has 50+ GFLOPS per watt, can graphics benefit from it?

Might be cool for mobile stuff because of the power usage, but come on... 102 GFLOPS compared to desktop GPUs that got into the TFLOPS region years ago.
They claim it's C/C++ programmable (going off month-old memory here). If that's legit, this thing is awesome.

Also, I have one of their Epiphany III boards coming in May next year (just checked; I'd forgotten about that). Ask me then whether it can be programmed with C/C++. (I'm talking about the Epiphany cores, not the ARM cores.)
I say Code! You say Build! Code! Build! Code! Build! Can I get a woop-woop? Woop! Woop!
With a decent C compiler you should be able to program that thing in C easily. Managing multitasking across 64 cores yourself is another matter, though, let alone across their planned 1024-core model...

Not to get carried away, but that list of myths smells of marketing straw-man arguments...
This looks like one of those projects that only aims to separate naive early investors from their cash. Like the Phantom, and that fake super-fast internet demo (which was really faked with a hidden computer under a desk), and countless others.

WTF is a "C++ programmable processor"? That's just 3 buzzwords in a nonsense sentence. It gives people who don't know any better a big boner though.
Any processor is C++ programmable; it just needs someone to write a compiler. Hell, someone wrote a BASIC-to-brainf*** compiler/converter, and I'm pretty sure you could do a C++-to-BF one as well, and that's an 8-instruction interpreted language. There are C-to-Z80 compilers and C++-to-C converters, so it's not exactly far-fetched to say that C++ on the Z80 would be possible.

I can see why you're thinking along the scam lines. I must say, though, I'm still interested in seeing how well it turns out.
Daark,

Marketing is all bullshit -- chances are that if you read the marketing hype for any high-tech device, you'd label it all vaporware. Parallella is already shipping parts into the embedded space. The Kickstarter is to produce a low-cost dev board, with a stretch goal on offer to fund the tape-out of their higher-end 64-core version. I backed it for 2 boards myself, although I was really hoping for a single 64-core board in the event of the stretch goal being hit.

Hodgman has already compared it to the PS3's SPUs, and it's basically that -- except it's a scalar unit rather than a vector unit, and has less memory. It can be programmed akin to a GPU and achieve good performance, but unlike a modern GPU, each compute unit is completely independent (it doesn't share local memory, and it's not part of a vector instruction). In theory, if you had a number of these comparable to the number of compute units in a GPU, normalized for clock speed, it should perform comparably on most scalable GPU workloads (less those that rely on any dedicated hardware the GPU has available), and it should be able to achieve high performance on workloads that are difficult or impossible to map efficiently onto a vector machine (i.e. problems that are big but not necessarily parallel, or parallel workloads in which the vectors diverge).
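
To make the divergence point concrete, here's a minimal C++ sketch (my own illustration, nothing Epiphany-specific) of the kind of data-dependent branching that cripples a vector machine but runs at full speed on independent scalar cores:

// Per-element work with data-dependent branching. On a SIMD/SIMT machine,
// lanes that take different branches serialize: the hardware runs BOTH
// paths with lanes masked off, so a 16-wide unit can degrade toward 1/16th
// throughput. On 16 independent scalar cores, each core simply runs
// whichever path its element needs, at full rate.
static float path_a(float x) { return x * x + 1.0f; }                     // stand-in for a long computation
static float path_b(float x) { return 1.0f / (x * x - 2.0f * x + 2.0f); } // another stand-in (denominator is always >= 1)

void process_all(const float* in, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = (in[i] > 0.0f) ? path_a(in[i]) : path_b(in[i]);
}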

No one's made the claim that this is a better unit for graphics, or even for the types of embarrassingly parallel workloads that map easily to GPUs, but it has some interesting properties, and no solution that I know of delivers that kind of performance at just 2 watts. Creative's Zii processor is similar in concept, but different in structure, and that technology was just bought up by Intel, so I think that says something.

AFAIK, you can program the thing in C, and I believe there's an OpenCL compiler as well (or at least one being worked on). As Hodgman also said, this kind of thing maps well to compute-style languages like OpenCL, DirectCompute, or C++ AMP, but not to OpenMP-style parallel programming.
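
To illustrate the difference in styles (a rough sketch of my own, not anyone's actual API): a compute-style job is explicit about the slice of data it touches, while OpenMP-style code assumes every thread can cheaply poke at one big shared array:

#include <cstddef>

// Compute-style: the kernel only sees the small block that was staged into
// the node's local memory, which is why it maps well to this hardware.
struct Slice { const float* in; float* out; std::size_t count; };

void square_kernel(Slice s)              // hypothetical job entry point
{
    for (std::size_t i = 0; i < s.count; ++i)
        s.out[i] = s.in[i] * s.in[i];
}

// OpenMP-style: threads scatter reads/writes across shared memory. Fine on
// a cache-coherent x86 box, but it assumes any core can touch any address
// cheaply -- exactly what this architecture doesn't give you.
void square_openmp(const float* in, float* out, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}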

throw table_exception("(╯°□°)╯︵ ┻━┻");


Any processor is C++ programmable,
No processors are C++ programmable. That's not a feature of any processor. That's a feature of a compiler program that turns the contents of specially formatted plain text files into machine code.

With this logic, any processor is a 'car run-over-able processor' because it can be thrown in a street.

It's a nonsense sentence to fill space and excite a certain demographic.

"What? This is C++ programmable? My IT dept knows that C++ thing, whatever it is."[/quote]


Marketing is all bullshit
Of course. But there are different kinds, and different levels. Sometimes the marketing bullshit has a real product behind it. Sometimes it's a confidence trick. I didn't say it was an investor scam; however, the style of writing on the sites reeks of it. Maybe it's unintentional.

WTF is a "C++ programmable processor"? That's just 3 buzzwords in a nonsense sentence. It gives people who don't know any better a big boner though.
It's pretty obvious that it means "there's a C++ compiler for it".
For example, your GeForce 3, your Wii's GPU, or your Ageia PhysX accelerator board aren't C++ programmable... they're too limited for general-purpose work.

Even modern GPUs are pretty bad at running C++, because its "abstract machine" doesn't map well to the hardware.

To be fair, if you just take your C++ game and run it on one of these, the performance will be horrible at first, because ideally you want as many of your memory accesses as possible to be local to the current node, and your code at the moment probably doesn't fit in 2MiB of space. It will work, because you can still address off-node memory; it will just be slow.
These kinds of designs are usually used to run "jobs", where your engine puts work into a queue and checks its status later. Each "job" of work should know in advance which areas of memory it will need to access, so that at the beginning of the job they can all be DMA'ed to the target node, and upon completion the results can be DMA'ed back to host memory (and while the job is running, all memory accesses are on the node -- equivalent to never getting an L1 cache miss: insanely good performance!).
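
As a minimal host-side sketch of that pattern (all names hypothetical, simulating the DMA with plain copies rather than anyone's actual SDK):

#include <cstring>
#include <cstddef>

// A job declares up front exactly what it reads and writes, so the runtime
// can DMA inputs into a node's local store before it starts and DMA the
// results back to the host when it finishes.
struct Job {
    const void* input;   std::size_t input_size;
    void*       output;  std::size_t output_size;
    void (*run)(const void* local_in, void* local_out);
};

static char local_store[32 * 1024];      // stand-in for one node's local memory

void submit(const Job& j)
{
    std::memcpy(local_store, j.input, j.input_size);                   // "DMA" in
    j.run(local_store, local_store + j.input_size);                    // all accesses node-local
    std::memcpy(j.output, local_store + j.input_size, j.output_size);  // "DMA" out
}

A real implementation would enqueue the job and return immediately, letting the engine poll a done flag later; the in/run/out shape is the point.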

The performance of this design also depends on whether each node has its own DMA controller. Ideally, you want to split a node's memory in half and restrict jobs to that limit. Then, while a job is running, the DMA controller can be bringing the data for the next job into the other half, so there's no downtime at all between jobs and the CPU never, ever has to stall on memory. If there isn't a memory controller that can be used like this, then you'll end up stalling between jobs while the node's data is prepared. Ideally, their "multicore framework" would handle all the job/DMA implementation details for you, but also give you the ability to bypass it and do the raw work yourself if desired.
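
Assuming the node does have its own DMA engine, the double-buffered loop on the node side would look roughly like this (the dma_* calls are hypothetical stand-ins, stubbed out so the sketch compiles):

enum { HALF = 16 * 1024 };               // half of a 32KiB local store
static char buf[2][HALF];

// Hypothetical async-DMA primitives; on real hardware these would program
// the node's DMA controller and return immediately. Assumes the controller
// serializes transfers that target the same buffer (or dma_wait would also
// have to cover a still-pending write-back).
static void dma_start_read (char* dst) { (void)dst; }  // begin host -> node fill
static void dma_start_write(char* src) { (void)src; }  // begin node -> host drain
static void dma_wait       (char* b)   { (void)b;   }  // block until b's transfer is done
static void run_job        (char* b)   { (void)b;   }  // job touches only this half

void node_main_loop()
{
    int cur = 0;
    dma_start_read(buf[cur]);            // prefetch the first job's data
    for (;;) {
        int nxt = cur ^ 1;
        dma_start_read(buf[nxt]);        // start fetching the NEXT job
        dma_wait(buf[cur]);              // current job's data is now resident
        run_job(buf[cur]);               // executes with zero memory stalls
        dma_start_write(buf[cur]);       // stream results back to the host
        cur = nxt;                       // swap halves -- no downtime between jobs
    }
}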

If you've been writing an engine for the PS3, this won't be a problem for you... but if you've been writing for x86, then transitioning to this model will be a lot of work.

n.b. even though on x86 you don't have to use this kind of "job" design where you're explicit about memory regions, if you do design your engine this way, you can really cut down on your L2 cache-miss counts and boost your performance a lot. It's simply a good way of programming.
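
A trivial example of what that buys you on x86 (generic code, not from any particular engine):

#include <vector>
#include <cstddef>

struct Particle { float px, py, pz, vx, vy, vz; };

// Cache-hostile: chase pointers scattered across the heap, missing L2 on
// nearly every particle.
void update_scattered(const std::vector<Particle*>& parts, float dt)
{
    for (Particle* p : parts) {
        p->px += p->vx * dt; p->py += p->vy * dt; p->pz += p->vz * dt;
    }
}

// "Job" style: operate on a contiguous, explicitly-declared block, so the
// prefetcher streams it and the working set stays resident in cache -- the
// same discipline the Epiphany/SPU model forces on you.
void update_block(Particle* block, std::size_t count, float dt)
{
    for (std::size_t i = 0; i < count; ++i) {
        block[i].px += block[i].vx * dt;
        block[i].py += block[i].vy * dt;
        block[i].pz += block[i].vz * dt;
    }
}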

Hodgman has already compared it to the PS3's SPUs, and it's basically that -- except it's a scalar unit rather than vector, and has less memory.
It's actually got more per-node memory -- the SPUs only have 256KiB, and if you're running a job-management system, then you've necessarily got less than that.

Wait, I just re-checked their site, and the comparison grid shows 2MiB, but this page says up to 128KiB, and this page says 1MiB? Are these numbers from different products?

Wait, I just re-checked their site, and the comparison grid shows 2MiB, but this page says up to 128KiB, and this page says 1MiB? Are these numbers from different products?

Yeah, their literature is less than clear sometimes unless you wade through it. The 2MiB number refers to the sum of all the core memories (32KiB/core) in the 64-core model. The 16-core model on the Kickstarter boards also has 32KiB per core. They plan a 64-core/128KiB-per-core model, followed by a 1024-core/128KiB model, and a 1024-core/1MiB model after that.

IIRC, the architecture allocates up to 1MiB per core, and it's meant to scale up to 4096 cores ultimately. Each core can also access the memory of other nodes, since they share an address space; it'll just be slower. I'm not sure whether the additional latency of doing so is constant or per-hop, but I believe the topology is that each node connects to its 4 cardinal neighbors.
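
My understanding (an assumption on my part -- check their reference manual) is that this shared address space works by aliasing each core's local store at a global address derived from its mesh coordinates, so cross-node access is just pointer arithmetic plus per-hop latency:

#include <cstdint>

// ASSUMPTION: Epiphany-style flat 32-bit addressing in which the upper bits
// encode the core's (row, col) position on the mesh and the low 20 bits
// index into that core's local store (hence the 1MiB-per-core ceiling).
// The exact bit layout here is my guess, not gospel.
static inline std::uint32_t global_addr(std::uint32_t row, std::uint32_t col,
                                        std::uint32_t local_offset)
{
    return (row << 26) | (col << 20) | local_offset;
}

// A core at (r, c) reading its east neighbor's buffer would just load
// through global_addr(r, c + 1, offset) -- the same load instruction, with
// extra latency for each mesh hop the request traverses.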

Another interesting aside is that the floating-point ISA is fairly complete, but the integer ISA is fairly minimalist (no divide, or even multiply, I think). Still, the aggregate integer benchmark scores are competitive with higher-end x86 CPUs.
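
On the minimalist integer ISA: a missing multiply instruction just means the compiler (or a support library) lowers it to shifts and adds; the classic routine, for reference:

#include <cstdint>

// Classic shift-and-add multiply, the kind of routine a compiler emits when
// the ISA has no integer multiply. Loops at most 32 times; each set bit of
// b adds a correspondingly shifted copy of a.
std::uint32_t soft_mul(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t result = 0;
    while (b) {
        if (b & 1)
            result += a;
        a <<= 1;
        b >>= 1;
    }
    return result;
}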

throw table_exception("(╯°□°)╯︵ ┻━┻");

I think a better way to describe it is that, as an architecture, this thing works like a normal processor; it's not a specialized SIMD machine like a graphics processor. From that point of view it is C++ programmable, whereas a GPU will only support HLSL or GLSL or OpenCL or what have you, because it's SIMD rather than a traditional random get/set memory-access architecture.
I say Code! You say Build! Code! Build! Code! Build! Can I get a woop-woop? Woop! Woop!

