Adapteva IC has 50+ GFLOPS per watt, can graphics benefit from it?

ynm    172
Hi,

Here is the link. Adapteva claims 70 GFLOPS per watt, better than Nvidia's or AMD's flagship GPUs:
http://www.adapteva.com/white-papers/ten-myths-debunked-by-the-epiphany-iv-64-core-accelerator-chip/

100 GFLOPS under 2 watts. Imagine this IC integrated into a mobile platform; it would have huge compute power. They can produce it at $199, 64 cores, 100 GFLOPS. Will this kind of IC help the graphics field?

Regards

Hodgman    51324
Not to get carried away, but that list of myths smells of marketing straw-man arguments...

That said, it looks cool. The idea of having lots of fast little CPUs, with no hardware caches or global RAM chips, and instead just a small dedicated bank of RAM directly connected to each core, is the same design used by the PS3's SPUs. IMHO, these kinds of designs are the future of parallel computing.
The PS3's GPU is known to be really, really outdated... but PS3 games continue to look pretty decent because programmers are able to implement a lot of graphical techniques on the SPUs. So yes, this kind of processor would be very, very useful.

I would really love it if every PC had one of these in it to accelerate heavy computation... but it's hard to market it. Old/existing software doesn't benefit from it, and new software has to be written specifically for it ([i]and why would I bother writing my software for it, if no customers yet have one[/i]), so there's a real chicken-and-egg problem when trying to turn this into a product.

Bacterius    13165
Yeah, what I would really like to see is a modular accelerator card where you can just plug in more and more execution units as you get them, instead of buying whole new hardware every two/three years, kind of like how memory sticks work. That would be interesting.

Waterlimon    4398
[quote name='Bacterius' timestamp='1353824853' post='5003895']
Yeah, what I would really like to see is a modular accelerator card where you can just plug in more and more execution units as you get them, instead of buying whole new hardware every two/three years, kind of like how memory sticks work. That would be interesting.
[/quote]

Yay for scale model of your minecraft house built using processor- and RAM-cubes!

Cornstalks    7030
[quote name='Hodgman' timestamp='1353822980' post='5003891']
I would really love it if every PC had one of these in it to accelerate heavy computation... but it's hard to market it. Old/existing software doesn't benefit from it, and new software has to be written specifically for it ([i]and why would I bother writing my software for it, if no customers yet have one[/i]), so there's a real chicken-and-egg problem when trying to turn this into a product.
[/quote]
Would something like OpenCL or OpenMP be able to resolve this (the issue of having to write specifically for it)? I haven't used either, so I have no clue.

6677    1054
OpenCL would be able to use that thing pretty nicely, but as far as I can tell it's a standalone device, not a PC expansion. Software already written against OpenCL should at least be able to target it nicely.

TheChubu    9454
[quote name='Waterlimon' timestamp='1353861609' post='5003956']
Yay for scale model of your minecraft house built using processor- and RAM-cubes!
[/quote]I vote for having a voxel world in which each voxel runs on its own CPU :P

Hodgman    51324
[quote name='Cornstalks' timestamp='1353862261' post='5003957']
[quote name='Hodgman' timestamp='1353822980' post='5003891']
I would really love it if every PC had one of these in it to accelerate heavy computation... but it's hard to market it. Old/existing software doesn't benefit from it, and new software has to be written specifically for it ([i]and why would I bother writing my software for it, if no customers yet have one[/i]), so there's a real chicken-and-egg problem when trying to turn this into a product.
[/quote]
Would something like OpenCL or OpenMP be able to resolve this (the issue of having to write specifically for it)? I haven't used either, so I have no clue.
[/quote]"Compute"-type languages would be a better fit than OpenMP, but yes, software written for either of them would have a better chance of being instantly portable. You're still facing the problem that most software isn't written using OpenCL, though ;)

MJP    19786
C'mon guys! With their architecture you get 800MHz * 64 cores which gives you a total of [url="http://74.220.215.219/~adapteva/wp-content/uploads/2012/08/technology_table.png"]51.2 GHz[/url]!!!! That's wayyyyy more than the puny 16 GHz you get from a GPU!

If that doesn't sell you on their tech, then this mind-blowing [url="http://www.youtube.com/watch?feature=player_embedded&v=4sMWbaV1sRQ"]performance benchmark[/url] is sure to do the trick.

alh420    5995
[quote name='MJP' timestamp='1353909475' post='5004119']
C'mon guys! With their architecture you get 800MHz * 64 cores which gives you a total of [url="http://74.220.215.219/~adapteva/wp-content/uploads/2012/08/technology_table.png"]51.2 GHz[/url]!!!! That's wayyyyy more than the puny 16 GHz you get from a GPU!

If that doesn't sell you on their tech, then this mind-blowing [url="http://www.youtube.com/watch?feature=player_embedded&v=4sMWbaV1sRQ"]performance benchmark[/url] is sure to do the trick.
[/quote]

I'd like to see a comparison with a GPU implementation instead of a CPU implementation. Of course it's a lot faster than the CPU. Though it was actually less lightning-fast than I'd expected. But then again, just 2W :)
I don't doubt their tech is cool and useful, but it's rarely fair to compare cores and "GHz" across different architecture types.

powly k    657
Might be cool for mobile stuff because of the power usage, but come on... 102GFLOPS compared to desktop GPUs that got into the TFLOPS region years ago.

Kyall    287
They claim it's C/C++ programmable (going off month-old memory here). If that's legit, this thing is awesome.

Also, I have one of their Epiphany III boards coming in May next year, just checked. Had forgotten about that. Ask me then whether it can be programmed with C/C++. (I'm talking about the Epiphany cores, not the ARM cores.)

6677    1054
With a decent C compiler you should be able to program that thing in C easily. Although managing multitasking across 64 cores yourself, let alone their planned 1024-core model...
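For what it's worth, the simplest way to manage that many cores yourself is a static partition, where each core owns a fixed slice of the data. A minimal sketch in plain C (the core count and the `core_slice` helper are illustrative assumptions on my part, not anything from Adapteva's SDK):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical static partition: core k of NCORES processes its own
 * contiguous slice of the input. No scheduler, no shared state. */
enum { NCORES = 64 };

/* Compute the [begin, end) range that core `core_id` should handle. */
static void core_slice(size_t total, int core_id, size_t *begin, size_t *end)
{
    size_t chunk = (total + NCORES - 1) / NCORES;  /* round up */
    *begin = (size_t)core_id * chunk;
    *end   = *begin + chunk;
    if (*begin > total) *begin = total;             /* clamp last cores */
    if (*end   > total) *end   = total;
}
```

Each core then loops only over its own slice, so there is no coordination to get wrong; the trade-off is that uneven per-element costs leave some cores idle, which is where a real job queue earns its keep.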

Daaark    3553
[quote name='Hodgman' timestamp='1353822980' post='5003891']
To not get carried away, that list of myths smells of marketing straw-man arguments...
[/quote]This looks like one of those projects that only aims to separate naive early investors from their cash. Like the Phantom, and that fake super-fast internet (which was really faked with a hidden computer under a desk), and countless others.

WTF is a "C++ programmable processor"? That's just 3 buzzwords in a nonsense sentence. It gives people who don't know any better a big boner though.

6677    1054
Any processor is C++ programmable; it just needs someone to write a compiler. Hell, someone wrote a BASIC to brainf*** compiler/converter; I'm pretty sure you could do a C++ to BF one as well, and that's an 8-instruction interpreted language. There are C to Z80 compilers and C++ to C converters, so it's not exactly far-fetched to say that C++ on the Z80 would be possible.

I can see why you're thinking along the scam lines. I must say though, I am still interested in seeing how well it turns out.

Ravyne    14300
Daark,

Marketing is all bullshit -- chances are if one reads the marketing hype for any high-tech device, one would label it all vaporware. Parallella is already shipping parts into the embedded space. The Kickstarter is to produce a low-cost dev board, with a stretch goal on offer to fund tape-out of their higher-end 64-core version. I pledged for 2 boards myself, although I was really hoping for a single 64-core board in the event of the stretch goal being hit.

Hodgman has already compared it to the PS3's SPU, and it's basically that -- except it's a scalar unit rather than vector, and has less memory. It can be programmed akin to a GPU and achieve good performance, but unlike a modern GPU, each compute unit is completely independent (it doesn't share local memory, and it's not part of a vector instruction). In theory, if you had a number of these comparable to the number of compute units in a GPU, normalized for clock speed, it should perform comparably at most scalable GPU workloads (less those that rely on any dedicated hardware the GPU has available), and it should be able to achieve high performance at workloads that are difficult or impossible to efficiently map onto a vector machine (i.e. problems that are big, but aren't necessarily parallel, or parallel workloads in which vectors diverge).

No one's made the claim that this is a better unit for graphics or even for the types of embarrassingly-parallel workloads that map easily to GPUs, but it has some interesting properties, and no solution that I know of delivers that kind of performance at just 2 watts. Creative's Zii processor is similar in concept, but different in structure, and that technology was just bought up by Intel, so I think that says something.

AFAIK, you can program the thing in C, and I believe there's an OpenCL compiler as well (or at least one being worked on). As Hodgman also said, this kind of thing maps well to compute-style languages like OpenCL, DirectCompute, or C++ AMP, but not to OpenMP-style parallel programming.
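The divergence point above is easy to make concrete with a toy workload whose per-element cost is data-dependent; Collatz step counting is a classic example (purely illustrative, nothing Epiphany-specific about it):

```c
#include <assert.h>

/* Each input needs a different, data-dependent number of iterations.
 * On a SIMD machine, every lane in a vector waits for the slowest lane
 * before the next instruction issues; independent scalar cores each
 * finish their element in their own time and move on. */
static int collatz_steps(unsigned long n)
{
    int steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}
```

Map a batch of inputs like these onto vector lanes and the whole vector runs at the speed of the worst element; hand one element to each independent core and the fast ones are free to pick up new work immediately.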

Daaark    3553
[quote name='6677' timestamp='1354055679' post='5004709']
Any processor is C++ programmable,[/quote]No processors are C++ programmable. That's not a feature of any processor. That's a feature of a compiler program that turns the contents of specially formatted plain text files into machine code.

With this logic, any processor is a 'car run-over-able processor' because it can be thrown in a street.

It's a nonsense sentence to fill space and excite a certain demographic.

[quote]"What? This is C++ programmable? My IT dept knows that C++ thing, whatever it is."[/quote]

[quote name='Ravyne' timestamp='1354055867' post='5004711']
Marketing is all bullshit[/quote]Of course. But there are different kinds, and different levels. Sometimes the marketing bullshit has a real product behind it. Sometimes it's a confidence trick. I didn't say it was an investor scam, however the style of writing on the sites reeks of it. Maybe it's unintentional.

Hodgman    51324
[quote name='Daaark' timestamp='1354047776' post='5004653']
WTF is a "C++ programmable processor"? That's just 3 buzzwords in a nonsense sentence. It gives people who don't know any better a big boner though.
[/quote]It's pretty obvious that it means "there's a C++ compiler for it".
For example, your GeForce 3, your Wii's GPU, or your Ageia PhysX accelerator board aren't C++ programmable... they're too limited for general-purpose work.

Even modern GPUs are pretty bad at running C++, because its "abstract machine" doesn't map well to the hardware.

To be fair, if you just take your C++ game and run it on one of these, the performance will be horrible at first, because ideally you want as many of your memory accesses as possible to be local to the current node, and your code at the moment probably doesn't fit in 2MiB of space. It will work, because you can still address off-node memory; it will just be slow.
These kinds of designs are usually used to run "jobs", where your engine puts work into a queue and checks its status later. Each "job" of work should know in advance which areas of memory it will need to access, so that at the beginning of the job they can all be DMA'ed to the target node, and upon completion the results can be DMA'ed back to host memory ([i]and while the job is running, all memory accesses are on the node, equivalent to never getting an L1 cache miss -- insanely good performance!![/i]).

The performance of this design also depends on whether each node has its own DMA controller. Ideally, you want to split a node's memory in half and restrict jobs to that limit. Then, while a job is running, the DMA controller can be bringing the data for the next job into the other half, so there's no downtime at all between jobs, and the CPU never, ever has to stall on memory. If there isn't a memory controller that can be used like this, then you'll end up stalling between jobs as the node's data is prepared. Ideally, their "[i]multicore framework[/i]" would handle all the job/DMA implementation details for you, but also give you the ability to bypass it and do the raw work yourself if desired.

If you've been writing an engine for the PS3, this won't be a problem for you... but if you've been writing for x86, then transitioning to this model will be a [b]lot[/b] of work.

n.b. even though on x86 you don't have to use this kind of "job" design where you're explicit about memory regions, if you do design your engine this way, you can really cut down on your L2 cache-miss counts and boost your performance a lot. It's simply a good way of programming.
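That job pattern can be sketched in plain C. This is a toy model: `memcpy` stands in for the DMA transfers, and the `Job`/`run_job` names are invented for illustration. It just shows the shape of "declare your memory up front, copy in, run on local memory, copy out":

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A "job" declares up front which memory it reads and writes, so the
 * runtime can transfer inputs to node-local memory before the job runs
 * and transfer results back afterwards. All names are hypothetical. */
typedef struct {
    const float *input;   /* host-memory region the job reads  */
    float       *output;  /* host-memory region the job writes */
    size_t       count;
    void (*kernel)(const float *in, float *out, size_t n);
} Job;

enum { NODE_BUF = 1024 };         /* pretend node-local scratch memory */
static float local_in[NODE_BUF];
static float local_out[NODE_BUF];

static void run_job(const Job *job)
{
    /* "DMA in": copy the declared inputs into node-local memory */
    memcpy(local_in, job->input, job->count * sizeof(float));

    /* While running, every access the kernel makes is node-local */
    job->kernel(local_in, local_out, job->count);

    /* "DMA out": copy the results back to host memory */
    memcpy(job->output, local_out, job->count * sizeof(float));
}

/* Example kernel: doubles each element */
static void scale_by_two(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}
```

A real implementation would split the scratch buffers in half and double-buffer them, so the DMA engine fills one half with the next job's data while the kernel chews on the other, exactly as described above.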
[quote name='Ravyne' timestamp='1354055867' post='5004711']
Hodgman has already compared it to the PS3's SPU, and it's basically that -- except it's a scalar unit rather than vector, and has less memory.
[/quote]It's actually got more per-node memory -- the SPUs only have 256KiB, and if you're running a job-management system, then you've necessarily got less than that.

Wait, I just re-checked their site, and the [url="http://www.adapteva.com/white-papers/ten-myths-debunked-by-the-epiphany-iv-64-core-accelerator-chip/"]comparison grid[/url] shows 2MiB, but [url="http://www.adapteva.com/products/epiphany-ip/epiphany-architecture-ip/"]this page[/url] says up to 128KiB, and [url="http://www.adapteva.com/introduction/"]this page[/url] says 1MiB? Are these numbers from different products?

Ravyne    14300
[quote name='Hodgman' timestamp='1354067820' post='5004783']
[quote name='Ravyne' timestamp='1354055867' post='5004711']
Hodgman has already compared it to the PS3's SPU, and it's basically that -- except it's a scalar unit rather than vector, and has less memory.
[/quote]It's actually got more per-node memory -- the SPUs only have 256KiB, and if you're running a job-management system, then you've necessarily got less than that.

Wait, I just re-checked their site, and the [url="http://www.adapteva.com/white-papers/ten-myths-debunked-by-the-epiphany-iv-64-core-accelerator-chip/"]comparison grid[/url] shows 2MiB, but [url="http://www.adapteva.com/products/epiphany-ip/epiphany-architecture-ip/"]this page[/url] says up to 128KiB, and [url="http://www.adapteva.com/introduction/"]this page[/url] says 1MiB? Are these numbers from different products?
[/quote]

Yeah, their literature is less than clear sometimes unless you wade through it. The 2MiB number refers to the sum of all core memories (32KiB/core) in the 64-core model. The 16-core model on the Kickstarter boards also has 32KiB/core. They plan for a 64-core/128KiB model, followed by a 1024-core/128KiB model, and a 1024-core/1MiB model after that.

IIRC, the architecture allows up to 1MiB per core and is meant to scale up to 4096 cores ultimately. Each core can also access the memory of other nodes, since they share an address space; it'll just be slower. I'm not sure whether the additional latency of doing so is constant or per-hop, but I believe the topology is that each node connects to its four cardinal neighbours.
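If the latency really is per-hop, the cost of a remote access on a 4-neighbour mesh grows with Manhattan distance between the two cores. A small sketch (the 8x8 layout and row-major core numbering are assumptions on my part, not Adapteva's documented addressing scheme):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: 64 cores in an 8x8 mesh, numbered row-major, each core
 * linked to its four cardinal neighbours. A message between two cores
 * then traverses a number of links equal to the Manhattan distance. */
enum { MESH_SIDE = 8 };

static int mesh_hops(int core_a, int core_b)
{
    int ax = core_a % MESH_SIDE, ay = core_a / MESH_SIDE;
    int bx = core_b % MESH_SIDE, by = core_b / MESH_SIDE;
    return abs(ax - bx) + abs(ay - by);
}
```

Under that model, opposite corners of the 64-core part are 14 hops apart, so placing jobs that talk to each other on adjacent cores would matter a lot.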

Another interesting aside is that the floating-point ISA is fairly complete, but the integer ISA is fairly minimalist (no divide, or even multiply, I think). Still, the aggregate integer benchmark scores are competitive with higher-end x86 CPUs.

Kyall    287
I think a better way to describe it is that as an architecture this thing works like a normal processor; it's not a specialized SIMD or MIMD machine like a graphics processor. From that point of view it is C++ programmable, whereas a GPU will only support HLSL or GLSL or OpenCL or what have you, because it's SIMD/MIMD rather than a traditional random get/set memory access architecture.

6677    1054
[quote name='Daaark' timestamp='1354056856' post='5004719']
No processors are C++ programmable. That's not a feature of any processor. That's a feature of a compiler program that turns the contents of specially formatted plain text files into machine code.
[/quote]
Wow, you're dense. Read the entire post. I said
[quote name='6677' timestamp='1354055679' post='5004709']
Any processor is C++ programmable, just needs someone to write a compiler.
[/quote]
