

Adapteva IC has 50+ GFLOPS per watt, can graphics benefit from it?



#1 ynm   Members   -  Reputation: 172


Posted 24 November 2012 - 11:14 PM

Hi,

Here is the link; Adapteva claims 70 GFLOPS per watt, better than Nvidia or AMD flagship GPUs:
http://www.adapteva.com/white-papers/ten-myths-debunked-by-the-epiphany-iv-64-core-accelerator-chip/

100 GFLOPS under 2 watts; imagine this IC integrated into a mobile platform, which would then have a huge amount of compute power. They say they can produce it at $199 for 64 cores and 100 GFLOPS. Will this kind of IC help the graphics field?

Regards


#2 Hodgman   Moderators   -  Reputation: 31224


Posted 24 November 2012 - 11:56 PM

Not to get carried away, but that list of myths smells of marketing straw-man arguments...

That said, it looks cool. The idea of having lots of fast little CPUs with no hardware caches or global RAM chips, just a small dedicated bank of RAM directly connected to each core, is the same design used by the PS3's SPUs. IMHO, these kinds of designs are the future of parallel computing.
The PS3's GPU is known to be really, really outdated... but PS3 games continue to look pretty decent because programmers are able to implement a lot of graphical techniques on the SPUs. So yes, this kind of processor would be very, very useful.

I would really love it if every PC had one of these in it to accelerate heavy computation... but it's hard to market it. Old/existing software doesn't benefit from it, and new software has to be written specifically for it (and why would I bother writing my software for it, if no customers yet have one), so there's a real chicken-and-egg problem when trying to turn this into a product.

#3 Bacterius   Crossbones+   -  Reputation: 9105


Posted 25 November 2012 - 12:27 AM

Yeah, what I would really like to see is a modular accelerator card where you can just plug in more and more execution units as you get them, instead of buying whole new hardware every two or three years, kind of like how memory sticks work. That would be interesting.

The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#4 Waterlimon   Crossbones+   -  Reputation: 2602


Posted 25 November 2012 - 10:40 AM

> Yeah, what I would really like to see is a modular accelerator card where you can just plug in more and more execution units as you get them, instead of buying whole new hardware every two or three years, kind of like how memory sticks work. That would be interesting.


Yay for a scale model of your Minecraft house built using processor and RAM cubes!

o3o


#5 Cornstalks   Crossbones+   -  Reputation: 6991


Posted 25 November 2012 - 10:51 AM

> I would really love it if every PC had one of these in it to accelerate heavy computation... but it's hard to market it. Old/existing software doesn't benefit from it, and new software has to be written specifically for it (and why would I bother writing my software for it, if no customers yet have one), so there's a real chicken-and-egg problem when trying to turn this into a product.

Would something like OpenCL or OpenMP be able to resolve this (the issue of having to write specifically for it)? I haven't used either, so I have no clue.
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

#6 6677   Members   -  Reputation: 1058


Posted 25 November 2012 - 11:49 AM

OpenCL would be able to use that thing pretty nicely, but as far as I can tell it's a standalone device, not a PC expansion card. Software written for the device itself should at least be able to use OpenCL nicely.

#7 TheChubu   Crossbones+   -  Reputation: 4588


Posted 25 November 2012 - 06:34 PM

> Yay for a scale model of your Minecraft house built using processor and RAM cubes!

I vote for having a voxel world in which each voxel runs on its own CPU :P

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

 

My journals: dustArtemis ECS framework and Making a Terrain Generator


#8 Hodgman   Moderators   -  Reputation: 31224


Posted 25 November 2012 - 07:52 PM


>> I would really love it if every PC had one of these in it to accelerate heavy computation... but it's hard to market it. Old/existing software doesn't benefit from it, and new software has to be written specifically for it (and why would I bother writing my software for it, if no customers yet have one), so there's a real chicken-and-egg problem when trying to turn this into a product.

> Would something like OpenCL or OpenMP be able to resolve this (the issue of having to write specifically for it)? I haven't used either, so I have no clue.

"Compute" type languages would be a better fit than OpenMP, but yes, software written for either of them would have a better chance of being instantly portable. You're still facing the problem that most software isn't written using OpenCL though Posted Image

#9 MJP   Moderators   -  Reputation: 11624


Posted 25 November 2012 - 11:57 PM

C'mon guys! With their architecture you get 800MHz * 64 cores which gives you a total of 51.2 GHz!!!! That's wayyyyy more than the puny 16 GHz you get from a GPU!

If that doesn't sell you on their tech, then this mind-blowing performance benchmark is sure to do the trick.

Edited by MJP, 25 November 2012 - 11:58 PM.


#10 Olof Hedman   Crossbones+   -  Reputation: 2911


Posted 26 November 2012 - 05:26 AM

> C'mon guys! With their architecture you get 800MHz * 64 cores which gives you a total of 51.2 GHz!!!! That's wayyyyy more than the puny 16 GHz you get from a GPU!

> If that doesn't sell you on their tech, then this mind-blowing performance benchmark is sure to do the trick.


I'd like to see a comparison with a GPU implementation instead of a CPU implementation. Of course it's a lot faster than the CPU. Though it was actually less lightning-fast than I'd expected. But then again, just 2W :)
I don't doubt their tech is cool and useful, but it's rarely fair to compare cores and "GHz" across different architecture types.

Edited by Olof Hedman, 26 November 2012 - 05:28 AM.


#11 powly k   Members   -  Reputation: 653


Posted 26 November 2012 - 12:22 PM

Might be cool for mobile stuff because of the power usage, but come on... 102 GFLOPS, compared to desktop GPUs that got into the TFLOPS region years ago.

#12 Kyall   Members   -  Reputation: 287


Posted 27 November 2012 - 04:07 AM

They claim it's C/C++ programmable (going off month-old memory here). If that's legit, this thing is awesome.

Also, I have one of their Epiphany III boards coming in May next year; just checked, I'd forgotten about that. Ask me then whether it can be programmed with C/C++. (I'm talking about the Epiphany cores, not the ARM cores.)
I say Code! You say Build! Code! Build! Code! Build! Can I get a woop-woop? Woop! Woop!

#13 6677   Members   -  Reputation: 1058


Posted 27 November 2012 - 09:26 AM

With a decent C compiler you should be able to program that thing in C easily. Although managing multitasking across 64 cores yourself is another matter, let alone across their planned 1024-core model...

#14 MrDaaark   Members   -  Reputation: 3555


Posted 27 November 2012 - 02:22 PM

> Not to get carried away, but that list of myths smells of marketing straw-man arguments...

This looks like one of those projects that only aims to separate naive early investors from their cash. Like the Phantom, and that fake super-fast internet (which was actually faked with a hidden computer under a desk), and countless others.

WTF is a "C++ programmable processor"? That's just three buzzwords in a nonsense sentence. It gives people who don't know any better a big boner, though.

#15 6677   Members   -  Reputation: 1058


Posted 27 November 2012 - 04:34 PM

Any processor is C++ programmable; it just needs someone to write a compiler. Hell, someone wrote a BASIC-to-brainf*** compiler/converter, and I'm pretty sure you could do a C++-to-BF one as well, and that's an 8-instruction interpreted language. There are C-to-Z80 compilers and C++-to-C converters, so it's not exactly far-fetched to say that C++ on the Z80 would be possible.

I can see why you're thinking along the scam lines. I must say, though, I'm still interested in seeing how well it turns out.

#16 Ravyne   GDNet+   -  Reputation: 7885


Posted 27 November 2012 - 04:37 PM

Daark,

Marketing is all bullshit -- chances are that if one reads the marketing hype for any high-tech device, one would label it all vaporware. Parallella is already shipping parts into the embedded space. The Kickstarter is to produce a low-cost dev board, with a stretch goal on offer to fund tape-out of their higher-end 64-core version. I backed it for 2 boards myself, although I was really hoping for a single 64-core board in the event of the stretch goal being hit.

Hodgman has already compared it to the PS3's SPUs, and it's basically that -- except it's a scalar unit rather than a vector one, and has less memory. It can be programmed akin to a GPU and achieve good performance, but unlike a modern GPU, each compute unit is completely independent (it doesn't share local memory, and it's not part of a vector instruction). In theory, if you had a number of these comparable to the number of compute units in a GPU, normalized for clock speed, it should perform comparably at most scalable GPU workloads (less those that rely on any dedicated hardware the GPU has available), and it should be able to achieve high performance at workloads that are difficult or impossible to map efficiently onto a vector machine (i.e. problems that are big but aren't necessarily parallel, or parallel workloads in which vectors diverge).
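
To illustrate the divergence point, here's a toy C++ sketch (nothing Epiphany-specific, just the shape of the workload): an escape-time loop whose iteration count depends on the data. Lanes of a vector group must all wait for the slowest lane, while fully independent scalar cores finish each item as soon as it escapes.

// Toy divergent workload: a Mandelbrot-style escape-time loop.
// The iteration count depends on the input point, so vector lanes
// processing different points diverge; independent scalar cores don't care.
int escape_time(float cx, float cy, int max_iter)
{
    float x = 0.0f, y = 0.0f;
    int i = 0;
    while (x * x + y * y < 4.0f && i < max_iter)
    {
        float xt = x * x - y * y + cx;
        y = 2.0f * x * y + cy;
        x = xt;
        ++i;
    }
    return i;
}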

No one's made the claim that this is a better unit for graphics, or even for the types of embarrassingly parallel workloads that map easily to GPUs, but it has some interesting properties, and no solution that I know of delivers that kind of performance at just 2 watts. Creative's Zii processor is similar in concept but different in structure, and that technology was just bought up by Intel, so I think that says something.

AFAIK, you can program the thing in C, and I believe there's an OpenCL compiler as well (or at least one being worked on). As Hodgman also said, this kind of thing maps well to compute-style languages like OpenCL, DirectCompute, or C++ AMP, but not to OpenMP-style parallel programming.

#17 MrDaaark   Members   -  Reputation: 3555


Posted 27 November 2012 - 04:54 PM

> Any processor is C++ programmable;

No processors are C++ programmable. That's not a feature of any processor. That's a feature of a compiler program that turns the contents of specially formatted plain-text files into machine code.

By this logic, any processor is a "car run-over-able processor" because it can be thrown into a street.

It's a nonsense sentence to fill space and excite a certain demographic.

"What? This is C++ programmable? My IT dept knows that C++ thing, whatever it is."


> Marketing is all bullshit

Of course. But there are different kinds, and different levels. Sometimes the marketing bullshit has a real product behind it; sometimes it's a confidence trick. I didn't say it was an investor scam, but the style of writing on the sites reeks of it. Maybe it's unintentional.

#18 Hodgman   Moderators   -  Reputation: 31224


Posted 27 November 2012 - 07:57 PM

> WTF is a "C++ programmable processor"? That's just three buzzwords in a nonsense sentence. It gives people who don't know any better a big boner, though.

It's pretty obvious that it means "there's a C++ compiler for it".
For example, your GeForce 3, your Wii's GPU, or your Ageia PhysX accelerator board aren't C++ programmable... they're too limited for general-purpose work.

Even modern GPUs are pretty bad at running C++, because its "abstract machine" doesn't map well to the hardware.

To be fair, if you just take your C++ game and run it on one of these, the performance will be horrible at first, because ideally you want as many of your memory accesses as possible to be local to the current node, and your code at the moment probably doesn't fit in 2MiB of space. It will work, because you can still address off-node memory; it will just be slow.
These kinds of designs are usually used to run "jobs", where your engine puts work into a queue and checks its status later. Each "job" of work should know in advance which areas of memory it will need to access, so that at the beginning of the job they can all be DMA'ed to the target node, and upon completion the results can be DMA'ed back to host memory (and while the job is running, all memory accesses are on the node, which is equivalent to never getting an L1 cache miss -- insanely good performance!!).

The performance of this design also depends on whether each node has its own DMA controller. Ideally, you want to split a node's memory in half and restrict jobs to that limit. Then, while a job is running, the DMA controller can be bringing the data for the next job into the other half, so there's no downtime at all between jobs, and the CPU never, ever has to stall on memory. If there isn't a memory controller that can be used like this, then you'll end up stalling between jobs as the node's data is prepared. Ideally, their "multicore framework" would handle all the job/DMA implementation details for you, but also give you the ability to bypass it and do the raw work yourself if desired.
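
As a minimal sketch of that double-buffered job loop -- where dma_get/dma_put/dma_wait and execute() are hypothetical stand-ins for whatever the real SDK provides, and LOCAL_HALF is my assumption of half a 32KiB local bank:

#include <cstddef>

// Hypothetical primitives standing in for the real SDK's DMA API
// (stubbed so the sketch compiles; they are not Adapteva's actual calls):
inline void dma_get(void*, const void*, std::size_t) { /* start async read into local memory */ }
inline void dma_put(void*, const void*, std::size_t) { /* start async write back to host memory */ }
inline void dma_wait(int) { /* block until that buffer's transfer completes */ }

constexpr std::size_t LOCAL_HALF = 16 * 1024; // half of a 32KiB local bank (assumption)

struct Job { const void* src; void* dst; std::size_t size; };

inline void execute(const Job&, void*) { /* the actual work; all accesses are node-local */ }

void run_jobs(const Job* jobs, int count)
{
    static char buffer[2][LOCAL_HALF]; // split the node's memory in half
    int cur = 0;

    if (count > 0)
        dma_get(buffer[cur], jobs[0].src, jobs[0].size); // prefetch the first job's input

    for (int i = 0; i < count; ++i)
    {
        dma_wait(cur); // block until job i's input is node-local

        int next = cur ^ 1;
        if (i + 1 < count) // start streaming the next job's input into the other half
            dma_get(buffer[next], jobs[i + 1].src, jobs[i + 1].size);

        execute(jobs[i], buffer[cur]);                   // runs without ever leaving local memory
        dma_put(jobs[i].dst, buffer[cur], jobs[i].size); // stream results back out
        // (a real implementation would also wait on this put before reusing the buffer)

        cur = next;
    }
}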

If you've been writing an engine for the PS3, this won't be a problem for you... but if you've been writing for x86, then transitioning to this model will be a lot of work.

n.b. even though on x86 you don't have to use this kind of "job" design where you're explicit about memory regions, if you do design your engine this way, you can really cut down on your L2 cache-miss counts and boost your performance a lot. It's simply a good way of programming.

> Hodgman has already compared it to the PS3's SPUs, and it's basically that -- except it's a scalar unit rather than a vector one, and has less memory.

It's actually got more per-node memory -- the SPUs only have 256KiB, and if you're running a job-management system, then you've necessarily got less than that.

Wait, I just re-checked their site, and the comparison grid shows 2MiB, but this page says up to 128KiB, and this page says 1MiB? Are these numbers from different products?

Edited by Hodgman, 27 November 2012 - 08:05 PM.


#19 Ravyne   GDNet+   -  Reputation: 7885


Posted 27 November 2012 - 08:36 PM


>> Hodgman has already compared it to the PS3's SPUs, and it's basically that -- except it's a scalar unit rather than a vector one, and has less memory.

> It's actually got more per-node memory -- the SPUs only have 256KiB, and if you're running a job-management system, then you've necessarily got less than that.

> Wait, I just re-checked their site, and the comparison grid shows 2MiB, but this page says up to 128KiB, and this page says 1MiB? Are these numbers from different products?


Yeah, their literature is less than clear sometimes unless you wade through it. The 2MiB number refers to the sum of all core memories (32KiB/core) in the 64-core model. The 16-core model on the Kickstarter boards also has 32KiB per core. They plan for a 64-core/128KiB model, followed by a 1024-core/128KiB model, and a 1024-core/1MiB model after that.

IIRC, the architecture allows for up to 1MiB per core, and it is meant to scale up to 4096 cores ultimately. Each core can also access the memory of other nodes, since they share an address space; it'll just be slower. I'm not sure whether the additional latency of doing so is constant or per-hop, but I believe the topology is that each node connects to its 4 cardinal neighbors.
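
If it is per-hop, the cost of a remote access would grow with the Manhattan distance between cores on the mesh. A toy model (my assumption, not anything from their literature):

#include <cstdlib>

// Hops between two cores on a 2D mesh where each node links to its
// 4 cardinal neighbors: simply the Manhattan distance.
int hops(int row_a, int col_a, int row_b, int col_b)
{
    return std::abs(row_a - row_b) + std::abs(col_a - col_b);
}

// e.g. on an 8x8 grid, core (0,0) to core (7,7) is 14 hops,
// while its cardinal neighbor (0,1) is 1 hop away.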

Another interesting aside is that the floating-point ISA is fairly complete, but the integer ISA is fairly minimalist (no divide, or even multiply, I think). Still, the aggregate integer benchmark scores are competitive with higher-end x86 CPUs.

Edited by Ravyne, 27 November 2012 - 08:37 PM.


#20 Kyall   Members   -  Reputation: 287


Posted 28 November 2012 - 07:57 PM

I think a better way to describe it is that, as an architecture, this thing works like a normal processor; it's not specialized SIMD or MIMD hardware like a graphics processor. From that point of view it is C++ programmable, whereas a GPU will only support HLSL or OpenGL or OpenCL or what have you, because it's SIMD/MIMD rather than a traditional random get/set memory-access architecture.
I say Code! You say Build! Code! Build! Code! Build! Can I get a woop-woop? Woop! Woop!



