
GPU NOR/NAND Gate using Fragment Shader's Dot Product

Recommended Posts

WalkingTimeBomb    9

Sorry it's a link, but it will be worth your time.

For a quick look at the source code, see:

https://digdigrpg.googlecode.com/svn/trunk/GLSLNOR.7z

The source code is for Linux, and you need GLEW to compile it.

 

Article: http://jinjuyu.blog.me/40207343365

 

 

It means a MASSIVE NUMBER OF SOFTWARE NOR GATES!!!

 

 

 

NAND Gate:

 

http://jinjuyu.blog.me/40208164821
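For readers who don't want to grab the archive, here is a minimal sketch of the idea as I understand it from the article: each pixel's two input bits arrive as 0.0/1.0 floats, and a single dot product (DP4) plus one comparison yields the NOR (or NAND) output, with no integer/bitwise instructions involved. The texture names and shader layout below are my own assumptions, not taken from GLSLNOR.7z.

// Hypothetical fragment shader: one NOR gate per pixel via a dot product.
// Sketch only; the real GLSLNOR.7z source may be organized differently.
#version 120
uniform sampler2D inputA;   // bit A stored as 0.0 or 1.0
uniform sampler2D inputB;   // bit B stored as 0.0 or 1.0
void main()
{
    float a = texture2D(inputA, gl_TexCoord[0].st).r;
    float b = texture2D(inputB, gl_TexCoord[0].st).r;
    // DP4: sums the two inputs in a single dot-product instruction.
    float s = dot(vec4(a, b, 0.0, 0.0), vec4(1.0, 1.0, 0.0, 0.0));
    // NOR: 1 only when both inputs are 0 (sum < 0.5).
    float nor = (s < 0.5) ? 1.0 : 0.0;
    // NAND would instead be: 1 unless both inputs are 1 (sum < 1.5).
    gl_FragColor = vec4(nor, nor, nor, 1.0);
}

Presumably the output texture is then bound as the input of the next pass so that gates can be chained, but that is just the usual ping-pong setup, not anything specific from the linked code.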


Bacterius    13165

This is interesting, but don't modern graphics cards (i.e. the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?
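For comparison, the "native bitwise" path would look something like this on GLSL 1.30+ hardware, where a single 32-bit integer operation evaluates 32 gate pairs at once; this is only an illustrative sketch, not code from the linked archive.

// Hypothetical comparison: native integer bitwise ops (GLSL 1.30+).
// One 32-bit OR plus a NOT gives 32 NOR results per instruction.
#version 130
uniform usampler2D wordsA;  // 32 packed input bits per texel
uniform usampler2D wordsB;
in vec2 uv;
out uvec4 result;           // written to an integer render target
void main()
{
    uint a = texture(wordsA, uv).r;
    uint b = texture(wordsB, uv).r;
    result = uvec4(~(a | b), 0u, 0u, 0u);  // 32 NOR gates at once
}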

WalkingTimeBomb    9

Quoting Bacterius: "This is interesting, but don't modern graphics cards (i.e. the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?"

The MFU (Multi-Function Unit) is SLOW AS HELL. Using only DP4, it is abundant and fast.

Ohforf sake    2052

Quoting Bacterius: "This is interesting, but don't modern graphics cards (i.e. the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?"


Actually, NVidia cards are rather slow at integer arithmetic. According to the CUDA documentation, a Kepler SMX can do 192 floating-point operations (like add, mul, or mad) per cycle, but only 160 integer add/sub and bitwise and/or/xor etc. Integer mul and shift are as slow as 32 operations per cycle.

This is why ATI/AMD cards are better suited for cryptographic workloads like bitcoin mining or brute-forcing.


I didn't really read the links the OP posted, so the following might be totally off topic, but I think there is a misconception here about DP4. The GeForce 7000 series was the last NVidia GPU that did SIMD, and AMD/ATI followed shortly after. Today, a DP4 is 1 FMUL followed by 3 dependent FMADDs. So it's not 1 cycle: it has a throughput of 1/4 per cycle and ALU if properly pipelined, and a latency of 32 cycles (assuming 8 cycles per operation). So 192 float ALUs with 1 DP4 every 4 cycles yield 48 logical operations per cycle and SMX. If the 160 int ALUs were used instead, you would get 32 logical operations per ALU and cycle, yielding 5120 logical operations per cycle and SMX, outperforming the DP4 approach by more than a factor of 100.
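To make the arithmetic explicit (same numbers as above, per Kepler SMX and cycle):

\[
\text{DP4 path: } 192\ \text{FP ALUs} \times \tfrac{1}{4}\ \text{DP4/cycle} \times 1\ \text{gate/DP4} = 48\ \text{gates/cycle}
\]
\[
\text{Integer path: } 160\ \text{int ALUs} \times 1\ \text{op/cycle} \times 32\ \text{gates/op} = 5120\ \text{gates/cycle}
\]
\[
5120 / 48 \approx 107
\]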


Edit: Just read the first part of the link, and I think there is another, even bigger misconception. The assumption that the GPU will execute the entire fragment program for all pixels of the image in ONE CYCLE, no matter the dimensions of said image or the length of the fragment program, is ... how do I put this ... incorrect. If it were the case, then yes, any GPU could emulate hardware gates in software at arbitrary speeds as described in the posted link, thereby outperforming even its own hardware (paradox alert). But it isn't.

