GPU NOR/NAND Gate using Fragment Shader's Dot Product

Started by
3 comments, last by Ohforf sake 10 years ago

Sorry it's just a link, but it will be worth your time.

For a quick look at the source code, see:

https://digdigrpg.googlecode.com/svn/trunk/GLSLNOR.7z

The source code is for Linux, and you need GLEW to compile it.

Article: http://jinjuyu.blog.me/40207343365

It means it's a MASSIVE NUMBER OF SOFTWARE NOR GATES!!!

NAND Gate:

http://jinjuyu.blog.me/40208164821
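
For readers who don't want to follow the links, the idea appears to be roughly the following. This is a minimal sketch, not the linked source; the uniform names, the 0.0/1.0 texture encoding, and the GLSL version are assumptions made for illustration.

#version 120
// Minimal sketch: NOR and NAND of two 0.0/1.0 inputs via one dot product.
// uInputA, uInputB and the 0.0/1.0 encoding are assumptions, not the linked code.
uniform sampler2D uInputA;
uniform sampler2D uInputB;
varying vec2 vTexCoord;

void main()
{
    float a = texture2D(uInputA, vTexCoord).r;
    float b = texture2D(uInputB, vTexCoord).r;

    // One DP4 computes a + b: dot((a, b, 0, 0), (1, 1, 0, 0)).
    float sum = dot(vec4(a, b, 0.0, 0.0), vec4(1.0, 1.0, 0.0, 0.0));

    // NOR is 1 only when both inputs are 0 (sum <= 0.5);
    // NAND is 0 only when both inputs are 1 (sum >= 1.5).
    float nor  = step(sum, 0.5);
    float nand = 1.0 - step(1.5, sum);

    gl_FragColor = vec4(nor, nand, 0.0, 1.0);
}

Rendering the result to a texture and feeding it back in as the next pass's input would let such gates be chained into larger circuits.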



This is interesting, but don't modern graphics cards (i.e. the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

“This is interesting, but don't modern graphics cards (i.e. the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?”

The MFU (Multi-Function Unit) is SLOW AS HELL. Using only DP4, it is abundant and fast.

“This is interesting, but don't modern graphics cards (i.e. the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?”


Actually, NVidia cards are rather slow at integer arithmetic. According to the CUDA documentation, a Kepler SMX can do 192 floating-point operations (like add, mul, or mad) per cycle, but only 160 integer add/sub and bitwise and/or/xor operations. Integer mul and shift are as slow as 32 operations per cycle.

This is why ATI/AMD cards are better suited for cryptographic stuff like Bitcoin mining or brute-forcing.


I didn't really read the links the OP posted, so the following might be totally off topic, but I think there is a misconception here about DP4. The GeForce 7000 series was the last NVidia GPU that did SIMD, and AMD/ATI followed shortly after. Today, a DP4 is 1 FMUL followed by 3 dependent FMADDs. So it's not 1 cycle: it has a throughput of 1/4 per cycle and ALU if properly pipelined, and a latency of 32 cycles (assuming 8 cycles per operation). So 192 float ALUs with 1 DP4 every 4 cycles yield 48 logical operations per cycle and SMX. If the 160 int ALUs were used instead, you would get 32 logical operations per ALU and cycle (a single 32-bit bitwise instruction evaluates 32 gates in parallel), yielding 5120 logical operations per cycle and SMX, outperforming the DP4 approach by more than a factor of 100.
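
To make that comparison concrete, here is a minimal sketch of the integer-ALU route; the uniform names and the packed-uint texture format are assumptions, not taken from the linked source. One bitwise operation on a 32-bit word evaluates 32 NOR gates per fragment, which is where the factor-of-100 advantage over the one-result-per-DP4 approach comes from.

#version 330 core
// Sketch of the integer-ALU alternative: each uint texel packs 32 gate inputs,
// so a single bitwise NOR evaluates 32 gates per fragment invocation.
// Names and the packed-uint texture format are assumptions for illustration.
uniform usampler2D uInputA;
uniform usampler2D uInputB;
in vec2 vTexCoord;
out uvec4 fragOut;   // render target must be an unsigned integer format

void main()
{
    uint a = texture(uInputA, vTexCoord).r;
    uint b = texture(uInputB, vTexCoord).r;

    // 32 NOR results from one OR plus one NOT, versus one result per 4-cycle DP4.
    fragOut = uvec4(~(a | b), 0u, 0u, 1u);
}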


Edit: I just read the first part of the link, and I think there is another, even bigger misconception. The assumption that the GPU will execute the entire fragment program for all pixels of the image in ONE CYCLE, no matter the dimensions of said image or the length of the fragment program, is ... how do I put this ... incorrect. If that were the case, then yes, any GPU could emulate hardware gates in software at arbitrary speeds as described in the posted link, thereby outperforming even its own hardware (paradox alert). But it isn't.

This topic is closed to new replies.
