This is interesting, but don't modern graphics cards (= the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?

Actually, NVidia cards are rather slow at integer arithmetic. According to the CUDA documentation, a Kepler SMX can do 192 floating point operations (like add, mul or mad) per cycle, but only 160 integer add/sub and bitwise and/or/xor etc. Integer mul and shift are as slow as 32 operations per cycle.

This is why ATI/AMD cards are better suited for cryptographic stuff like bitcoin mining or brute-forcing.

I didn't really read the links the OP posted, so the following might be totally off topic, but I think there is a misconception here about DP4. The GeForce 7000 series was the last NVidia GPU that did SIMD, and AMD/ATI followed shortly after. Today, a DP4 is 1 FMUL followed by 3 dependent FMADDs, so it's not 1 cycle. It has a throughput of 1/4 per cycle and ALU if properly pipelined, and a latency of 32 cycles (assuming 8 cycles per operation). So 192 float ALUs with 1 DP4 every 4 cycles yields 48 logical operations per cycle and SMX. If the 160 int ALUs were used instead, each 32-bit bitwise instruction would process 32 bits at once, so you would get 32 logical operations per ALU and cycle, yielding 160 * 32 = 5120 logical operations per cycle and SMX. That outperforms the DP4 approach by more than a factor of 100.
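To make the arithmetic easy to check, here is a rough sketch in Python. It only reuses the per-SMX figures quoted above; the constant names and the `dp4` helper are made up for illustration, and it assumes (as above) that one DP4 counts as one logical operation while one 32-bit bitwise instruction counts as 32.

```python
# Back-of-the-envelope check of the per-SMX, per-cycle numbers quoted above.
# All names here are illustrative, not taken from any vendor API.
FLOAT_ALUS = 192       # FP add/mul/mad per cycle (figure quoted above)
INT_ALUS = 160         # integer add/sub and bitwise ops per cycle (figure quoted above)
DP4_INSTRUCTIONS = 4   # 1 FMUL + 3 dependent FMADD
BITS_PER_WORD = 32     # one 32-bit and/or/xor acts on 32 independent bits


def dp4(a, b):
    """DP4 the way a scalar ALU has to do it: one multiply,
    then three multiply-adds that each depend on the previous result."""
    acc = a[0] * b[0]                  # FMUL
    for x, y in zip(a[1:], b[1:]):     # 3 dependent FMADDs
        acc = acc + x * y
    return acc


# Assumption: one DP4 result counts as one logical operation.
dp4_logic_ops = FLOAT_ALUS // DP4_INSTRUCTIONS

# Assumption: every bit of a 32-bit bitwise op counts as one logical operation.
bitwise_logic_ops = INT_ALUS * BITS_PER_WORD

speedup = bitwise_logic_ops / dp4_logic_ops

print(dp4_logic_ops, bitwise_logic_ops, speedup)
```

This ignores latency hiding, memory traffic and scheduling entirely, so treat the ratio as an upper-level estimate rather than a benchmark.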

Edit: Just read the first part of the link and I think there is another, even bigger misconception. The assumption that the GPU will execute the entire fragment program for all pixels of the image in ONE CYCLE, no matter the dimensions of said image or the length of the fragment program, is ... how do I put this ... incorrect. If it were the case, then yes, any GPU could emulate hardware gates in software at arbitrary speeds as described in the posted link, thereby outperforming even its own hardware (paradox alert). But it isn't: the work still has to be pushed through a fixed number of ALUs, so the time grows with both the number of pixels and the length of the program.
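As a very rough sketch of why the "one cycle" idea can't hold, the following Python snippet computes a lower bound on the cycle count from nothing but the amount of work and the ALU throughput. The function name and the example image size are made up for illustration; the 192 ops per cycle is the single-SMX float figure quoted earlier, and latency, memory and scheduling are all ignored.

```python
def min_cycles(width, height, instructions_per_pixel, ops_per_cycle):
    """Lower bound on cycles: total scalar instructions divided by
    how many the hardware can retire per cycle, rounded up."""
    total_ops = width * height * instructions_per_pixel
    return (total_ops + ops_per_cycle - 1) // ops_per_cycle


# Hypothetical example: a 1024x1024 image, one DP4 (4 instructions) per pixel,
# on a single SMX retiring 192 float operations per cycle.
small = min_cycles(1024, 1024, 4, 192)

# Doubling the image width doubles the work, so the bound roughly doubles too.
large = min_cycles(2048, 1024, 4, 192)

print(small, large)
```

Even in this idealised model the cost scales with image size and program length, which is all the argument above needs.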

**Edited by Ohforf sake, 26 March 2014 - 10:34 AM.**