
GPU NOR/NAND Gate using Fragment Shader's Dot Product



#1 WalkingTimeBomb   Members   -  Reputation: 4


Posted 02 March 2014 - 11:48 AM

Sorry it's a link, but it will be worth your time.

For a quick look at the source code, see:

https://digdigrpg.googlecode.com/svn/trunk/GLSLNOR.7z

The source code is for Linux, and you need GLEW to compile it.

 

Article: http://jinjuyu.blog.me/40207343365

It means you get a MASSIVE NUMBER of software NOR gates!!!

NAND gate:

http://jinjuyu.blog.me/40208164821
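
A minimal GLSL sketch of the idea, assuming the two input bits are encoded as 0.0/1.0 in a texture's red and green channels (this is an illustration, not the shader from the linked article; all names and the bit encoding are made up):

#version 120
// Sketch: emulate NOR/NAND per pixel with one dot product plus a threshold.
// Inputs are assumed packed as 0.0/1.0 bits in the r and g channels.
uniform sampler2D bits; // r = A, g = B (hypothetical encoding)

void main() {
    vec4 v = texture2D(bits, gl_TexCoord[0].st);
    // dot() sums A + B in a single instruction (a DP4 on older hardware).
    float sum  = dot(v, vec4(1.0, 1.0, 0.0, 0.0));
    float nor  = 1.0 - step(0.5, sum); // 1.0 only when A + B == 0
    float nand = 1.0 - step(1.5, sum); // 0.0 only when A + B == 2
    gl_FragColor = vec4(nor, nand, 0.0, 1.0);
}

Each pass then evaluates one gate for every pixel of the render target in parallel.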


Edited by WalkingTimeBomb, 10 March 2014 - 06:03 PM.



#2 WalkingTimeBomb   Members   -  Reputation: 4


Posted 10 March 2014 - 06:04 PM

NAND Gate

http://jinjuyu.blog.me/40208164821



#3 Bacterius   Crossbones+   -  Reputation: 9055


Posted 10 March 2014 - 07:35 PM

This is interesting, but don't modern graphics cards (= the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?
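
For reference, a sketch of what doing it natively would look like (GLSL 1.30+ integer textures; the sampler names and bit packing are assumptions for illustration):

#version 130
// Sketch: with native integer ALUs, a single bitwise instruction
// evaluates 32 NOR gates at once. Bit packing is illustrative.
uniform usampler2D bitsA;
uniform usampler2D bitsB;
in vec2 uv;
out uvec4 result;

void main() {
    uint a = texture(bitsA, uv).r;
    uint b = texture(bitsB, uv).r;
    result = uvec4(~(a | b), 0u, 0u, 0u); // 32 NORs in one instruction
}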


The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#4 WalkingTimeBomb   Members   -  Reputation: 4


Posted 25 March 2014 - 12:54 PM

Bacterius, on 10 March 2014 - 07:35 PM, said:
This is interesting, but don't modern graphics cards (= the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?

The MFU (Multi-Function Unit) is SLOW AS HELL. Only DP4 is abundant and fast.



#5 Ohforf sake   Members   -  Reputation: 1832


Posted 26 March 2014 - 10:08 AM

Bacterius, on 10 March 2014 - 07:35 PM, said:
This is interesting, but don't modern graphics cards (= the ones you would use for number crunching) already have unified FP/integer ALUs where bitwise operations can be done natively?


Actually, NVidia cards are rather slow at integer arithmetic. According to the CUDA documentation, a Kepler SMX can do 192 floating-point operations (like add, mul, or mad) per cycle, but only 160 integer add/sub and bitwise and/or/xor operations. Integer mul and shift are as slow as 32 operations per cycle.

This is why ATI/AMD cards are better suited for cryptographic stuff like bitcoin mining or brute-forcing.


I didn't really read the links the OP posted, so the following might be totally off topic, but I think there is a misconception here about DP4. The GeForce 7000 series was the last NVidia GPU that did SIMD, and AMD/ATI followed shortly after. Today, a DP4 is 1 FMUL followed by 3 dependent FMADDs, so it's not 1 cycle: it has a throughput of 1/4 per cycle and ALU if properly pipelined, and a latency of 32 cycles (assuming 8 cycles per operation). So 192 float ALUs with 1 DP4 every 4 cycles yield 48 logical operations per cycle and SMX. If the 160 int ALUs were used instead, you would get 32 logical operations per ALU and cycle, yielding 5120 logical operations per cycle and SMX, outperforming the DP4 approach by more than a factor of 100.
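
To make the arithmetic explicit (numbers taken from the paragraph above; the 8-cycle per-operation latency is the poster's assumption):

192 float ALUs x (1 DP4 per 4 cycles)     =   48 emulated gate ops / cycle / SMX
160 int ALUs   x (32 bits per bitwise op) = 5120 native gate ops / cycle / SMX
5120 / 48 ≈ 107, i.e. more than a factor of 100.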


Edit: I just read the first part of the link, and I think there is another, even bigger misconception. The assumption that the GPU will execute the entire fragment program for all pixels of the image in ONE CYCLE, no matter the dimensions of said image or the length of the fragment program, is ... how do I put this ... incorrect. If that were the case, then yes, any GPU could emulate hardware gates in software at arbitrary speeds as described in the posted link, thereby outperforming even its own hardware (paradox alert). But it isn't.

Edited by Ohforf sake, 26 March 2014 - 10:34 AM.




