
float or double depth buffer?


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
31 replies to this topic

#1 fir   Members   -  Reputation: -460


Posted 16 June 2014 - 01:31 PM

I'm mending and rewriting my software rasterizer (here is a screenshot I presented before):

 

tie3.png

 

I wonder whether I should use float or double for the values in the depth buffer. I've got no idea; could someone give me a hint?




#2 Madhed   Crossbones+   -  Reputation: 3076


Posted 16 June 2014 - 01:33 PM

try both, then make an informed decision



#3 SeanMiddleditch   Members   -  Reputation: 6407


Posted 16 June 2014 - 01:48 PM

Madhed is right. It's easy enough to just try both and compare output quality vs execution time to make your own informed decision based on the specific domain/application you're working with.

You might also consider fixed-point and integral buffers, not entirely uncommon in hardware. Also, a lot of tuning can go into figuring out _what_ you store in your depth buffer (normalized z? w? etc.) rather than just _how_ it's stored.
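One way to make the float-versus-double comparison cheap to run is to hide the storage type behind a typedef, so the same rasterizer can be built and timed both ways. The sketch below is only an illustration: `depth_t`, `DEPTH_USE_DOUBLE`, `depth_clear`, `depth_test_and_set` and `depth_to_u16` are hypothetical names, and the 16-bit conversion assumes the depth has already been normalized to [0, 1].

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical switch: build with -DDEPTH_USE_DOUBLE to time the other variant. */
#ifdef DEPTH_USE_DOUBLE
typedef double depth_t;
#else
typedef float depth_t;
#endif

/* Fill the depth buffer with the farthest value before each frame. */
static void depth_clear(depth_t *buf, size_t count, depth_t far_value)
{
    for (size_t i = 0; i < count; ++i)
        buf[i] = far_value;
}

/* Depth test for one pixel: stores z and returns 1 if it is nearer, else returns 0. */
static int depth_test_and_set(depth_t *buf, size_t index, depth_t z)
{
    if (z < buf[index]) {
        buf[index] = z;
        return 1;
    }
    return 0;
}

/* Integer alternative: quantize a normalized depth in [0,1] to 16 bits. */
static uint16_t depth_to_u16(float z01)
{
    if (z01 < 0.0f) z01 = 0.0f;
    if (z01 > 1.0f) z01 = 1.0f;
    return (uint16_t)(z01 * 65535.0f + 0.5f);
}
```

Timing the same scene with and without the define, and looking at the image for depth artifacts, gives the "informed decision" suggested above.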

#4 JohnnyCode   Members   -  Reputation: 266


Posted 16 June 2014 - 01:56 PM

In the case of a software rasterizer, you should use the native floating-point format in the first place. On many systems that is the 64-bit floating-point number, not the 32-bit one. Falling back to 32-bit on systems where 64-bit is native might even result in a performance decrease (I once set a 16-bit depth buffer on a GPU that preferred 24S8, and the fps halved!)

 

How to find out what the hardware likes and is optimized for is another tale, though. DirectX offers things such as a depth-format check, as far as I remember, but for a software rasterizer you will need something else.



#5 fir   Members   -  Reputation: -460


Posted 16 June 2014 - 02:01 PM

Madhed is right. It's easy enough to just try both and compare output quality vs execution time to make your own informed decision based on the specific domain/application you're working with.

You might also consider fixed-point and integral buffers, not entirely uncommon in hardware. Also, a lot of tuning can go into figuring out _what_ you store in your depth buffer (normalized z? w? etc.) rather than just _how_ it's stored.

Alright, you're probably right, I can do them both.

 

I would also like some hints for general optimization. Right now I'm just using a straightforward scanline approach (walking down the triangle edges from top to bottom and drawing each scanline with depth).

There are probably more elaborate techniques available, but I'm getting lost in this (I've read some threads where people use more optimized approaches) and didn't understand what I should try first.



#6 fir   Members   -  Reputation: -460


Posted 16 June 2014 - 02:10 PM

In the case of a software rasterizer, you should use the native floating-point format in the first place. On many systems that is the 64-bit floating-point number, not the 32-bit one. Falling back to 32-bit on systems where 64-bit is native might even result in a performance decrease (I once set a 16-bit depth buffer on a GPU that preferred 24S8, and the fps halved!)

 

How to find out what the hardware likes and is optimized for is another tale, though. DirectX offers things such as a depth-format check, as far as I remember, but for a software rasterizer you will need something else.

 

I'm using the MinGW compiler and an old Core 2 Duo processor right now (on 32-bit XP). Curiously, I compiled the previous version of this rasterizer with an old Borland compiler (I was very accustomed to it, so I used it for a couple of years), so the generated code was weak (bcc32 is a compiler from around 2000). As far as I remember, the model above ran at about 10 fps there (compiled by bcc32 and run on the Core 2 Duo). Now I want to rewrite it for MinGW and optimize it as far as I can. I could touch assembly (which, sadly, I know only weakly) if I knew what I should actually do, but I've got no idea.


Edited by fir, 16 June 2014 - 02:11 PM.


#7 JohnnyCode   Members   -  Reputation: 266


Posted 16 June 2014 - 02:56 PM

You may set up a 16-bit depth buffer, but the operation itself will be performed on a 32-bit floating-point number after all (or a 64-bit one), so you need to convert and interpret the 2 bytes. This 16-bit reduction can still be reasonable if you want to preserve cache coherency or compute with short integers, but cache coherency of 2D arrays (render targets or texture samplers) is usually optimized for 4-byte or 8-byte storage atoms.

 

Cache coherency in enormous 2D arrays is mainly achieved by subdividing the area smartly and, after a fetch or write, resuming all stalled threads that requested that memory operation, then continuing with threads whose memory operations are close to the current cache burst. So the atom size is quite marginal compared to good management of memory and of the threads that ask for it.

 

You are writing a software rasterizer, but you should still design it to account for many (many, many) available cores, as if it were to run on a GPU. Cache coherency is a crucial thing; if you achieve good cache flow, you will see an enormous speed boost.

 

 



#8 fir   Members   -  Reputation: -460


Posted 16 June 2014 - 04:09 PM


But I've got no idea at all how to obtain that cache coherency.

 

Right now I just have an array of triangles which I loop over and transform -> project -> rasterize with the scanline approach into the frame buffer in a very straightforward way. It doesn't even work right now, as I'm rebuilding the previous half-spaghetti into a slightly tidier system, but in two or three days I will mend it, profile it more and try to think about what I can do to rebuild my straightforward approach into something a bit quicker, if possible.


Edited by fir, 16 June 2014 - 04:11 PM.


#9 JohnnyCode   Members   -  Reputation: 266


Posted 16 June 2014 - 06:38 PM

But I've got no idea at all how to obtain that cache coherency.

For example, if you read some outside memory through a pointer (not cached on the stack) on every cycle (pixel) while reading or writing a big (2D) array of pixels, you have just reduced your speed many times over. It can be 10 or 100 times.



#10 fir   Members   -  Reputation: -460


Posted 17 June 2014 - 02:12 AM

 


 

Here, as input, I've got a raw array of triangles (9 floats each, sadly, as I only have white geometry with no colors).

 

As output I draw scanline triangles to the frame and depth buffers. Each triangle generally jumps around in those buffers, though the triangles are probably somewhat (if not strongly) coherent in the model file.

 

In the middle there is the 3D transformation of the input triangle into eye space, then 3D plane clipping, projection, then 2D clipping. That would be all, if I remember correctly.

 

All of this is somewhat coherent with respect to RAM access, though not 100% coherent in access to the frame and depth buffers. Still, I suspect the calculation in the middle has a bigger impact than the RAM traffic, but I'm not sure.

 

(When I mend it today or tomorrow I will try to profile this a bit.)



#11 Ohforf sake   Members   -  Reputation: 1832


Posted 17 June 2014 - 03:16 AM

I second trying them both (float & double), especially since this seems to be an "experimental" project.
 

in case of software rasterizer, you should use native floating format in first place. On many systems it is the 64bit floating number, not 32bit floating number. Falling to 32bit on such systems where 64bit is native, might result even in performance decrease.


The Intel architecture might be the exception here. Modern Intel/AMD processors are considered to be 64-bit native, but as a (very approximate) rule of thumb, double-precision arithmetic operations take twice as long as single-precision operations (if you use vectorization). In addition, you can only keep half as many double-precision values in your registers (again, assuming vectorization), you can fit only half as many double-precision numbers into your cache, and your bus can only load or store those double-precision numbers half as fast from/to memory. The latter two also hold for non-vectorized code.
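To put rough numbers on the cache and bandwidth argument, the following sketch computes how many depth values fit in one cache line and how large a whole depth buffer is. The 64-byte cache line is an assumption (common on x86 CPUs, but worth checking for the target machine), and the function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines, as on many x86 processors. */
enum { CACHE_LINE_BYTES = 64 };

/* Number of depth values of the given size that fit in one cache line. */
static size_t values_per_cache_line(size_t value_size)
{
    return CACHE_LINE_BYTES / value_size;
}

/* Total size in bytes of a width x height depth buffer. */
static size_t depth_buffer_bytes(size_t width, size_t height, size_t value_size)
{
    return width * height * value_size;
}
```

With 4-byte floats and 8-byte doubles, a cache line holds 16 floats but only 8 doubles, and a 1280x1024 depth buffer takes 5 MiB as float versus 10 MiB as double.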


As for memory coherency, tiled rendering might help you out here. I believe this is also used by mobile rendering hardware to deal with the slow memory on mobile devices. Since system memory is significantly slower than video memory, you might be facing the same problems. And after you have assigned the geometry to its tiles, processing the tiles in a multithreaded way is absolutely straightforward.
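A minimal sketch of the binning step of tiled rendering might look like the following. It assumes triangles are already projected to screen space, and it assigns each one conservatively to every tile its bounding box touches. `TILE_SIZE`, `Tri2D`, `TileBin`, `bin_push` and `bin_triangle` are hypothetical names chosen for this illustration, not part of any existing renderer.

```c
#include <stdlib.h>

enum { TILE_SIZE = 32 };  /* hypothetical tile edge length in pixels */

typedef struct { float x[3], y[3]; } Tri2D;  /* projected, screen-space triangle */

typedef struct {
    int *tri_index;       /* indices of triangles overlapping this tile */
    int count, capacity;
} TileBin;

static float min3(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
static float max3(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }

/* Append a triangle index to one bin; returns 0 if memory runs out. */
static int bin_push(TileBin *bin, int tri)
{
    if (bin->count == bin->capacity) {
        int new_cap = bin->capacity ? bin->capacity * 2 : 8;
        int *p = realloc(bin->tri_index, (size_t)new_cap * sizeof *p);
        if (!p) return 0;
        bin->tri_index = p;
        bin->capacity = new_cap;
    }
    bin->tri_index[bin->count++] = tri;
    return 1;
}

/* Add triangle `tri` to every tile its bounding box overlaps (conservative).
   `bins` holds tiles_x * tiles_y zero-initialized entries, stored row by row. */
static int bin_triangle(TileBin *bins, int tiles_x, int tiles_y,
                        const Tri2D *t, int tri)
{
    float w = (float)(tiles_x * TILE_SIZE), h = (float)(tiles_y * TILE_SIZE);
    float x0 = min3(t->x[0], t->x[1], t->x[2]), x1 = max3(t->x[0], t->x[1], t->x[2]);
    float y0 = min3(t->y[0], t->y[1], t->y[2]), y1 = max3(t->y[0], t->y[1], t->y[2]);

    if (x1 < 0.0f || y1 < 0.0f || x0 >= w || y0 >= h)
        return 1;                       /* entirely off screen: nothing to bin */
    if (x0 < 0.0f) x0 = 0.0f;           /* clamp the box to the screen */
    if (y0 < 0.0f) y0 = 0.0f;
    if (x1 > w - 1.0f) x1 = w - 1.0f;
    if (y1 > h - 1.0f) y1 = h - 1.0f;

    for (int ty = (int)y0 / TILE_SIZE; ty <= (int)y1 / TILE_SIZE; ++ty)
        for (int tx = (int)x0 / TILE_SIZE; tx <= (int)x1 / TILE_SIZE; ++tx)
            if (!bin_push(&bins[ty * tiles_x + tx], tri))
                return 0;
    return 1;
}
```

After binning, each tile can be rasterized on its own, touching only its small block of the frame and depth buffers, which is what keeps the working set in cache and makes per-tile threading simple.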

#12 fir   Members   -  Reputation: -460


Posted 17 June 2014 - 04:21 AM


 

Right now it is single-threaded and I would like to optimize the single-threaded version first (I previously tried running another project of mine on two cores and it worked nicely, practically a 2x speedup, but since it complicates the code a bit I would first like to optimize for one core as far as I can).



#13 fir   Members   -  Reputation: -460


Posted 17 June 2014 - 05:08 PM

PS: in case someone would like to run the exe (a slightly different version of this rasterizer; I spoiled the colors, the camera movement and maybe even the precision/quality of the image, I'm not sure, but at present I've got no time to tweak it):

 

https://www.dropbox.com/s/6618w9uf10x5lor/tie50.zip

 

(win32 app)

 

tie50.jpg

 

It would be great if someone could say how many milliseconds it takes there, and on what machine.

 

PS: I don't know why my Google Chrome browser identifies it as malware (maybe because the exe is stripped for size). As far as I can be sure, it is not malware; this program doesn't change anything on the system, and it shouldn't be any kind of old-school virus either. But if someone could check it with an antivirus to be sure, that would be good.


Edited by fir, 17 June 2014 - 05:09 PM.


#14 Vilem Otte   Crossbones+   -  Reputation: 1466


Posted 17 June 2014 - 05:58 PM

I've tried it - I can confirm that no malware or old school virus has been installed (or it overcame my AV here). Regarding speed, I got 22ms to 44ms on Haswell-based Core i3 (in laptop). So it was quite fast...

 

What was visible though, your depth test isn't precise, the parts on hull flickered, dunno why (maybe it was just on my machine, or it's just setup of projection matrix that has intentionally set up near and far plane to test the precision).


Edited by Vilem Otte, 17 June 2014 - 05:59 PM.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com


#15 fir   Members   -  Reputation: -460


Posted 19 June 2014 - 06:26 PM


Alright, thanks for testing.

 

I put a new version under the same link:

 

 https://www.dropbox.com/s/6618w9uf10x5lor/tie50.zip

 

I optimized it a bit there. Could you (or some other person too) test it and say how many milliseconds it takes, and on what CPU?

 

I would like to optimize it, as I said, but I've got no idea how to do it.

 

pla.jpg

 

One more thing is the plasticness of the lighting. I've got no idea how to avoid it; my simple flat shading always looks unpleasantly, horribly plastic.



#16 Samith   Members   -  Reputation: 2260


Posted 19 June 2014 - 06:57 PM


One more thing is the plasticness of the lighting. I've got no idea how to avoid it; my simple flat shading always looks unpleasantly, horribly plastic.

 

There are many, many lighting models you can experiment with to get better-looking shading. Look up Cook-Torrance for specular highlights or Oren-Nayar for diffuse lighting.
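For reference, the commonly quoted simplified (qualitative) form of the Oren-Nayar diffuse model is given below. The symbols are the usual ones and are stated here as assumptions about notation: \rho is the surface albedo, E_0 the incoming irradiance, \sigma the roughness, \theta_i and \theta_r the angles of the light and view directions from the normal, and \phi_i - \phi_r the azimuth difference between them.

```latex
L_r = \frac{\rho}{\pi}\, E_0 \cos\theta_i
      \left( A + B \,\max\bigl(0, \cos(\phi_i - \phi_r)\bigr)\, \sin\alpha \,\tan\beta \right),
\qquad
A = 1 - 0.5\,\frac{\sigma^2}{\sigma^2 + 0.33},
\quad
B = 0.45\,\frac{\sigma^2}{\sigma^2 + 0.09},
\quad
\alpha = \max(\theta_i, \theta_r),
\quad
\beta = \min(\theta_i, \theta_r)
```

With \sigma = 0 this gives A = 1 and B = 0, so the model reduces to plain Lambert shading; larger \sigma flattens the falloff toward grazing angles, which tends to look less like smooth plastic.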

 

I ran your program. It runs at about 22ms (edit: more detailed info below) with the tie fighter taking up most of the screen. I've got an Intel Core i5 4670K.

 

Like Vilem Otte, I saw a lot of z-fighting artifacts. I know from your posting history that you have somewhat of a vendetta against near planes. Is it possible you're creating a projection matrix with an absurdly close near plane? If so, you're destroying tons of precision in your z-buffer. If you continue to want to put the near plane close to (or at) zero, you'll have to look into w-buffering, because z-buffering won't be practical.
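The precision loss from a very close near plane can be seen by evaluating the depth mapping directly. The sketch below assumes a Direct3D-style perspective mapping of eye-space distance to [0, 1] and compares it with a linear w-buffer value; the function names are hypothetical, and the actual mapping depends on how the projection is set up.

```c
/* Hyperbolic z-buffer value in [0,1] for an eye-space distance z
   (assumed Direct3D-style convention: near maps to 0, far maps to 1). */
static double zbuffer_value(double z, double near_plane, double far_plane)
{
    return (far_plane / (far_plane - near_plane)) * (1.0 - near_plane / z);
}

/* Linear "w-buffer" value in [0,1]: depth is simply proportional to distance. */
static double wbuffer_value(double z, double far_plane)
{
    return z / far_plane;
}
```

With near = 0.001 and far = 1000, `zbuffer_value(1.0, 0.001, 1000.0)` is about 0.999, so everything between distance 1 and distance 1000 has to share the last 0.1% of the range. With near = 1, a point at distance 2 lands near the middle of the range instead. The linear mapping spreads values evenly regardless of the near plane.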

 

EDIT: Pretty cool, though. A software rasterizer is one of the things I'd like to try writing someday. Anyway, here are some times for each specific exe you included in your download:

 

tie50: 29ms

tie51: 20ms

tie53: 21ms

tie54: 20ms


Edited by Samith, 19 June 2014 - 07:01 PM.


#17 Vilem Otte   Crossbones+   -  Reputation: 1466


Posted 19 June 2014 - 07:19 PM

OK, I've tried the different versions (lowest time - no triangle on screen, highest time - closeup view, triangles all over the screen):

tie50.exe -> from 22ms to 44ms

tie51.exe -> from 19ms to 31ms

tie53.exe -> from 22ms to 43ms
tie53.exe -> from 19ms to 38ms



#18 Bacterius   Crossbones+   -  Reputation: 9066


Posted 19 June 2014 - 09:12 PM

I'm getting similar results to everyone else, 13ms to 45ms and averaging 25ms roughly (2nd generation i5 at 3.7GHz), single-threaded at 550x400 resolution (the default). Lots of artifacts at close range. It's pretty slow in fullscreen mode, which is to be expected for a software renderer, however I am seeing a lot of screen tearing - are you displaying each rendered scanline on the fly? Perhaps double-buffering could help with that, and might also simplify your pipeline.
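The double-buffering idea can be sketched independently of the windowing system: render into a back buffer, then swap it with the front buffer once the frame is complete, so that only finished frames are ever presented. `FrameBuffers`, `fb_init`, `fb_swap` and `fb_free` are hypothetical names, and how the front buffer is actually copied to the window is platform-specific and left out.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t *front;   /* last complete frame, safe to present */
    uint32_t *back;    /* frame currently being rendered */
    int width, height;
} FrameBuffers;

/* Allocate both buffers zero-filled; returns 0 on allocation failure. */
static int fb_init(FrameBuffers *fb, int width, int height)
{
    size_t n = (size_t)width * (size_t)height;
    fb->front = calloc(n, sizeof *fb->front);
    fb->back = calloc(n, sizeof *fb->back);
    fb->width = width;
    fb->height = height;
    if (!fb->front || !fb->back) {
        free(fb->front);
        free(fb->back);
        fb->front = fb->back = NULL;
        return 0;
    }
    return 1;
}

/* Call once per frame, after rendering into `back` has finished. */
static void fb_swap(FrameBuffers *fb)
{
    uint32_t *tmp = fb->front;
    fb->front = fb->back;
    fb->back = tmp;
}

static void fb_free(FrameBuffers *fb)
{
    free(fb->front);
    free(fb->back);
    fb->front = fb->back = NULL;
}
```

The rasterizer writes only to `back`, and the window is updated with one copy of `front` per frame, which should avoid showing half-drawn frames (tearing against the display refresh is a separate matter).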




#19 fir   Members   -  Reputation: -460


Posted 20 June 2014 - 02:01 AM

Alright, many thanks for the test.

 

tie54 is the most recent one (the others are older versions that I didn't want to delete, but tie54 is the most interesting to test).

 

I forgot that I've got a 5 ms sleep in there; if you hold F1 it runs at full speed.

 

Pressing W turns it into wireframe (shaded and depth-buffered, though), which curiously is twice as slow as the triangles.

 

Holding Space stops the rotation; X starts it again.

 

Rendering looks better at the native fullscreen resolution (I've got a shortcut for this):

 

     if(control_pressed)     {
      if(key=='1') ChangeResolution(320, 200, 32, 75);
      if(key=='2') ChangeResolution(512, 384, 32, 75);
      if(key=='3') ChangeResolution(640, 480, 32, 75);
      if(key=='4') ChangeResolution(800, 600, 32, 75);
      if(key=='5') ChangeResolution(1024, 768, 32, 75);
      if(key=='6') ChangeResolution(1280, 1024, 32, 75);
     }
 

 

Curiously, on my old P4 machine with XP SP2 I could even change the resolution this way to 320x200 or the exotic 512x384; now SP3 probably blocks it, which is very sad.

 

This program looks better in higher resolutions, but it was always nice to run something in 320x200 ;/

 

As I said, mouse wheel up toggles fullscreen and mouse wheel down toggles camera mode/free mouse mode, though I'm not showing the arrow (cursor) in free mouse mode, so it may be confusing (got no time).

Holding left Shift while moving the mouse strafes the camera; holding left Control while moving the mouse rolls it.

 

(PS: don't press F6, as it makes the window 4000x4000; then F4 will save the framebuffer into a raw 64 MB file, which I can convert to bitmap/jpg using IrfanView. I should turn this off and mend some slight mouse bugs, but I was testing other things, so it works chaotically. I'm not quite happy with the outcome of all this; I know I can often get a better feel than here.)

 

90% of the coolness here is, I'm afraid, due to this amazing model made by someone; I took it from blender models com (it is pure geometry, all white; I'm curious how something with color would look, but I can't read colored models yet).

 

 

 

As for z-fighting, I don't know what you mean (I'm not very experienced in 3D graphics, especially modern 3D graphics; I'm only doing something like computer graphics theory from books from the '80s).

 

I can get 3 or 4 types of artifacts here, as far as I'm aware:

 

 1) I'm no fan of the near plane (I would prefer not to use it), but here I'm using it because it makes clipping very easy (it is somewhat dirty clipping). I might have set it too high (it's probably set to 10) by mistake. It has no influence on the values written to the depth buffer, as there I just store the z distance as a float (I decided float is better than double, as it's sufficient). I also just throw away a triangle if even one vertex is behind this near plane (so many near triangles will pop out). *

 

* If someone could help me do proper clipping of a triangle that intersects the near plane I would be thankful; for now I just throw it out.

 

2) I'm doing a quick 'vertex depth' 'clipping': before rasterizing each triangle I just test whether its 3 vertices are behind the depth buffer. If they are all behind, I throw away the whole triangle (this sped things up nearly 2x and gives only a slight decrease in visibility). As far as I remember, pressing V skips this clipping.

 

I remember I saw some of the middle faces of the wings flickering when rotating, but sadly I'm not at all sure what the reason could be.

 

3) There are x-y artifacts making the image 'grainy' in low resolution, but it looks much better in a higher one.

 

I'm not sure whether I could eliminate them (without rendering at a higher resolution and averaging down); I would very much like to, but I'm not sure.

 

When a triangle is transformed and projected, its float x, y coordinates are simply cast to ints (so if I've got vertices at (0.1, 0.5) and (0.9, 0.7), both simply end up at point (0, 0)). Maybe I should try some subpixel triangle rasterization here to achieve better results at low cost, but I've got no idea yet how to do it. Some advice would be helpful; I could then try it and show the results.

 

4) Maybe there is yet another source of artifacts; it's possible, but nothing comes to mind.
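On the near-plane clipping asked about under point 1: one standard technique is Sutherland-Hodgman clipping of the triangle against the single plane z = near. The sketch below assumes eye-space coordinates with z increasing away from the camera, matching the "store z distance" description; `Vec3`, `intersect_near` and `clip_triangle_near` are hypothetical names, and any per-vertex attributes would need to be interpolated with the same factor `t`.

```c
typedef struct { float x, y, z; } Vec3;

/* Point where the segment a-b crosses the plane z = near_z.
   Only called when a and b lie on opposite sides, so a.z != b.z. */
static Vec3 intersect_near(Vec3 a, Vec3 b, float near_z)
{
    float t = (near_z - a.z) / (b.z - a.z);
    Vec3 r = { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y), near_z };
    return r;
}

/* Clip a triangle against z >= near_z (Sutherland-Hodgman, one plane).
   Writes 0, 3 or 4 vertices to `out` and returns the count; the winding
   order of the input is preserved. */
static int clip_triangle_near(const Vec3 in[3], float near_z, Vec3 out[4])
{
    int n = 0;
    for (int i = 0; i < 3; ++i) {
        Vec3 cur = in[i];
        Vec3 next = in[(i + 1) % 3];
        int cur_in = cur.z >= near_z;
        int next_in = next.z >= near_z;
        if (cur_in)
            out[n++] = cur;                       /* keep visible vertex */
        if (cur_in != next_in)
            out[n++] = intersect_near(cur, next, near_z);  /* edge crosses plane */
    }
    return n;
}
```

A result of 3 vertices is drawn as one triangle; a result of 4 can be split into the triangles (0, 1, 2) and (0, 2, 3). This way triangles that straddle the plane are trimmed rather than discarded, so they should no longer pop out.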

 

 

I'm not sure which kind of artifacts you mean: the vanishing near triangles, some flickering farther ones, or the x-y 2D granularity in low resolution.
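For the x-y graininess described under point 3, a common remedy in scanline rasterizers is to keep the projected coordinates as floats and decide coverage by pixel centers, instead of casting vertices to ints. The following is a sketch of that convention only; `ceil_to_int`, `first_covered`, `edge_x_at_scanline` and `span_for_scanline` are hypothetical helper names.

```c
/* Smallest integer >= v (avoids a dependency on the math library). */
static int ceil_to_int(float v)
{
    int i = (int)v;            /* truncates toward zero */
    if ((float)i < v) ++i;
    return i;
}

/* Index of the first pixel whose center (i + 0.5) lies at or beyond `edge`. */
static int first_covered(float edge)
{
    return ceil_to_int(edge - 0.5f);
}

/* x position of the edge (x0,y0)-(x1,y1) at the center of scanline y.
   Assumes y0 != y1 (horizontal edges are skipped by the caller). */
static float edge_x_at_scanline(float x0, float y0, float x1, float y1, int y)
{
    float t = ((float)y + 0.5f - y0) / (y1 - y0);
    return x0 + t * (x1 - x0);
}

/* Half-open pixel span [*x_begin, *x_end) between the left and right edge. */
static void span_for_scanline(float x_left, float x_right, int *x_begin, int *x_end)
{
    *x_begin = first_covered(x_left);
    *x_end = first_covered(x_right);
}
```

Because the spans are half-open and based on pixel centers, two triangles sharing an edge fill each pixel along it exactly once, and slivers narrower than a pixel only draw where they actually cover a pixel center. The same rule applies vertically: scanlines run from `first_covered(y_top)` up to, but not including, `first_covered(y_bottom)`.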

 

All in all, the very plastic lighting gets me down the most. I'm looking for some cheap way of removing this plastic feel.

 

I will also try an approach to rasterization other than scanline, the one by Nick Capens, though a bit later. I'm working on this, so I will probably ask a couple more questions about possible ways of improving things.
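The non-scanline approach mentioned here is usually called half-space (or edge-function) rasterization. The sketch below shows only the basic idea and is not Nick Capens' actual code: it uses floats instead of fixed point, has no block traversal and no top-left fill rule, and simply tests every pixel center in the triangle's bounding box against the three edges. All names are hypothetical.

```c
#include <stdint.h>

typedef struct { float x, y; } Vec2;

/* Twice the signed area of (a, b, p); the sign tells on which side
   of the directed edge a->b the point p lies. */
static float edge_fn(Vec2 a, Vec2 b, Vec2 p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Fill the triangle into an 8-bit buffer; returns the number of pixels set.
   Pixels exactly on an edge are included (no fill rule in this sketch). */
static int rasterize_halfspace(uint8_t *buf, int width, int height,
                               Vec2 v0, Vec2 v1, Vec2 v2)
{
    float area = edge_fn(v0, v1, v2);
    if (area == 0.0f) return 0;                 /* degenerate triangle */

    float minx = v0.x < v1.x ? v0.x : v1.x; if (v2.x < minx) minx = v2.x;
    float maxx = v0.x > v1.x ? v0.x : v1.x; if (v2.x > maxx) maxx = v2.x;
    float miny = v0.y < v1.y ? v0.y : v1.y; if (v2.y < miny) miny = v2.y;
    float maxy = v0.y > v1.y ? v0.y : v1.y; if (v2.y > maxy) maxy = v2.y;

    /* Bounding box clamped to the screen, in whole pixels. */
    int x0 = (int)clampf(minx, 0.0f, (float)(width - 1));
    int x1 = (int)clampf(maxx, 0.0f, (float)(width - 1));
    int y0 = (int)clampf(miny, 0.0f, (float)(height - 1));
    int y1 = (int)clampf(maxy, 0.0f, (float)(height - 1));

    int covered = 0;
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            Vec2 p = { (float)x + 0.5f, (float)y + 0.5f };  /* pixel center */
            float w0 = edge_fn(v1, v2, p);
            float w1 = edge_fn(v2, v0, p);
            float w2 = edge_fn(v0, v1, p);
            /* Inside when all three edge values share the sign of the area. */
            int inside = area > 0.0f ? (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)
                                     : (w0 <= 0.0f && w1 <= 0.0f && w2 <= 0.0f);
            if (inside) {
                buf[y * width + x] = 255;
                ++covered;
            }
        }
    }
    return covered;
}
```

Dividing `w0`, `w1` and `w2` by `area` gives barycentric weights that can interpolate depth per pixel. Since the edge functions are linear, a practical version would step them incrementally per pixel instead of recomputing them, and would add a fill rule so shared edges are not drawn twice.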

 

Thanks for testing.


Edited by fir, 20 June 2014 - 02:09 AM.


#20 fir   Members   -  Reputation: -460


Posted 20 June 2014 - 04:11 AM

I'm getting similar results to everyone else, 13ms to 45ms and averaging 25ms roughly (2nd generation i5 at 3.7GHz), single-threaded at 550x400 resolution (the default). Lots of artifacts at close range. It's pretty slow in fullscreen mode, which is to be expected for a software renderer, however I am seeing a lot of screen tearing - are you displaying each rendered scanline on the fly? Perhaps double-buffering could help with that, and might also simplify your pipeline.

 

Could you say what you mean by fullscreen and pretty slow (what resolution)? (I know the timer is not visible there, but it should be about the same as when maximized to the desktop.)

You gave me the numbers (about 2x faster than my old Core 2) but did not provide the client resolution.

(I get about 75 ms at 1200x1000 when flying in so that the ship covers most of the screen.)





