# L. Spiro

1. ## How To Convert double To float

When dealing with normalized numbers, you just need to make the exponents represent the same value: get the real exponent from the source by taking the exponent integer and removing the bias, then store it in the new set of bits with the new bias. If the exponents are the same, the rest of the number is converted just by shaving bits off the mantissa. You can round the number by adding the highest bit shaved off.

My original implementation kept the sign, exponent, and mantissa as separate numbers, so it would have produced incorrect results when the mantissa was all F's and a 1 was added from the shaved bits. My mantissa would wrap to 0 but the exponent wouldn't have increased by 1 as it should.

So now the problem with denormalized numbers can be made more general and refer instead to any case where the exponents in the source and destination are different (which will always be the case for denormalized numbers with different amounts of bits for the exponent). The most common case is with denormals, so I will go a bit into that.

Denormalized numbers lose the implicit 1 on the mantissa. For normalized cases the mantissa always increases the value based off the exponent (the value is always 2^Exponent * [1, 1.999999...]) whereas for denormalized cases the mantissa always decreases it (2^Exponent * [0.999999..., 0]). You can't arrive at a general solution just with bit-shifting tricks now. You have to take the source number and the denormalized exponent bias on the destination to determine what mantissa value will best match the source value when multiplied. In other words, from a double to a float, ::pow( 2, -126 ) * X ≈ SrcValue, solve for X.
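A minimal sketch of that solve step for the double-to-float case. The helper name DenormalMantissa is made up for illustration (it is not from the converter described here), and it assumes the source magnitude is below 2^-126 and that the default round-to-nearest mode is active. Since X is the mantissa field divided by 2^23, solving for X amounts to scaling the source by 2^149 and rounding to an integer.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: the 23-bit mantissa field of the float denormal
// closest to |src|. Assumes |src| < 2^-126 and round-to-nearest mode.
std::uint32_t DenormalMantissa(double src) {
    // Solve 2^-126 * X = |src| for X, where X = mantissa / 2^23,
    // so mantissa = |src| * 2^(126 + 23). ldexp scales by a power of
    // two, which is exact here because a double has exponent range to spare.
    const double scaled = std::ldexp(std::fabs(src), 126 + 23);
    // nearbyint rounds using the current rounding mode.
    return static_cast<std::uint32_t>(std::nearbyint(scaled));
}
```

If the rounded result comes out as 2^23, the bit pattern is the same as exponent field 1 with a zero mantissa, which is the smallest normalized float, so keeping everything in one integer appears to cover that boundary case without a separate patch.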
Now you have one case generalized for values where the exponent in both numbers matches and another generalized way to handle denormalized cases. Then you patch for cases where the smallest normalized number is closer to the source number than the highest denormalized number, plus rounding into Inf, and you work with a single integer so that adding to the mantissa properly rolls up the exponent when necessary.

Great! Now throw that all away and copy this guy's code: https://stackoverflow.com/a/3542975

He hard-coded it from 32-bit floats to 16-bit floats. I generalized it to go from 64-bit floats to anything else. It works properly in all cases, including going into Inf and NaN, but doesn't allow specifying a rounding mode explicitly. I will have to add that later.

L. Spiro
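For illustration, the normalized path with the single-integer trick might look roughly like the sketch below. It is not the linked Stack Overflow code or the generalized converter: DoubleToFloatBitsNormal is a hypothetical name, it assumes the source is a normalized double whose unbiased exponent lies in [-126, 127], and it rounds by adding the highest shaved bit, which differs from round-to-nearest-even on exact ties.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: float bit pattern for a normalized double.
// Assumes the unbiased exponent of src is within [-126, 127].
std::uint32_t DoubleToFloatBitsNormal(double src) {
    std::uint64_t bits;
    std::memcpy(&bits, &src, sizeof bits);

    const std::uint32_t sign = static_cast<std::uint32_t>(bits >> 63) << 31;
    // Remove the double bias (1023) to get the real exponent.
    const int exponent = static_cast<int>((bits >> 52) & 0x7FF) - 1023;
    const std::uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull;  // 52 bits

    // Apply the float bias (127) and keep the top 23 mantissa bits,
    // with exponent and mantissa packed into ONE integer.
    std::uint32_t out = (static_cast<std::uint32_t>(exponent + 127) << 23) |
                        static_cast<std::uint32_t>(mantissa >> 29);

    // Round by adding the highest shaved bit (bit 28). If the mantissa
    // was all ones, the carry rolls into the exponent field, and from
    // the largest exponent it lands on the Inf bit pattern.
    out += static_cast<std::uint32_t>((mantissa >> 28) & 1);
    return sign | out;
}
```

With 1.0 / 3 as input this produces 3EAAAAABh, and a double just below 2.0 (mantissa all F's) carries into the exponent and produces 40000000h, which is 2.0f.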
2. ## How To Convert double To float

Today I got it fully compliant with IEEE standards, with full support for denormalized values. I created some intrinsics that allow you to specify the properties of floats so you can make any kind of float you want as long as no components are larger than in a 64-bit double (a limitation I might handle in the future). Here are some examples.

- as_float10( 1.0 / 3 ) 0.328125 (3EA80000h, 3FD5000000000000h)
- as_float11( 1.0 / 3 ) 0.33203125 (3EAA0000h, 3FD5400000000000h)
- as_float14( 1.0 / 3 ) 0.3330078125 (3EAA8000h, 3FD5500000000000h)
- as_float16( 1.0 / 3 ) 0.333251953125 (3EAAA000h, 3FD5540000000000h)
- as_float32( 1.0 / 3 ) 0.3333333432674407958984375 (3EAAAAABh, 3FD5555560000000h)
- as_float64( 1.0 / 3 ) 0.333333333333333314829616256247390992939472198486328125 (3EAAAAABh, 3FD5555555555555h)

These are just shortcuts for the common formats you might encounter. For a custom type you can use the full intrinsic:

- as_float( 1, 7, 20, true, 1.0 / 3 ) // as_float( signBits, expBits, manBits, implicitMantissa, value )
  0.33333301544189453125 (3EAAAAA0h, 3FD5555400000000h)

I will be adding more features to get properties of custom floats too. For example:

- as_float_max( 1, 7, 20, true ) // Gets the maximum value for the given type of float.
- as_float_min( 1, 7, 20, true ) // Gets the min non-0 value for the given type of float.
- as_float_min_n( 1, 7, 20, true ) // Gets the min non-0 normalized value for the given type of float.
- as_float_inf( 1, 7, 20, true ) // Etc.

I also plan options to display the components of floats separately (sign, exponent, and mantissa), and more features.

L. Spiro
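To see where the shortened values come from, here is a minimal sketch that simply masks off the low mantissa bits of a double. It is not the as_float intrinsic: TruncateMantissa is a hypothetical helper, it truncates rather than rounds, it ignores the narrower exponent range, and the explicit mantissa widths mentioned afterwards (5 and 10 bits) are assumptions about the shortcut formats.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not as_float): keep only the top `manBits`
// explicit mantissa bits of a double and zero the rest (truncation).
// Assumes 0 <= manBits <= 52; the exponent range is left untouched.
double TruncateMantissa(double value, int manBits) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    // Mask that clears the lowest (52 - manBits) mantissa bits.
    const std::uint64_t mask = ~((std::uint64_t{1} << (52 - manBits)) - 1);
    bits &= mask;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Under those assumptions, TruncateMantissa( 1.0 / 3, 5 ) gives 0.328125 and TruncateMantissa( 1.0 / 3, 10 ) gives 0.333251953125, the same values listed for as_float10 and as_float16 above.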
3. ## How To Convert double To float

You’ve missed the point. The IEEE standard would give me guidelines that I can generalize to any floating-point format of any combination of bits, as demonstrated above. In the example of PI I gave, you can see how it degraded in precision based on the number of bits I assigned to the exponent and mantissa: from double = 3.1415926535897931, to float = 3.1415927410125732, to float16 = 3.1406250000000000. It’s exactly this degradation that is important for programmers in general to see and investigate, and especially for graphics programmers.

And again, note that I am not relying on a CPU cast. I am doing the cast manually, so I am not worried about what the compiler will cast etc., and I am not restricted in the floating-point type. In the last example I literally just invented a 38-bit float. That’s the whole point. I need to cast manually because I need to inspect float types that are not natively supported in C/C++. If I were to get clear documentation, or better yet pseudocode showing the micro-instruction process for converting a double to a float, I would be able to generalize it for my purposes to cast to anything.

Every number I posted above actually came out of my converter, which I wrote just last night. Even the maximum float value came from my implementation rather than from looking at FLT_MAX. The fact that my class generates the same value as FLT_MAX is just because my implementation is 100% correct for normalized numbers. That part is a completely solved area. I want to look at standards and example implementations so that I can be confident I’ve handled all edge cases, specifically those dealing with denormalized numbers.

L. Spiro
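As an illustration of how limits such as the maximum value can be derived from the bit counts alone, here is a minimal sketch. FloatFormat and the three functions are hypothetical names, not the class discussed in this post, and the sketch assumes an IEEE-style layout: an implicit leading 1, a bias of 2^(expBits - 1) - 1, and the top exponent code reserved for Inf/NaN.

```cpp
#include <cmath>

// Hypothetical descriptor for an IEEE-style format with an implicit bit.
struct FloatFormat {
    int expBits;  // width of the exponent field
    int manBits;  // explicit mantissa bits
};

// Exponent bias: 127 for 8 exponent bits, 15 for 5.
int Bias(FloatFormat f) { return (1 << (f.expBits - 1)) - 1; }

// Largest finite value: mantissa all ones (2 - 2^-manBits) at the
// highest exponent that is not reserved for Inf/NaN.
double MaxFinite(FloatFormat f) {
    return std::ldexp(2.0 - std::ldexp(1.0, -f.manBits), Bias(f));
}

// Smallest normalized value: exponent field 1, mantissa 0.
double MinNormal(FloatFormat f) { return std::ldexp(1.0, 1 - Bias(f)); }

// Smallest denormalized value: exponent field 0, mantissa 1.
double MinDenormal(FloatFormat f) {
    return std::ldexp(1.0, 1 - Bias(f) - f.manBits);
}
```

For FloatFormat{ 8, 23 } MaxFinite returns the same value as FLT_MAX and MinNormal returns 2^-126, and for FloatFormat{ 5, 10 } MaxFinite returns 65504, the largest half-precision value.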

5. ## Physically Accurate Material Layering

If you are still interested, we (at tri-Ace) published this paper on an efficient physically based layering system. http://research.tri-ace.com/Data/s2012_beyond_CourseNotes.pdf The Bouguer-Lambert-Beer law is mentioned and it is explained how we improved upon its performance. How IBL fits in is explained as well. L. Spiro
6. ## DirectXMath's XM_CALLCONV

I would. It is always an error to prioritize subjective aesthetics over functionality. It is always a mistake to ask permission from your compiler to implement a hack or "alternative" code. Checking that something works on your compiler only proves that it works as intended on one compiler. They literally gave a warning that without using this macro as a calling convention your code might not run correctly depending on your compiler and architecture. It is never valid to hinder code's portability simply because of your subjective views on aesthetics. L. Spiro
7. ## I'm so confused

Your topic looked like spam to someone. Don’t take it personally. L. Spiro
8. ## Goodbye!

L. Spiro

Yes, but unless you pass extra parameters that means all of your shadows have to have the same resolution.

I don’t think NVIDIA is different. In either case, sampling a cube map actually emits a series of intrinsics that give the face index and 2D coordinates. Since consoles expose these intrinsics, my routines for Xbox One and PlayStation 4 are instruction-for-instruction exactly the same as a cube sample, except for one extra instruction to increase my Y coordinate based off the face index. My routine for Windows can’t use the intrinsics but should compile to the same thing. I don’t know of any open-source implementations as mine are derived from looking at shader assembly.

Clearing can be done with a single call, which is a win on any platform that clears by just setting a flag, where the time is dominated by jumping back and forth between the driver and user code, etc. It is less of a win for platforms that modify each pixel, but still a slight win. Filling requires no render-target swaps.

Filtering becomes a win because you can easily use any shadow filtering you wish. As mentioned by JoeJ, you widen the projection for each cube face by a specified number of pixels, so for example if you have a 512×512 texture and you want to widen the projection by exactly 3 pixels, your field-of-view will be 90.33473583181500191937274374069° instead of 90°. Now you have 3 border pixels to sample for any kind of filtering you wish to use, with no complicated math to sample across faces etc. This also allows all of your shadows to have a unified look, as you will no longer have to use one filter for spot lights and a simpler one for point lights.

L. Spiro
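The field-of-view figure quoted above can be reproduced with a short calculation. This is a sketch under an assumption: the formula tan( fov / 2 ) = ( size + border ) / size is inferred from the quoted number rather than taken from the original routines, and WidenedFovDegrees is a hypothetical name.

```cpp
#include <cmath>

// Hypothetical helper: field of view (in degrees) of a cube face whose
// 90-degree projection is widened by `borderPixels` on a square texture
// of `textureSize` pixels. The formula is an assumption inferred from
// the figure quoted in the post.
double WidenedFovDegrees(double textureSize, double borderPixels) {
    const double kPi = 3.14159265358979323846;
    // A 90-degree face has tan(fov / 2) == 1; widening scales that tangent.
    const double halfTan = (textureSize + borderPixels) / textureSize;
    return 2.0 * std::atan(halfTan) * 180.0 / kPi;
}
```

WidenedFovDegrees( 512.0, 3.0 ) evaluates to approximately 90.334736, which agrees with the value given for the 512×512 example.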