
Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Yesterday, 04:39 PM

#5182132 Problem with Constant Buffer Size.

Posted by Matias Goldberg on Yesterday, 10:07 AM

The 48 bytes come from a mistake in your C++ code, not in your HLSL code.




#5182015 Cache coherence and containers

Posted by Matias Goldberg on 21 September 2014 - 10:14 PM

std::map< int* > mymap;

That is not valid code. std::map requires a key and a value.
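
For illustration only, here is a minimal sketch of what a valid declaration could look like. The `int` key and the helper names `makeMap` and `makeSet` are assumptions, not taken from the original thread; if only the pointers themselves matter, `std::set` takes a single type parameter.

```cpp
#include <map>
#include <set>

// Hypothetical example: std::map needs both a key type and a value type.
// Here an integer ID (assumed) is mapped to a pointer.
std::map<int, int*> makeMap( int *value )
{
    std::map<int, int*> mymap;
    mymap[42] = value;
    return mymap;
}

// If there is no separate value to look up, std::set takes one type only.
std::set<int*> makeSet( int *value )
{
    std::set<int*> myset;
    myset.insert( value );
    return myset;
}
```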




#5181109 Why does divison by 0 always result in negative NaN?

Posted by Matias Goldberg on 17 September 2014 - 02:29 PM

The C++ standard does not mandate that the IEEE standard should be used for floating point.

Case in point, the PS2 did not follow it and anything divided by 0 returned 0, even 0 / 0 (which was a nice property for normalizing vectors and quaternions without having to check for null length).

 

Perhaps it's unfortunate that the C++ std says "undefined behavior", instead of "implementation defined". But that's how it is.

If it were implementation defined, I could rest assured that MSVC, GCC & Clang would compile my code fine on an x86 machine, because x86 follows IEEE. But unfortunately, it's UB, not ID.

 

In the real world, though, I would be very mad if the compiler optimized my UB code away without any warning, because the number of UB scenarios in the C++ standard is gigantic, and mistakes like this happen all the time.

It's the everlasting struggle between compiler writers, who want to take advantage of UB for optimization and are very picky/elitist about following the Std, and the average programmer, who wants to develop a program that isn't broken by technicalities like this.




#5180865 GPU Perfstudio error

Posted by Matias Goldberg on 16 September 2014 - 06:09 PM

There is a hacked version of PIX to make it work on newer systems.




#5180864 Why does divison by 0 always result in negative NaN?

Posted by Matias Goldberg on 16 September 2014 - 06:07 PM

However, in general it's not true. At the risk of beating a dead horse by bringing up the same topic again as two weeks ago: No, the compiler, in general, cannot just do whatever it wants. If you divide some float by some other float, the compiler has to, and will, emit code that does just that.
Although it may, of course, add supplementary code to every division which checks whether the denominator is zero and calls an error handler (or similar), but it may not just do just about anything. That includes eliminating branches. Unless it can prove that the denominator will be zero at compile-time already, it may not optimize out code (or... format your hard disk).

I'm not sure what you mean.
The following snippet triggers UB:

int result = numerator / denominator;
if( denominator == 0 ) //UB: Denominator couldn't be 0 by now. Branch can be left out.
{
    result = 0;
    doSomething();
}


float result = numerator / denominator;
if( denominator == 0 ) //UB: On architectures that don't follow IEEE, Denominator couldn't be 0 by now. Branch can be left out.
{
    result = 0.0f;
    doSomething();
}
There have been way too many security bugs in real-world code caused by optimizing compilers removing important pieces of code, because those pieces were executed after invoking undefined behavior instead of before it.
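
A minimal sketch of the safer ordering, assuming a hypothetical helper `safeDivide` (not from the original thread): the denominator is tested before the division, so the compiler cannot reason that the check is unreachable.

```cpp
// Hypothetical helper: check first, divide afterwards.
// Sets wasZero and returns 0 when the denominator is 0.
// Note: this sketch does not guard against INT_MIN / -1, which is also UB.
int safeDivide( int numerator, int denominator, bool &wasZero )
{
    if( denominator == 0 )
    {
        wasZero = true;
        return 0;
    }

    wasZero = false;
    return numerator / denominator; // The division only runs with a non-zero denominator.
}
```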


#5178700 Texture Array Memory Allocation

Posted by Matias Goldberg on 07 September 2014 - 09:45 AM

OpenGL now has two types of texture storage:

 

Mutable storage.

Gets allocated when you call glTexImageND (where N is 1, 2 or 3). Additional calls to glTexImageND destroy and recreate the texture if the format changes in any way (resolution, bit depth, internal format). You can use glTexSubImageND to upload portions (or the entire image) without reformatting the texture.

  • If you call glTexImageND( 1024, 1024, 64 ); then glTexSubImageND( 1024, 1024, 4 ) you will allocate a 1024x1024x64 texture then upload the 4 slices (you waste 60).
  • If you call glTexImageND( 1024, 1024, 64 ); then glTexImageND( 1024, 1024, 4, ..., data ); you will allocate a 1024x1024x64, then destroy it, create a 1024x1024x4 texture and upload the 4 slices. You don't waste memory, but you burn a lot of CPU & GPU cycles with unnecessary allocations, deallocations, and pipeline stalls.
  • If you call glTexImageND( 1024, 1024, 4, ..., data ); you will create the texture and upload the 4 slices. No waste. If you later call glTexImageND( 1024, 1024, 64, ..., data ); you will destroy the old texture, create a 1024x1024x64 one and upload all 64 slices. It's as wasteful as the previous point.

 

Immutable storage.

The "modern" way. glTexStorageND allocates the texture, and its size and format can never be changed afterwards. Subsequent glTexStorageND & glTexImageND calls are invalid. You can use glTexSubImageND to reupload the entire image or parts of it.

  • If you call glTexStorageND( 1024, 1024, 4 ) you will create a 1024x1024x4 texture. But you can't change it to a 1024x1024x64 later. You will need to call glDeleteTextures and recreate it yourself.
  • If you call glTexStorageND( 1024, 1024, 64 ) you will create a 1024x1024x64 texture.

 

So, bottom line, it depends on how you handle the situation. Immutable storage is preferred over mutable storage, because with immutable storage the wrong parameter will just trigger an error. With mutable storage, the driver will silently try to honour your weird request when given the wrong parameter, slow down your app a lot, and you may see random glitches without any warning of what's wrong.

 

Don't mix glTexImageND & glTexStorageND on the same texture ID btw. You can use glTexSubImageND on both though.
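
To put rough numbers on the "you waste 60" case above, here is a back-of-the-envelope sketch. It assumes an uncompressed RGBA8 format (4 bytes per texel) and no mipmaps, neither of which is stated in the original question; `textureArrayBytes` is a hypothetical helper, and real drivers may add padding or alignment on top.

```cpp
#include <cstdint>

// Hypothetical helper: raw size of a texture array without mipmaps.
// bytesPerTexel = 4 corresponds to the assumed RGBA8 format.
constexpr std::uint64_t textureArrayBytes( std::uint64_t width, std::uint64_t height,
                                           std::uint64_t slices, std::uint64_t bytesPerTexel )
{
    return width * height * slices * bytesPerTexel;
}

// Allocating 64 slices but only using 4 of them:
constexpr std::uint64_t allocated = textureArrayBytes( 1024, 1024, 64, 4 ); // 256 MiB
constexpr std::uint64_t used      = textureArrayBytes( 1024, 1024, 4, 4 );  // 16 MiB
constexpr std::uint64_t wasted    = allocated - used;                       // 240 MiB
```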




#5178516 GOTO, why are you adverse to using it

Posted by Matias Goldberg on 06 September 2014 - 08:27 AM

You speak as if you never programmed anything of more than trivial complexity. BREAK and CONTINUE can be less clear where there are (many) nested loops or where your 'just jump your eyes' might be a long way off the screen, far below.

Don't put words in my mouth (or in my keyboard) that I didn't say. I said the comparison between break/continue and goto was unfair because goto is worse than break & continue. I didn't say break/continue were good practice. I try to avoid them too (however for goto, I just never use it).
Furthermore, I said that where there are (many) nested loops it is bad practice to use break/continue there. Read my post again.
 

Do you have any examples of the sorts of circumstances under which more modern compilers are untrustworthy?

Yes. I have a few examples. I encounter things like this all the time. I used to have a link to a comparison between GCC, Clang & VS showing how badly they all generated code for intrinsics, but I can't seem to find it now.

 

Aside from that (which can be treated as codegen performance bugs that will eventually be fixed), there's a limit on how much the compiler can optimize because of assumptions the compiler cannot make due to how the language works (which is why restrict helps us a lot).

Something as simple as copying member variables to local variables before going into a loop that calls a function can greatly improve performance, because the compiler may not know whether the member variables are going to be modified by the function being called; thus the compiler has to reload them on every iteration. With local variables, the compiler can assume they are not modified (*).

Another example is avoiding writing a non-const reference to a variable/object until the very end of the routine (i.e. an output variable) and only read from it at the beginning; otherwise the compiler cannot assume the variable isn't aliased to any of our member variables (or to any other non-local variable).
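
A minimal sketch of both points, using a hypothetical `Accumulator` struct (the names and the workload are assumptions for illustration). In `applyNaive` the compiler generally has to consider that writing through `out` could modify `scale` or `bias`, since they are all floats; `applyCached` copies the members to locals first, so no reload is needed inside the loop. Whether this actually helps depends on the compiler and the code, so profile it.

```cpp
#include <cstddef>
#include <vector>

struct Accumulator
{
    float scale;
    float bias;
    std::vector<float> values;

    // Reads the members inside the loop. Because 'out' is a float*, it may
    // alias 'scale' or 'bias', so the compiler may reload them every iteration.
    void applyNaive( float *out ) const
    {
        for( std::size_t i = 0; i < values.size(); ++i )
            out[i] = values[i] * scale + bias;
    }

    // Copies the members to locals before the loop. Locals whose address is
    // never taken cannot be aliased by 'out', so they can stay in registers.
    void applyCached( float *out ) const
    {
        const float localScale  = scale;
        const float localBias   = bias;
        const std::size_t count = values.size();
        const float *src        = values.data();

        for( std::size_t i = 0; i < count; ++i )
            out[i] = src[i] * localScale + localBias;
    }
};
```

Both functions produce the same results; only the amount of memory traffic the compiler is forced to emit may differ.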

 

There is a limit on what the compiler can optimize for us. Not because the optimizer is good/bad, but rather because the language enforces some guarantees (with good reason!) that sometimes costs us a lot of CPU time.

This is a very current problem since memory latency and bandwidth have become a bottleneck; and it turns out the compiler can't do much to optimize out (unnecessary-in-practice) memory reloads; it needs our help.

If you're writing accountancy software, ERP, or something like that; you will never have to worry about that. But in gamedev, where we squeeze every drop out of the CPU and iterate over many thousands of objects many times per second; then yes; these limitations begin to show.

 

 

(*) Massively copying everything to local variables won't help either; because this data won't fit into registers and will end up in the stack every time we enter that troublesome function and reload when we leave it; which is basically what the compiler would do with member variables, but worse.

In such cases one has to sit down and analyze the sources of aliasing and rewrite the routine in a way that they get minimized or contained to a particular place where it won't matter. One also has to consider making the routine's body visible to the compiler, so the compiler can know what it is doing and what gets modified (but if it's a virtual function, the compiler can't do it even if it sees the function's body). LTCG can help too in these cases; but remember that even then, there are some assumptions the compiler still can't make.




#5178410 GOTO, why are you adverse to using it

Posted by Matias Goldberg on 05 September 2014 - 04:22 PM

goto breaks the flow of the program and makes it significantly harder to track or predict how a program will execute by just glancing at the code. It also makes silly bugs and mistakes much easier to introduce.
It basically reverts any language back to assembly-level difficulty where you have to follow instruction by instruction to see how the execution jumps around; except that it's even harder because in assembly at least you've got one instruction per line; whereas in other languages one expression could be expanded to multiple statements.
I cannot simply place my finger at a random place, start from there, and understand what's going on. Code with goto forces me to go to the start.

Comparing with continue and break is not fair, because continue and break have a very clear program flow: just jump your eyes to the closing bracket '}' (unless you have deeply nested break and/or continue statements, which are also hard to track and considered bad practice anyway). If the code inside the brackets is too long and you have to scroll down too much, again... you're doing it wrong.

goto on the other hand, can jump anywhere, like a flea. It may jump backwards a couple lines. It may jump backwards a thousand lines, then jump again forward another couple thousand lines. You never know where it's going to land and it's very distracting. Not to mention you have to keep a mental model of the state of the program before each jump.
Maybe you think writing code with goto is easy and convenient. And indeed it is. But when you have to maintain that code a year after you wrote it, or collaborate with other people who need to read and understand what you wrote; goto becomes a nightmare.


#5178222 my SIMD implementation is very slow :(

Posted by Matias Goldberg on 04 September 2014 - 10:30 PM

Nobody said it, so I'm gonna say it.

USE __RESTRICT

 

Also your code has an if( determinant .... ); which sounds to me like it could be converted to branchless code using conditional moves. Unless the cost of executing the code that can be skipped is considerably larger than the cost of mispredicting the branch.
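
A rough sketch of both suggestions; it is not the original poster's code. `scaleAdd` and `safeInverse` are hypothetical names, the epsilon value is an assumption, and `__restrict` is a compiler extension (accepted by MSVC, GCC and Clang) rather than standard C++. The selects in `safeInverse` are simple enough that compilers typically turn them into conditional moves or blends, but that is not guaranteed, so check the generated assembly.

```cpp
#include <cmath>
#include <cstddef>

// __restrict promises the compiler that dst and src never overlap, so it
// does not have to reload src after every store to dst.
void scaleAdd( float *__restrict dst, const float *__restrict src,
               float k, std::size_t count )
{
    for( std::size_t i = 0; i < count; ++i )
        dst[i] += src[i] * k;
}

// Branchless-style inverse: instead of skipping the division when the
// determinant is (nearly) zero, always divide by a safe value and select
// the numerator so the result is 0 in the degenerate case.
float safeInverse( float determinant )
{
    const float epsilon   = 1e-6f; // assumed threshold
    const bool  valid     = std::fabs( determinant ) > epsilon;
    const float divisor   = valid ? determinant : 1.0f;
    const float numerator = valid ? 1.0f : 0.0f;
    return numerator / divisor;
}
```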




#5178219 One Buffer Vs. Multiple Buffers...

Posted by Matias Goldberg on 04 September 2014 - 10:24 PM

Please let me know what you think, i'm new to gpu and it's still hard to predict performance.

Experts can't predict performance either. We're just driven by common guidelines AND THEN WE TRY OUR OPTIONS AND PROFILE.

Computing (in general, i.e. not just GPUs, also CPUs, and RAM, and caches, etc) has become so complex it's virtually impossible to accurately predict what approach is going to be faster (although we can make educated guesses).
For example, I've seen an example where adding extra instructions to a CPU routine in a tight loop caused the loop to execute faster in Haswell chips.

The reason had to do with a store-to-load forwarding stall, where adding an instruction allowed the CPU to avoid a full pipeline stall on every iteration.

It is anecdotal and was a rather synthetic benchmark (not real world code), but the point is that it is completely unintuitive to think that adding an instruction would help the code run faster; which is a great example of how modern architectures are so complex we can't grasp it all.

 

Stop asking, just try, profile, and share the results.




#5177452 Vertices with multiple UV coordinates and glDrawElements()?

Posted by Matias Goldberg on 01 September 2014 - 11:15 AM

Polycount vs Vertex count: a cube can be made with 36 vertices, 24 or 8. (particularly see the sections "Common misconception: Why is the GPU so dumb?" and "This is going to change in the future, right?")
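
The three numbers can be reproduced with a small counting sketch. It is only an illustration: `countCubeVertices` is a hypothetical helper, and the face index stands in for any per-face attribute (a normal, or a UV set that differs per face). A vertex can only be shared when all of its attributes match, not just its position.

```cpp
#include <cstddef>
#include <set>
#include <tuple>

struct CubeCounts
{
    std::size_t unindexed;        // every triangle corner stored separately
    std::size_t positionAndFace;  // shared only when position AND attribute match
    std::size_t positionOnly;     // shared whenever the position matches
};

CubeCounts countCubeVertices()
{
    std::set<std::tuple<int, int, int> >      positions;
    std::set<std::tuple<int, int, int, int> > positionsWithFace;
    std::size_t unindexed = 0;

    // Two triangles per face, referencing the 4 corners of the quad.
    const int quad[6] = { 0, 1, 2, 2, 1, 3 };

    for( int face = 0; face < 6; ++face )
    {
        const int axis = face / 2;              // 0 = X, 1 = Y, 2 = Z
        const int sign = ( face % 2 ) ? 1 : -1; // which side of the cube

        for( int k = 0; k < 6; ++k )
        {
            const int corner = quad[k];
            int p[3];
            p[axis]             = sign;
            p[( axis + 1 ) % 3] = ( corner & 1 ) ? 1 : -1;
            p[( axis + 2 ) % 3] = ( corner & 2 ) ? 1 : -1;

            positions.emplace( p[0], p[1], p[2] );
            positionsWithFace.emplace( p[0], p[1], p[2], face );
            ++unindexed;
        }
    }

    return CubeCounts{ unindexed, positionsWithFace.size(), positions.size() };
}
```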


#5176092 HLSL load() of directx 10 in directx 9

Posted by Matias Goldberg on 25 August 2014 - 04:47 PM

Fun AMD hack: If you set filtering to point, but mipmapping to anisotropic without mipmaps in the texture, and use tex2Dlod, the driver will recognize it and you should get faster sampling of single pixels. Same with DX10's SampleLevel.

But profile it first; just to be sure it's done right.




#5175561 DXT normal map compression

Posted by Matias Goldberg on 22 August 2014 - 05:51 PM

You should read Understanding BCn Texture Compression Formats from Nathan Reed.

 

It should answer your questions. The only thing to add is that, before we had BC5 (which was only introduced in DX10; unless you used the ATI hacks), we used the technique you linked above: use DXT5 and store the components in the green and alpha channels.

The alpha channel in DXT5 is encoded exactly the way BC5 does for its components. However the RGB components are not. They're encoded like in BC1, which delivers lower quality for our purposes. The Green channel is used over the Red and Blue because it has one more bit of precision.

 

This means BC5 beats DXT5 in terms of quality because the latter wastes precision on unused Red & Blue channels.




#5172521 DirectX 10 - Displaying tile squares as...squares?

Posted by Matias Goldberg on 09 August 2014 - 05:49 PM

I guess what you mean is that you have to account for the aspect ratio.

 

Most tutorials online (and helper functions) teach you how to build a perspective matrix and view matrix taking into account the aspect ratio.

 

Change the parameters to use an AR = 1 (or set the width = height when passing to the projection/view matrix generation function helpers).

 

Or just compensate for the AR (apply 1 / AR to the Y axis in the vertex shader).




#5172190 Debugging a system hang (entire OS freezes)

Posted by Matias Goldberg on 07 August 2014 - 08:48 PM

Check the system logs in the Control Panel. There's often useful information there.

Check if minidumps are being generated.
If you've disabled your Swap File or its size is too low, minidumps won't be generated.

Also don't rule out the possibility of hardware problems. If your application has no VSync, and the CPU is not doing anything, the GPU may be running so fast that it overheats or your PSU can't deliver enough power; which is harder to witness in other programs because the GPU rarely runs at 100% uninterrupted (because of VSync or because the CPU actually has some work to do, stalling the GPU).



