

Member Since 02 Nov 2009
Offline Last Active May 24 2015 01:49 AM

Posts I've Made

In Topic: Increased frame time with less workload

23 May 2015 - 05:00 PM

@Hodgman I tried to find two positions that have about the same frame rate and frame time:


The so-called "slow" facing: [screenshot]

The "fast" facing: [screenshot]
The frame times are very interesting, I think. The "Pre Frame Time" is a slightly misleading name: it is actually the time from glClear until SwapBuffers. The "Post Frame Time" is the time from the beginning of SwapBuffers to the end of SwapBuffers. SwapBuffers also calls that NtGdiDdDDIEscape function, which gets most of the samples in the profiler. This is also where the difference shows up when I'm using roughly the same part of the landscape for both facings: the slower facing spends 2.6 ms in SwapBuffers, the faster one 1.6 ms. The "Pre Frame Time" (the drawing itself) is proportional to the number of indices/draw calls.


I assume the above happens because the draw commands are actually executed when SwapBuffers flushes the queue? But what is this NtGdiDdDDIEscape function that's using up most of the time in SwapBuffers according to the profiler?
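The two-phase measurement described above (pre = glClear through the draw calls, post = SwapBuffers itself) can be sketched with a simple scoped timer. This is a generic sketch, not the engine's actual code: `renderFrame` and `swapBuffers` are placeholder callables standing in for the real GL work. Note that because GL drivers queue commands asynchronously, the "pre" phase mostly measures command submission; the real GPU cost tends to surface wherever the driver flushes, which matches the SwapBuffers observation.

```cpp
#include <chrono>

// Minimal sketch of the two-phase frame timing described above.
// renderFrame() and swapBuffers() are placeholders for the real
// glClear/draw-call sequence and the SwapBuffers call.
struct FrameTimes { double preMs; double postMs; };

template <typename RenderFn, typename SwapFn>
FrameTimes timeFrame(RenderFn renderFrame, SwapFn swapBuffers) {
    using Clock = std::chrono::steady_clock;
    auto t0 = Clock::now();
    renderFrame();              // "Pre Frame Time": glClear ... draw calls
    auto t1 = Clock::now();
    swapBuffers();              // "Post Frame Time": SwapBuffers itself
    auto t2 = Clock::now();
    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    return { ms(t0, t1), ms(t1, t2) };
}
```

Inserting a glFinish at the end of the render phase (before t1) would force the queued GPU work to be attributed to the "pre" phase instead, which is one way to check whether the SwapBuffers cost is really deferred draw work.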

In Topic: Increased frame time with less workload

23 May 2015 - 04:10 PM

Yes, I'm aware of the difference between fps and frame time; I also calculated that 1 ms difference, which seemed strange to me.


The "slow" version faces the direction where the specular reflection on the blocks is at its maximum; the "fast" version faces 180° the other way, with absolutely no specular lighting. The shadows are AO, and everything is rendered using the same code on CPU and GPU. Since AO and the exact lighting aren't fully implemented yet, it looks a bit odd: the specular lighting trumps the AO, which makes the two shots look very different. The important part is that there is currently only one path for rendering the blocks, and it uses all the same states.


I suspected that in one direction my frustum/box intersection code might hit more cases where it can early out, but profiling both situations over a longer period yielded no major difference (fast version: 2.6% of the samples in intersection; slow version: 3.1%; the difference might just be because one version was sampled over a longer period).
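For reference, the early-out behaviour mentioned above usually looks like the following plane-vs-AABB test: as soon as the box is fully outside any one frustum plane, the function returns. This is an illustrative sketch, not the poster's actual Math::Frustum::intersects; the Plane/AABB types are assumptions.

```cpp
#include <array>
#include <cmath>

// Illustrative frustum/AABB intersection with the kind of early-out
// discussed above. Plane normals point inward: n·p + d >= 0 is "inside".
struct Plane { float nx, ny, nz, d; };
struct AABB  { float cx, cy, cz, ex, ey, ez; };  // center + half extents

bool intersects(const std::array<Plane, 6>& frustum, const AABB& b) {
    for (const Plane& p : frustum) {
        // Projected "radius" of the box along the plane normal.
        float r = b.ex * std::fabs(p.nx) + b.ey * std::fabs(p.ny)
                + b.ez * std::fabs(p.nz);
        // Signed distance of the box center from the plane.
        float s = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d;
        if (s + r < 0.0f)
            return false;  // fully outside this plane: early out
    }
    return true;           // intersecting or inside (conservative)
}
```

Depending on the facing, culled boxes may reject on the first plane tested or only on the last, so per-call cost can genuinely vary with view direction, but as the profiling above shows, here it isn't the dominant factor.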


Here is what the profiler says for both versions, looking at the functions doing the most work:

fast version:

something in nvoglv32.dll (the actual rendering, I guess) -> 28.67% exclusive samples
NtGdiDdDDIEscape (no idea what that is, never seen it in any other project) -> 26.09%
NtGdiDdDDIGetDeviceState (same) -> 9.59%
Math::Frustum::intersects -> 4.34%
RtlQueryPerformanceCounter -> 3.16%


slow version -> same functions, same order, only the %-values differ:

So, pretty much the same for CPU sampling.

In Topic: Increased frame time with less workload

23 May 2015 - 03:24 PM

Oh, I thought it was, but actually it wasn't; I forgot to turn it back on after the last experiments. That makes it even stranger to me, since it means every triangle went through the entire stage. I enabled it now, but the difference remains, just at a higher level.



In Topic: Getting invalid instance matrices in vertex shader

24 January 2015 - 06:47 PM

Hello again


Because I didn't know what else to try, I just did a transpose on the instance matrix and, what do you know, it works. In retrospect it makes sense: float4x4 is column-major, but the float4x4 constructor takes row-major data. Thus I had to pass the matrix entries untransposed compared to view/proj and the rest.
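The fix above boils down to a plain transpose on the CPU side before streaming the matrix as four float4 attributes: if the shader-side float4x4 constructor interprets each streamed float4 as a row, column-major CPU data must be transposed first. The Mat4 type and transpose helper here are illustrative, not the poster's actual math library.

```cpp
#include <array>

// Sketch of the row-major vs column-major fix described above.
// Mat4 is a hypothetical plain 4x4 container; a real engine would
// transpose its own matrix type before filling the instance buffer.
using Mat4 = std::array<std::array<float, 4>, 4>;

Mat4 transpose(const Mat4& m) {
    Mat4 t{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            t[c][r] = m[r][c];   // swap rows and columns
    return t;
}
```

Matrices uploaded through a constant buffer are typically read with the default column-major packing, which is why view/proj behaved differently from the per-instance matrix assembled in the vertex shader.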




In Topic: Passing cube normals to shader

10 January 2015 - 03:40 AM

Hello L. Spiro


Thanks for your answer; I already suspected it would be like that. As for your second quote: I of course meant "without using 36 (24) vertices". /EDIT: No, actually I didn't; I wasn't remembering my post correctly./


Anyway, I was able to store the 3D position, normal, texcoord and color in 8 bytes per vertex, so I'll just stick with the increased vertex count (the devices it's going to run on have limited GPU memory available).
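The post doesn't give the exact bit layout, so the following is only one plausible 8-byte packing for block-style geometry: a quantized local position, an index into the 6 axis-aligned cube normals, a texcoord corner, and a packed color. Every field name and the bit split are assumptions for illustration.

```cpp
#include <cstdint>

// Hypothetical 8-byte vertex in the spirit of the post: position,
// normal, texcoord and color all fit in 8 bytes for cube geometry
// because each attribute only needs a handful of discrete values.
struct PackedVertex {
    uint8_t x, y, z;        // quantized local position, 1 byte per axis
    uint8_t normal : 3;     // index into the 6 axis-aligned face normals
    uint8_t corner : 2;     // texcoord corner: (0,0) (1,0) (0,1) (1,1)
    uint8_t pad    : 3;     // spare bits
    uint32_t color;         // packed RGBA8 color
};                          // 3 + 1 + 4 = 8 bytes total

static_assert(sizeof(PackedVertex) == 8, "vertex must stay 8 bytes");
```

The vertex shader would then expand the normal index and corner bits back into a float3 normal and float2 texcoord via small constant lookup tables, trading a few ALU instructions for a 4-8x reduction in vertex memory.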