• Advertisement
Sign in to follow this  

How to include your profiler code in your profiler?

This topic is 441 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I have written a partially simple performance profiler for my game projects and the basic functionality works: I can drop as many profiling start/end macro as i want and it magically shows me the number of cpu cycles, how much time it gets spent - great.
 
But when i include the part when the profiler builds up a tree for visualizing the recorded events - it simply crashes on a assert.
I know why this happen, but i have no idea how to get around that.
 
Now i explain how that system works:
 
1. I have a struct which defines a event, this contains a type (EventBegin, EventEnd), the actual rdtsc clocks, a unique identifier containing: __COUNTER__, __FUNCTION__, __LINE__ and a additional name + some extra parameters.
 
2. I have a external struct containing a two dimension array, storing the events (Ping pong buffer - or how is this called?)

constant U32 MAX_PROFILER_EVENT_COUNT = 16 * 65536;
struct ProfilerTable {
// NOTE: Event read array index (0 or 1)
U32 currentEventArrayIndex;
// NOTE: Contains two values, each 32bit long
// - Next event index to write to = EventCount - 1
// - Write event array index (0 or 1)
volatile U64 eventArrayIndex_EventIndex;
// NOTE: Two events array (One for read, One for write)
ProfilerEvent events[2][MAX_PROFILER_EVENT_COUNT];
};

3. In my main translation units i define the external table and allocate the required memory once and set everything to zero.
 
4. Now i may drop how many BEGIN() and END() into my code and each one records an ProfilerEvent with all parameters filled in.
It also works with nested blocks as well ;)
 

inline void RecordProfilerEvent(ProfilerType type, char* guid, F32 value = 0.0f) {
	Assert(externalProfilerTable);
	U64 arrayIndex_EventIndex = AtomicAddU64(&externalProfilerTable->eventArrayIndex_EventIndex, 1);
	U32 eventIndex = arrayIndex_EventIndex & 0xFFFFFFFF;
	Assert(eventIndex < ArrayCount(externalProfilerTable->events[0]));
	ProfilerEvent *ev = externalProfilerTable->events[arrayIndex_EventIndex >> 32] + eventIndex;
	ev->clock = __rdtsc();
...
}

#define BEGIN_BLOCK_(guid) {RecordProfilerEvent(ProfilerType_BeginBlock, guid);}
#define END_BLOCK_() {RecordProfilerEvent(ProfilerType_EndBlock, PROFILER_ID("::END_BLOCK::"));}
#define BEGIN_BLOCK(name) BEGIN_BLOCK_(PROFILER_ID(name))
#define END_BLOCK() END_BLOCK_()

void myAwesomeUpdateFunc() {
  BEGIN_BLOCK("Awesome stuff");
  ...
  END_BLOCK();
}

 
5. Right after the game loop has been finished, i toggle the event array index with atomic exchange, so that the event count is reset to zero and the event array index is swapped.
 

	// NOTE: Toggle event array index
	Assert(externalProfilerTable);
	externalProfilerTable->currentEventArrayIndex = !externalProfilerTable->currentEventArrayIndex;
	U64 arrayIndex_EventIndex = AtomicExchangeU64(&externalProfilerTable->eventArrayIndex_EventIndex, (U64)externalProfilerTable->currentEventArrayIndex << 32);
	U32 eventArrayIndex = arrayIndex_EventIndex >> 32;
	Assert(eventArrayIndex <= 1);
	U32 eventCount = arrayIndex_EventIndex & 0xFFFFFFFF;

 
6. Now i record all used events in a historical listy way right after i have done the atomic exchange (Note that atomic exchange returns the value before the change happened!).
 

	// NOTE: Record snapshot
	U32 snapshotIndex = AtomicExchangeU32(&profilerState->snapshotIndex, (profilerState->snapshotIndex + 1) % PROFILER_MAX_SNAPSHOT_COUNT);
	ProfilerSnapshot *snapshot = profilerState->snapshots + snapshotIndex;
	snapshot->eventCount = eventCount;
	for (U32 eventIndex = 0; eventIndex < eventCount; ++eventIndex) {
		ProfilerEvent *sourceEvent = externalProfilerTable->events[eventArrayIndex] + eventIndex;
		ProfilerEvent *targetEvent = snapshot->events + eventIndex;
		*targetEvent = *sourceEvent;
	}

 
7. Lastly i build a simple tree from the recorded events (Begin is a new sibling, End restores to the previous sibling). Also i calculate the actual delta value from the rdtsc and then i simply draw it in a table, graph, whatever.
 
Everything i descriped works great - except for timing this "Collation" process itself. Right now when i time this entire process (put a BEGIN() at the start of the function and a END() at the end of the function) - it either crashes due to my assert for trying to not corrupting the tree structure - or have an absolute value instead of a delta value.
 
When the BEGIN() starts before the atomic exchange, i get the assertion which is totally legal - because i write a start event in the current event array - but the end event gets inserted as first into the other event array.
 
But when i put the BEGIN() after the atomic exchange, the first two entries for the next frame contains the start and end event for that profiler collation - but from the previous frame.
 
I really want to see the profiled profiler timings at the end of my tree - so what should i do? How would you approach that problem?

Edited by Finalspace

Share this post


Link to post
Share on other sites
Advertisement

You could unit-test your profiler code but I think your code is way too complex to act as profiler. What I do is as simple as having a fixed size array of some very very simple buffer structure arround a block of memory so it looks like

TelemetryBuffer<512> packets[PROFILER_MAX_THREADS];

where TelemetryBuffer is just a ringbuffer of <int size> ProfilerData structs. That itself is a union for a single block of memory containing a 64 bit timestamt directly from the cpu time, a type of operation, some meta informations and a stacktrace pointer array.

 

When I do live profiling, each thread needs to register itself first and occupies one slot of my array with its thId. Each time that thId calls profiler it gets a pointer into the ringbuffer and may or may not set some meta information on it, anything else including timestamp and stacktrace is set by the profiler interface internally.

 

Last because I dont have any nor want any visualisation for it in the engine itself, data is send via simple UDP socket to another port on my computer or any other computer in the network that gains this packages and processes them to some kind of visual data/diagram.

 

When doing real non-live profiling I use the same technics but dont do any buffering or sending via UDP socket but fill a totally static buffer of ProfilerPackage data for a fixed ammount of threads for a fixed ID pattern to each thread (without using thId) and save it to disk.

 

I think thats all a profiler needs to do to do not disturb the game from running too much or cause itself issues for getting a clear and performance near image of the operations behind the game

Share this post


Link to post
Share on other sites

You could unit-test your profiler code but I think your code is way too complex to act as profiler. What I do is as simple as having a fixed size array of some very very simple buffer structure arround a block of memory so it looks like




TelemetryBuffer<512> packets[PROFILER_MAX_THREADS];
where TelemetryBuffer is just a ringbuffer of <int size> ProfilerData structs. That itself is a union for a single block of memory containing a 64 bit timestamt directly from the cpu time, a type of operation, some meta informations and a stacktrace pointer array.
 
When I do live profiling, each thread needs to register itself first and occupies one slot of my array with its thId. Each time that thId calls profiler it gets a pointer into the ringbuffer and may or may not set some meta information on it, anything else including timestamp and stacktrace is set by the profiler interface internally.
 
Last because I dont have any nor want any visualisation for it in the engine itself, data is send via simple UDP socket to another port on my computer or any other computer in the network that gains this packages and processes them to some kind of visual data/diagram.
 
When doing real non-live profiling I use the same technics but dont do any buffering or sending via UDP socket but fill a totally static buffer of ProfilerPackage data for a fixed ammount of threads for a fixed ID pattern to each thread (without using thId) and save it to disk.
 
I think thats all a profiler needs to do to do not disturb the game from running too much or cause itself issues for getting a clear and performance near image of the operations behind the game


Sure building and render profiling data is not fast, but game code wont be affected at all, isnt it?
Only the total frametime will increase, but each specific timed block in the game will be the same.
Also the profiling code will be fully out-compiled in shipping mode - so there are no performance penalities for end-users.

But the idea about streaming the data via UDP is nice - havent though about that. Maybe thats due to the fact that i have no net code written since over 15 years...

Oh i ring buffer my profiling results as well, so i can visualize multiple frames. Edited by Finalspace

Share this post


Link to post
Share on other sites
So you've built an instrumented profiler for your code.  Good, those are useful.
 
To profile your profiler, I'd take your profiler code and place it into its own small project. Something a little bigger than Hello World, but still only a few hundred or thousand lines long.
 
Then I would use another instrumented profiler and examine how much time it adds.  I would examine it again, with your profiler turned off.
 
There are many good profilers out there, including the free AMD's CodeXL, Intel's VTune, gprof if you're using gcc, on Linux the Valgrind suite's callgrind and cachegrind. 

Share this post


Link to post
Share on other sites

 

So you've built an instrumented profiler for your code.  Good, those are useful.
 
To profile your profiler, I'd take your profiler code and place it into its own small project. Something a little bigger than Hello World, but still only a few hundred or thousand lines long.
 
Then I would use another instrumented profiler and examine how much time it adds.  I would examine it again, with your profiler turned off.
 
There are many good profilers out there, including the free AMD's CodeXL, Intel's VTune, gprof if you're using gcc, on Linux the Valgrind suite's callgrind and cachegrind. 

 

 

This seems to be a bit of overboard, i really dont core how long the collation or rendering of the profiler takes - its debug code after all. But i want to see how much time it takes in addition to the rest of the frame.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement