Why does this take so long?

I have a ray tracer that renders a simple scene in under a second in release mode, but takes 25 seconds in debug mode - which is a pain. So I profiled it to see which checks were being inserted that were taking so long. A full 12% of the time is being spent in an accessor to a vector, which basically looks like this:
class Vector
{
  float elems[3];

public:
  const float_t& operator[](int index) const
  { // 12% time of entire program being spent on this line
    return elems[index];
  }
};
Disassembly of the prologue to the function (the second column is the percentage of time spent on that line, relative to the entire program):
push ebp                	0.2
mov ebp,esp             	0.05
push byte ffh           	0
push dword 00426fd5h    	0.21
mov eax,fs:[00000000h]  	0.04
push eax                	0
sub esp,00000128h       	0.16
push ebx                	0.05
push esi                	0
push edi                	0.22
push ecx                	0.09
lea edi,[ebp-00000134h] 	0.06
mov ecx,0000004ah       	0.14
mov eax,cccccccch       	0.03
rep stosd es:[edi]      	10.24
pop ecx                 	0.21
mov eax,[004302a8h]     	0.16
xor eax,ebp             	0.01
push eax                	0.24
lea eax,[ebp-0ch]       	0
mov fs:[00000000h],eax  	0
mov [ebp-14h],ecx       	0.34
Obviously the rep stosd es:[edi] is the line primarily responsible. I'm pretty sure that this is storing a string - but what is this string? Having a poke around in memory doesn't reveal anything meaningful, although I guess it could be Unicode.
[TheUnbeliever]
Iterator debugging.

Profiling debug builds is pointless anyway. It's like trying to estimate your sprinting performance in a pool.
Quote:Original post by Antheus
Iterator debugging.


I wondered about this, but I don't think it's the case: firstly, this is a direct index into an array, which I don't think gets checked; secondly, disabling iterator debugging has no effect. Thanks for the suggestion, though.
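For reference, by "disabling iterator debugging" I mean something along these lines - defining the relevant macros before any standard header is included (the exact macros depend on the compiler version):

// Must appear before any standard library header is included.
#define _HAS_ITERATOR_DEBUGGING 0
#define _SECURE_SCL 0

#include <vector>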

Quote:Profiling debug builds is pointless anyway. It's like trying to estimate your sprinting performance in a pool.


I know that (and originally included a disclaimer in the first post, but edited it out for the sake of brevity): it's just that the slowdown in the debug build makes debugging a bit of a pain, so I was specifically looking to find the check that's making it take so long and, if possible, avoid it. Also, it's significant enough that I was simply curious. :-)
[TheUnbeliever]
Quote:Original post by TheUnbeliever
Quote:Original post by Antheus
Iterator debugging.


I wondered about this, but I don't think it's the case: firstly, this is a direct index into an array, which I don't think gets checked; secondly, disabling iterator debugging has no effect. Thanks for the suggestion, though.


Ugh, I assumed it was dealing with iterators in STL fashion. Then it's not that, obviously.
That's not doing anything with a string. It's initializing 296 bytes on the stack to the value 0xCC, which is the bit pattern "11001100..." repeated. I don't know why it's doing that (some mechanism to detect out-of-bounds writes?).
Here's what's going on:
Lines 1..2: setting up a stack frame
Lines 3..6: build SEH record on the stack
Line 7: reserving space for local variables (why so much? what local variables?)
Lines 8..10: saving registers
Lines 11..16: initializing local variables to 0xCC
Lines 17..22: push the security cookie and install the SEH handler

Quote:It's initializing 296 bytes on the stack to the value 0xCC, which is the bit pattern "11001100..." repeated. I don't know why it's doing that (some mechanism to detect out-of-bounds writes?)

VC always does this in debug mode; the idea is to expose the use of uninitialized local variables. (0xCC..CC is a rather unusual value for ints and floats, so you quickly notice; more so than if the memory were zero-initialized).
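For example (a contrived sketch, not from the code in question), an uninitialized local filled with 0xCC stands out immediately in the debugger:

void example()
{
  float f;   // never initialized
  int* p;    // never initialized

  // With the debug-mode 0xCC fill, f shows up as roughly -1.0737418e+008
  // and p as 0xcccccccc in the watch window - much easier to spot than a
  // quiet zero, and dereferencing p will fault rather than appearing to work.
}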
The profiling data can show a lot of time being spent in your operator[] purely because it gets used a lot. It by no means proves that it is slow.
Personally I would not trust profiling data that gives you time-spent per assembly instruction. Profiling at that level cannot be both fast and accurate. (Just look at the variation on those 'push' lines)

However, one thing that will likely speed it up is to change it to return by value instead of by const reference. Leave the non-const version to return by reference though, of course (see the sketch below).
Also, what is the difference between float and float_t? If float_t is not simply a typedef for float then there is more going on there that may take some time. If it is, then you may as well make them the same anyway.
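A minimal sketch of the return-by-value change, using the class shape from the first post (and plain float for simplicity):

class Vector
{
  float elems[3];

public:
  // const access: returning a copy is enough here and can be cheaper than
  // handing back a reference, especially in unoptimized builds
  float operator[](int index) const
  {
    return elems[index];
  }

  // non-const access still returns a reference so elements can be assigned to
  float& operator[](int index)
  {
    return elems[index];
  }
};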

Rather than focusing on the operator [] you need to focus on the things that call operator [] and reduce those. Post such code if you need help with it.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
To remove that filling with 0xCC, turn off "Basic Runtime Checks" under the C/C++ code generation settings.

It's also possible (although more difficult since instructions get reordered, and some variables end up in registers) to debug release compiled code. Just make sure it's compiled with debug info on.

You can also control optimization settings on a per file basis, so it might be worth using the release build, but with optimizations disabled for the files that you need to debug. There's also #pragma optimize for even more control.
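For example, with #pragma optimize you can switch optimizations off around just the function you want to step through (a sketch; the function name is made up):

// MSVC-specific: disable optimizations for everything between the two pragmas,
// then restore whatever the project settings were.
#pragma optimize("", off)

void traceDebugMe()   // hypothetical function you want to single-step
{
  // ... code that now keeps its local variables and doesn't get inlined away ...
}

#pragma optimize("", on)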

There's also a simple (but somewhat nasty) way to optimize the calling code to work round that slow operator. Let's assume you have code like:
for (size_t i = 0; i < somevector.size(); i++)
{
  float_t f = somevector[i];
  // Do stuff
}


Assuming operator[] does return by reference, replace it with:
const float_t *array = &somevector[0];
const size_t len = somevector.size();
for (size_t i = 0; i < len; i++)
{
  float_t f = array[i];
  // Do stuff
}


This probably won't have any effect on the speed in release mode, but it'll help a lot in debug as functions won't get inlined there.
Quote:Original post by Jan Wassenberg
Line 7: reserving space for local variables (why so much? what local variables?)


There are no local variables at all.

Quote:Original post by iMalc
Also, what is the difference between float and float_t? If float_t is not simply a typedef for float then there is more going on there that may take some time. If not, then you may as well make them the same anyway.


float_t is a typedef, passed in through a template parameter - which all happens at compile time, obviously. I'm not entirely convinced that there's any utility in having a template for it, though; I might just use floats and be damned.
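Roughly the arrangement, as a simplified illustration only - the real code has more to it, and Vec3f is a made-up name:

namespace Maths
{
  template <typename float_t>
  class Vector3
  {
    float_t elems[3];

  public:
    const float_t& operator[](int index) const { return elems[index]; }
    // ...
  };
}

// elsewhere, instantiated with plain float
typedef Maths::Vector3<float> Vec3f;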

Quote:Original post by iMalc
Rather than focusing on the operator [] you need to focus on the things that call operator [] and reduce those. Post such code if you need help with it.


I don't think they're too sub-optimal, but nor are they particularly optimized:

boost::optional<float_t> intersect(const Ray<float_t>& ray)
{
  typedef Maths::Vector3<float_t> Vector3;
  Vector3 d = ray.from - centre;
  float_t a = Maths::dot(ray.direction, ray.direction);
  float_t b = 2 * Maths::dot(ray.direction, d);
  float_t c = Maths::dot(d, d) - radiusSquared;
  float_t discriminant = b*b - 4*a*c;
  if (discriminant < 0)
  {
    return boost::none;
  }
  else
  {
    float_t t1 = (-b + std::sqrt(discriminant)) / (2*a);
    float_t t2 = (-b - std::sqrt(discriminant)) / (2*a);
    float_t tmin = std::min(t1, t2);
    return (tmin < 0 ? std::max(t1, t2) : tmin);
  }
}

// ...

boost::optional<float_t> intersect(const Ray<float_t>& ray)
{
  if (Maths::dot(ray.direction, normal) == 0)
    return boost::none;

  float_t a = normal.x();
  float_t b = normal.y();
  float_t c = normal.z();
  float_t d = -Maths::dot(normal, point);
  float_t t = -(a * ray.from.x() + b * ray.from.y() + c * ray.from.z() + d);
  t /= a * ray.direction.x() + b * ray.direction.y() + c * ray.direction.z();
  return t;
}


@Adam42 - That's handy, thanks.
[TheUnbeliever]
I presume that things like Maths::dot call your operator []. I would tend to put things like dot in your vector class instead, where it can access the array directly.
You could consider making Maths::dot a friend of Vector and accessing the array directly that way - something like the sketch below. Not sure if I'd be totally comfortable with that though.
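A sketch only - the real Maths::dot signature may differ:

class Vector
{
  float elems[3];

public:
  // a friend (or member) dot can read elems directly, so no operator[] call
  // is involved even in a debug build
  friend float dot(const Vector& a, const Vector& b)
  {
    return a.elems[0] * b.elems[0]
         + a.elems[1] * b.elems[1]
         + a.elems[2] * b.elems[2];
  }
};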

You could also consider using the geometric method of finding the intersection between a ray and a sphere. I think it has better early-out opportunities, which would again reduce the number of calls to operator [].
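Something like this, assuming ray.direction is normalised and reusing the centre and radiusSquared members from your code (a sketch, not tested):

boost::optional<float_t> intersect(const Ray<float_t>& ray)
{
  typedef Maths::Vector3<float_t> Vector3;
  Vector3 L = centre - ray.from;
  float_t tca = Maths::dot(L, ray.direction);   // distance along the ray to the closest approach
  float_t L2  = Maths::dot(L, L);
  if (tca < 0 && L2 > radiusSquared)
    return boost::none;                         // sphere is entirely behind the ray - early out
  float_t d2 = L2 - tca * tca;                  // squared distance from the centre to the ray
  if (d2 > radiusSquared)
    return boost::none;                         // ray passes outside the sphere - early out
  float_t thc = std::sqrt(radiusSquared - d2);
  float_t t = tca - thc;                        // nearer hit
  return (t < 0 ? tca + thc : t);               // if we start inside the sphere, take the far hit
}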

I really hope the compiler is being smart enough to reduce the two sqrt calls into one. If not, that'd make a reasonable dent in the speed of that function. I would explicitly factor out the common sub-expression to be sure.
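i.e. something like this, using the variable names from your sphere code above:

float_t root = std::sqrt(discriminant);   // computed once
float_t t1 = (-b + root) / (2*a);
float_t t2 = (-b - root) / (2*a);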
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

