Anybody else seeing their code running slower on Core i7?


I have a raytracer I wrote a year ago, and it ran fairly well on my Q6600 Core 2. I've recently built a Core i7 machine, and to my surprise, the raytracer that ran at 49fps now runs at about 35-40fps. I wanted to track this down. To test further, I set the Core i7 to 2.7GHz, set my Core 2 to 2.7GHz, and changed my raytracer to run a single thread, so this should be an apples-to-apples comparison. The result: the Core 2 ran at 14fps, and the Core i7 ran at 8.3fps. What the heck?

I installed VTune, and the problem area seems to be where I write an int to an array deep inside the raytracer, in a function that is called hundreds of thousands of times. The only explanation I can come up with is that maybe that array is not in L2 cache on the i7 because its cache is smaller? VTune so far hasn't given very exact results, so I'm not sure how much to trust it, but that's where I'm at.

Which brings me to the question: has anybody else seen similar results with some of their code on the new core?

PS: all other benchmarks run faster on the i7, including a friend's raytracer (although it wasn't a realtime tracer).

I've seen benchmarks for the i7 indicating that turning off HT can increase performance. Take a look at Tech Report's i7 review to see what I mean.

Perhaps try this and see what happens?

Yeah, sorry, I forgot to mention: I tried HT on and off, and it made no difference in this case.

In fact, when I crank my raytracer up to 8 threads with HT on, the framerate in one of my tests goes from 130fps to 170fps, so it's a clear win (8 threads with HT disabled provides no performance increase over 4 threads; it stays at 130fps). If I could figure out the strange performance penalties going on at the low level, I imagine I could see 250fps though.

You need to give more details if you want some help speeding it up. For example, posting the source and disassembly of the bit of code that VTune found was slow would help.

If it's a caching issue I'd expect better performance with a smaller output resolution, fewer polygons (and smaller textures, if you have textured objects in your scene).

- How do the cache miss rates differ between the two CPUs?

- If you do use textures are they swizzled for better locality?

- What's the total size of all the data used for rendering the scene?

Given that the problem is reported to be writing ints to an array, I can think of several possible microarchitectural causes:
- cache/page splits due to unaligned writes
- TLB misses
- limited store buffers

All of the above can be ruled out, because the i7 is reported to handle them better, or to have more of the relevant resources, than the Core 2. The remaining (and most likely) reason is the new cache structure, where L2 is now much smaller (4x256 KiB vs 2x6 MiB) but a few clocks faster. This is supposed to be offset by the new 8 MiB L3 cache, but that has roughly twice the latency of the old L2, and its effective size is < 7 MiB because it is inclusive of the lower levels.

Interesting situation; it looks like single-thread code with largish working sets may perform worse on the i7 because of the new cache hierarchy, despite faster access to main memory and all the other improvements. The tables should, however, be turned if you change your benchmark to 4 threads (hopefully scheduled to individual cores), which makes the entire L2 available on the i7 and causes some contention in the Core 2's shared cache.
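Not part of the original exchange, but if you want to see that cache difference directly, a minimal probe along these lines (the sizes, hop count, and use of C++11 <chrono>/<random> are my own arbitrary choices) pointer-chases through randomly permuted buffers of increasing size and prints the average latency per hop. On a Core 2 the per-hop time should stay low until around 6 MiB, while on an i7 it should step up once the working set spills out of the 256 KiB L2 into L3:

// Rough cache-latency probe: chase pointers through one random cycle per
// buffer size, so every load depends on the previous one and the hardware
// prefetcher can't hide the latency.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    std::mt19937 rng(12345);
    const size_t sizesKiB[] = { 32, 128, 256, 512, 1024, 2048, 4096, 6144, 8192, 16384 };

    for (size_t kib : sizesKiB)
    {
        const size_t n = kib * 1024 / sizeof(size_t);

        // Link all n slots into a single random cycle.
        std::vector<size_t> order(n);
        std::iota(order.begin(), order.end(), size_t(0));
        std::shuffle(order.begin(), order.end(), rng);

        std::vector<size_t> next(n);
        for (size_t i = 0; i + 1 < n; ++i)
            next[order[i]] = order[i + 1];
        next[order[n - 1]] = order[0];

        const size_t hops = 5 * 1000 * 1000;
        size_t p = order[0];

        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < hops; ++i)
            p = next[p];
        auto t1 = std::chrono::steady_clock::now();

        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
        std::printf("%6zu KiB: %.2f ns/hop (p=%zu)\n", kib, ns, p); // print p so the chase isn't optimized away
    }
    return 0;
}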

Quote:
Original post by Adam_42
You need to give more details if you want some help speeding it up.


Agreed. This post was meant simply to ask whether anybody else had seen similar end results: slower code on the i7. I'm not expecting people to dig through my code and find the problem for me; I just want to see if I'm alone in this.

The scene it reads from is very small, maybe 5KB of data (it's only reflective spheres). The whole thing is very compact, which is why I'm thinking maybe everything fit into the L2 cache on the Core 2 but now just barely doesn't on the i7? There are no textures or anything of that sort... literally 5KB of source data. For reference, this is the test scene (rendered at 720p): http://jmx.ls1howto.com/rtrt/soft_shadows_small.jpg

VTune thus far hasn't been very helpful, but I've not yet tried an instrumented build. I'll be exploring that more this weekend; I might even RTFM for VTune :) As a side note, I've found that the profilers that come with video game consoles have been way more helpful, and I'm kind of shocked that VTune doesn't seem as nice, but maybe I just haven't given it a chance yet.

Jan W, the Core 2 version is faster than the Core i7 with 1, 2, 3, or 4 threads; it doesn't matter. Only once the i7 gets to 6 or 7 threads with HT does it finally win out over the Core 2.


Anyway, I'll surely post my results here when I figure out what's up, and hopefully have an example code snippet that illustrates the problem.

Quote:
Original post by jmX
Anyway, I'll surely post my results here when I figure out what's up, and hopefully have an example code snippet that illustrates the problem.


I'll be interested to see this.

OK, so here's the profiling of the trouble code on the Core 2 Q6600 (2700MHz):


And here's the profiling of the trouble code on the Core i7 (2760MHz):


And for reference, it's a 1024x512 raytrace of some spheres.


If VTune is to be trusted, there are some major issues with branch prediction inside that function. Obviously there are a ton of branches, but that code works great on the Core 2. What gives?

You can eliminate that branch with some bit manipulation trickery:


// declarations (the original post wrote "int[]"; pointers are the valid C++ equivalent)
int i1;
int* distance;
int* retval;

// original
if (i1 < distance[r])
{
    distance[r] = i1;
    retval[r] = 1;
}

// branch free; relies on arithmetic right shift of a negative int, which is
// implementation-defined but is what mainstream x86 compilers do
int test = (i1 - distance[r]) >> 31;                      // test = i1 < distance[r] ? -1 : 0
distance[r] = distance[r] - ((distance[r] - i1) & test);  // becomes i1 when test == -1
retval[r] |= test & 1;                                    // can drop the & 1 if you're happy with -1 instead of 1

I've never used VTune before, but does it have an option to display assembly instructions instead of annotated source? Assigning ticks to source lines is not always accurate. And can you use the same counters for both runs?

Anyhow, if those results are accurate, your problem could be branch aliasing. If your compiler supports branch hints you can try hinting the problem branches one way or the other and seeing if things change on the i7. IIRC static branch hints override branch prediction, so you should see different results.
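To illustrate what a branch hint looks like, here is a sketch using GCC/Clang's __builtin_expect; Visual C++ of that era has no equivalent intrinsic, and the function and its parameters below are purely hypothetical, so this only shows the idea rather than something that will change MSVC's output:

#if defined(__GNUC__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)   // no-op fallback for compilers without the intrinsic
#  define UNLIKELY(x) (x)
#endif

// Hypothetical excerpt: tell the compiler the hit case is rare, so it lays out
// the fall-through path for the common (miss) case.
void recordHit(float det, float i1, float* distance, int* retval, int r)
{
    if (UNLIKELY(det > 0.0f))
    {
        if (i1 < distance[r])
        {
            distance[r] = i1;
            retval[r] = 1;
        }
    }
}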

Quote:
Original post by Adam_42
You can eliminate that branch with some bit manipulation trickery:

*** Source Snippet Removed ***


Except that the OP is working with floating-point values (hence 'f32') - as one might expect, since it seems that he's working with roots of a quadratic equation ('det' is presumably short for 'determinant'). I'm not sure the sign-extension is guaranteed to work like that for the right-shift, either.

We can still refactor a little... this removes a bit of duplication, although I'm afraid it won't likely gain any performance:


if (det > 0.0f) {
    // Why isn't "ba" already an array of/pointer to f32?
    f32 b = -(reinterpret_cast<f32*>(&_ba)[r]);
    f32 i2 = b + det;
    if (i2 > 0.0f) {
        bool both_positive = b > det;
        f32 imin = both_positive ? b - det : b + det;

        if (imin < distance[r]) {
            distance[r] = imin;
            retval |= 1 << (r * 2 + int(both_positive));
        }
    }
}



(BTW, doesn't the calculation of 'det' involve a square root? So don't you already know it's >= 0 if you get to this point (and following this code in the == 0 case shouldn't be harmful)? )

Quote:
Original post by Zahlman
(BTW, doesn't the calculation of 'det' involve a square root? So don't you already know it's >= 0 if you get to this point (and following this code in the == 0 case shouldn't be harmful)? )

I don't think the calculation of a determinant involves a square root.

For whatever it's worth, I have a 100% branch-free version of the function that uses SSE instructions to test things in parallel and does conditional stores of the results. The point of the post was more about why this one function totally eats it on the i7 but not on my Core 2. Very strange.

Also, in the case of this function, det <= 0.0 means that no sphere was intersected.


I did start toying with that function, and after changing one line of code the problem vanished and the app went from 8.9fps to 21fps. The problem really is baffling. I was in the process of SSE'ing all the code, and I changed the first line from
"if (det > 0.0f)"
to
u32 detMask = ((u32*)&detsGreaterThanZero)[r];
if (detMask)

Basically I just did an SSE test to see whether the dets were greater than zero, and then checked the resulting mask instead of the float. Now, we all know "det > 0.0f" isn't going to cause an app to run at 8.9fps instead of 21fps, so obviously something about the resulting code is making the processor choke. I haven't had time to do a side-by-side comparison of the assembly yet, but will soon.

VS 2008 gave the same results, as do x64 builds.
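For anyone reading along, here is a guess at what that SSE test might look like; the function name and surrounding structure are invented, but _mm_cmpgt_ps against zero is the standard way to produce the per-lane all-ones/all-zeros mask that is read back as a u32 above (_mm_movemask_ps is a common alternative that packs the four sign bits into one integer):

#include <xmmintrin.h>

typedef unsigned int u32;
typedef float f32;

// Hypothetical sketch: compare four determinants against zero at once and
// branch on the integer mask per lane instead of on a float compare.
void testFourDets(__m128 dets, f32* distance, u32* retval)
{
    __m128 detsGreaterThanZero = _mm_cmpgt_ps(dets, _mm_setzero_ps());

    for (int r = 0; r < 4; ++r)
    {
        u32 detMask = ((const u32*)&detsGreaterThanZero)[r]; // 0xFFFFFFFF if dets[r] > 0, else 0
        if (detMask)
        {
            // ... per-lane hit handling as in the original function,
            // updating distance[r] and retval[r] ...
        }
    }
    (void)distance; (void)retval; // placeholders; the real function writes these
}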

I think you might be suffering from denormalized numbers. Intel might have traded even slower denormal performance (as if they weren't bad enough already) for improved general performance or something like that.
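If denormals turn out to be the issue, one standard experiment (not something suggested in the thread, just a common technique) is to enable flush-to-zero and denormals-are-zero for SSE code and see whether the frame rate recovers. Note this only affects SSE/SSE2 math on the calling thread, not x87 code, so a 32-bit build without /arch:SSE2 may not be covered:

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3 header)

void disableDenormals()
{
    // Results that would be denormal are flushed to zero...
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // ...and denormal inputs are treated as zero.
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}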

Quote:
Original post by janta
Quote:
Original post by Zahlman
(BTW, doesn't the calculation of 'det' involve a square root? So don't you already know it's >= 0 if you get to this point (and following this code in the == 0 case shouldn't be harmful)? )

I don't think the calculation of a determinant involves a square root.


Sorry; I was thinking of the discriminant, which seemed more likely to be involved given (a) the OP is ray-tracing spheres, and (b) the form of the equations (evaluating b +/- det and checking for positive values: positive real roots of a quadratic, I inferred). I suppose both the OP and I thought of the wrong term? :)

Quote:
Original post by Extrarius
I think you might be suffering from denormalized numbers. Intel might have traded even slower denormal performance (as if they weren't bad enough already) for improved general performance or something like that.


Even so, you'd think a compiler could optimize ">= 0.0f" into "((any zero pattern) or (sign bit)) and not (NaN)"... which ought to be fairly easily tested... :/ And anyway, it seems that the slow code was slow due to branch misprediction rather than the conditional expression itself, so something very strange is going on.

OP, try looking at the disassembly?
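For what it's worth, here is a sketch of the kind of integer test described above, applied to the OP's actual det > 0.0f comparison (my own illustration, not from the thread): for IEEE-754 floats that are not NaN, f > 0.0f is equivalent to the signed-integer view of the bits being positive, since +0.0f maps to 0 and any negative float has the sign bit set.

#include <cstring>

// Integer test for f > 0.0f; assumes 32-bit int and IEEE-754 float.
// Correct for any non-NaN input (a positive NaN bit pattern would wrongly report true).
inline bool greaterThanZero(float f)
{
    int bits;
    std::memcpy(&bits, &f, sizeof bits); // bit copy avoids strict-aliasing trouble
    return bits > 0;
}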

Quote:
Original post by jmX
I did start toying with that function, and after changing one line of code the problem vanished and the app went from 8.9fps to 21fps. The problem really is baffling. I was in the process of SSE'ing all the code, and I changed the first line from
"if (det > 0.0f)"
to
u32 detMask = ((u32*)&detsGreaterThanZero)[r];
if (detMask)

Basically I just did an SSE test to see whether the dets were greater than zero, and then checked the resulting mask instead of the float. Now, we all know "det > 0.0f" isn't going to cause an app to run at 8.9fps instead of 21fps, so obviously something about the resulting code is making the processor choke. I haven't had time to do a side-by-side comparison of the assembly yet, but will soon.

VS 2008 gave the same results, as do x64 builds.


I'd strongly suspect branch aliasing based on that. Assuming that little change resulted in either more or fewer instructions than the previous version, a problem branch somewhere moved up or down and no longer aliases some other branch(es). Branch prediction uses the lower bits of a branch instruction's address as a tag, so if a branch history table is, say, 32 entries large and you have two branches 8 instructions apart, they'll use the same entry in the table and interfere with each other as far as predictions go (simplified example). Maybe the i7's branch table changed in size, which would explain why you don't see it on the other processor.

Although, given that you used two different compilers and both 32- and 64-bit builds, and most compilers I know of don't model branch aliasing, it would be pretty amazing if all those variations just happened to generate code with aliasing branches. Simplest test: stick a nop before the if (det > 0) test, or maybe after, or around some other branches that look like they're not predicting well.
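A minimal way to try that nop experiment (the surrounding function here is hypothetical; only the padding idea matters): MSVC has a __nop() intrinsic in <intrin.h>, and GCC/Clang can emit one via inline asm, so a byte of padding before the suspect branch shifts its address and should move or remove any aliasing.

#if defined(_MSC_VER)
#  include <intrin.h>
#  define PAD_NOP() __nop()
#else
#  define PAD_NOP() __asm__ __volatile__("nop")
#endif

// Hypothetical excerpt around the branch VTune flagged.
float traceLane(float det, float b, float oldDistance)
{
    PAD_NOP();   // shift the address of the branch below by one byte
    if (det > 0.0f)
    {
        float candidate = b - det;
        if (candidate < oldDistance)
            return candidate;
    }
    return oldDistance;
}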

UTC (the current Visual Studio compiler) does better with the integral comparison than with the floating-point comparison.


#include <stdio.h>

extern float * myArray;

int main()
{
    volatile int i = 0;

    float f = myArray[i];
    if (f > 0.0f)
    {
        printf("hello world");
    }

    unsigned int u = ((unsigned int *)myArray)[i];

    if (u)
    {
        printf("hello world2");
    }

    return 0;
}



UTC-Floating point comparison


movsxd rcx, DWORD PTR i$[rsp]
mov rax, QWORD PTR ?myArray@@3PEAMEA ; myArray
movss xmm0, DWORD PTR [rax+rcx*4]
; Line 10
comiss xmm0, DWORD PTR __real@00000000
jbe SHORT $LN2@main
; Line 12
lea rcx, OFFSET FLAT:??_C@_0M@LACCCNMM@hello?5world?$AA@
call printf



UTC-Integral comparison


movsxd rcx, DWORD PTR i$[rsp]
mov rax, QWORD PTR ?myArray@@3PEAMEA ; myArray
; Line 17
cmp DWORD PTR [rax+rcx*4], 0
je SHORT $LN5@main
; Line 19
lea rcx, OFFSET FLAT:??_C@_0N@HPBCPIMH@hello?5world2?$AA@
call printf
