jmX

Anybody else seeing their code running slower on core i7?

This topic is 3461 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

I wrote a raytracer a year ago, and it ran fairly well on my Intel Core 2 Quad Q6600. I recently built a Core i7 machine, and to my surprise, the raytracer that ran at 49fps now runs at about 35-40fps. I wanted to track this down. To test further, I set the Core i7 to 2.7GHz and the Core2 to 2.7GHz, then changed my raytracer to run on a single thread, so this should be an apples-to-apples comparison. The result: the Core2 ran at 14fps, and the Core i7 ran at 8.3fps. What the heck? I installed VTune, and the problem area seems to be where I write an int to an array deep inside the raytracer, in a function that is called hundreds of thousands of times. The only explanation I can come up with is that maybe that array is not in L2 cache on the i7 because its L2 is smaller? VTune so far hasn't given very precise results, so I'm not sure how much to trust it, but that's where I'm at. Which brings me to the question: has anybody else seen similar results with some of their code on the new Core? PS: all other benchmarks run faster on the i7, including a friend's raytracer (although it wasn't a realtime tracer).

I've seen benchmarks for the i7 indicating that turning off HT can increase performance. Take a look at Tech Report's i7 review to see what I mean.

Perhaps try this and see what happens?

Yeah, sorry, I forgot to mention that I tried HT on and off, and it made zero difference in this case.

In fact, when I crank my raytracer up to 8 threads with HT on, the framerate in one of my tests goes from 130fps to 170fps, so it's a clear win (and 8 threads with HT disabled gives no increase over 4 threads; it stays at 130fps). If I could figure out the strange performance penalties going on at the low level, I imagine I could see 250fps though.

You need to give more details if you want some help speeding it up. For example, posting the source and disassembly of the bit of code that VTune found to be slow would help.

If it's a caching issue, I'd expect better performance with a smaller output resolution and fewer polygons (and smaller textures if you have textured objects in your scene).

- How do the cache miss rates differ between the two CPUs?

- If you do use textures are they swizzled for better locality?

- What's the total size of all the data used for rendering the scene?

Given that the problem is reported to be writing ints to an array, I can think of several possible causes due to microarchitecture:
- cache/page splits due to unaligned writes
- TLB misses
- limited store buffers
All of the above can be ruled out because the i7 is reported to handle them better or to have more of the relevant resources than Core2. The remaining (and most likely) reason is the new cache structure, where L2 is now much smaller (4x256 KiB vs 2x4 MiB on the Q6600) but a few clocks faster. This is supposed to be offset by the new 8 MiB L3 cache, but it is twice as slow as the old L2 and its effective size is < 7 MiB due to inclusion of lower levels.
Interesting situation; it looks like single-threaded code with largish working sets may perform worse on the i7 due to the new cache, despite faster access to main memory and all the other improvements. The tables should however be turned if you change your benchmark to 4 threads (hopefully scheduled to individual cores), which will make the entire L2 available on the i7 and cause some fighting over Core2's shared cache.

Quote:
Original post by Adam_42
You need to give more details if you want some help speeding it up.


Agreed. This post was meant to simply ask if anybody else had seen similar end results...slower code on i7. I'm not expecting people to dig through my code and find the problem for me, I just want to see if I'm alone in this.

The scene it reads from is very small, maybe 5KB of data (it's only reflective spheres). The whole thing is very compact, which is why I'm thinking maybe everything fit into the L2 cache of the Core2, but now just barely doesn't on the i7? There are no textures or anything of that sort...literally 5KB of source data. For reference, this is the test scene (it's rendered at 720p): http://jmx.ls1howto.com/rtrt/soft_shadows_small.jpg

VTune thus far hasn't been very helpful, but I've not yet tried an instrumented build. I'll be exploring that more this weekend. I might even RTFM for VTune :) As a side note, I have found that the profilers that come with video game consoles have been way more helpful, and I'm kinda shocked that VTune didn't seem as nice, but maybe I've just not given it a chance yet.

Jan W, the Core2 version is faster than the Core i7 with 1, 2, 3 or 4 threads. It doesn't matter. Only once the i7 gets to 6 or 7 threads with HT does it finally win out over the Core2 right now.


Anyway, I'll surely post my results on here when I figure out what's up, and hopefully have an example code snippet that illustrates the problem.

Quote:
Original post by jmX
Anyway, I'll surely post my results on here when I figure out what's up, and hopefully have an example code snippet that illustrates the problem.


I'll be interested to see this.

OK, so here's the profiling of the trouble code on the Core2 Q6600 (2700MHz):


And here's the profiling of the trouble code on the Core i7 (2760MHz):


And for reference, it's a 1024x512 raytrace of some spheres.


If VTune is to be trusted, there are some major issues with branch prediction inside that function. Obviously there are a ton of branches, but that code really works great on the Core2. What gives?

You can eliminate that branch with some bit manipulation trickery:


// declarations (assumes 32-bit int and arithmetic right shift of negative values)
int i1;
int *distance;

// original
if (i1 < distance[r])
{
    distance[r] = i1;
    retval[r] = 1;
}

// branch free (assumes i1 - distance[r] cannot overflow and retval[r] is 0 or 1)
int diff = i1 - distance[r];   // negative exactly when i1 < distance[r]
int test = diff >> 31;         // test = (i1 < distance[r]) ? -1 : 0
distance[r] += diff & test;    // distance[r] = min(distance[r], i1)
retval[r] |= test & 1;         // Can drop the &1 if you're happy with -1 instead of 1

I've never used VTune before, but does it have an option to display assembly instructions instead of annotated source? Attributing ticks to source lines is not always accurate. And can you use the same counters for both runs?

Anyhow, if those results are accurate, your problem could be branch aliasing. If your compiler supports branch hints, you can try hinting the problem branches one way or the other and seeing if things change on the i7. IIRC static branch hints override branch prediction, so you should see different results.
