Anybody else seeing their code running slower on core i7?

16 comments, last by gran sveti 15 years, 2 months ago
I wrote a raytracer a year ago, and it ran fairly well on my Q6600 Core 2. I recently built a Core i7 machine and, to my surprise, the raytracer that ran at 49fps now runs at about 35-40fps. I wanted to track this down. To test further, I set the Core i7 to 2.7GHz and the Core 2 to 2.7GHz, then changed my raytracer to use a single thread, so this should be an apples-to-apples comparison. The result was that the Core 2 ran at 14fps and the Core i7 ran at 8.3fps. What the heck?

I installed VTune, and the problem area seems to be where I write an int to an array deep inside the raytracer, in a function that is called hundreds of thousands of times. The only explanation I can come up with is that maybe that array is not in L2 cache on the i7 because its L2 is smaller? VTune so far hasn't given very exact results, so I'm not sure how much to trust it, but that's where I'm at.

Which brings me to the question... has anybody else seen similar results with some of their code on the new core? PS: all other benchmarks run faster on the i7, including a friend's raytracer (although it wasn't a realtime tracer).
I've seen benches for the i7 indicating that turning off HT can increase performance. Take a look at Tech Report's i7 review to see what I mean.

Perhaps, try this and see what happens?
Yeah, sorry, I forgot I tried HT on and off, and it made 0 difference in this circumstance.

In fact, when I crank my raytracer up to 8 threads with HT on, the framerate in one of my tests goes from 130fps to 170fps, so it's a clear win (and 8 threads with HT disabled provides no performance increase over 4 threads; it stays at 130fps). If I could figure out the strange performance penalties going on at the low level, I imagine I could see 250fps though.
You need to give more details if you want some help speeding it up. For example, posting the source and disassembly of the bit of code that VTune found was slow would help.

If it's a caching issue I'd expect better performance with a smaller output resolution, fewer polygons (and smaller textures if you have textured objects in your scene).

- How do the cache miss rates differ between the two CPUs?

- If you do use textures are they swizzled for better locality?

- What's the total size of all the data used for rendering the scene?
Given that the problem is reported to be writing ints to an array, I can think of several possible causes due to microarchitecture:
- cache/page splits due to unaligned writes
- TLB misses
- limited store buffers
All of the above can probably be ruled out because the i7 is reported to handle them better, or to have more of the relevant resources, than Core2. The remaining (and most likely) reason is the new cache structure, where L2 is now much smaller (4x256 KiB vs 2x6 MiB) but a few clocks faster. This is supposed to be offset by the new 8 MiB L3 cache, but it is twice as slow as the old L2 and its effective size is < 7 MiB due to inclusion of lower levels.
Interesting situation; it looks like single-thread code with largish working sets may perform worse on i7 due to the new cache despite faster access to main memory and all the other improvements. The tables should however be turned if you change your benchmark to 4 threads (hopefully scheduled to individual cores), which will make the entire L2 available on i7 and cause some fighting on Core2's shared cache.
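One way to probe the cache theory without touching the raytracer is to time int writes over buffers of different sizes and look for a step in cost as the working set outgrows a cache level. The following is only a minimal sketch: `time_writes` and `SweepResult` are made-up names, the 64-byte line size is an assumption, and a strided pattern like this is friendly to hardware prefetchers, so a randomized access order may be needed before any step shows up.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types and names; nothing here comes from the raytracer.
struct SweepResult {
    double ns_per_write;     // average wall-clock cost of one store
    std::uint64_t checksum;  // sum of the buffer, so the stores stay observable
};

// Performs `total_writes` int stores into a buffer of `working_set_bytes`,
// advancing one cache line (64 bytes, an assumption) per store and wrapping.
SweepResult time_writes(std::size_t working_set_bytes, std::size_t total_writes) {
    const std::size_t n = working_set_bytes / sizeof(int);
    if (n == 0 || total_writes == 0) return {0.0, 0};

    std::vector<int> buf(n, 0);
    const std::size_t stride = 64 / sizeof(int);
    std::size_t idx = 0;

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < total_writes; ++i) {
        buf[idx] = static_cast<int>(i);
        idx = (idx + stride) % n;
    }
    const auto t1 = std::chrono::steady_clock::now();

    std::uint64_t checksum = 0;
    for (int v : buf) checksum += static_cast<std::uint64_t>(v);

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {ns / static_cast<double>(total_writes), checksum};
}
```

Calling it with, say, 128 KiB, 512 KiB, 4 MiB and 16 MiB on both machines (same clock, single thread) would show whether the i7 gets noticeably more expensive somewhere between 256 KiB and a few MiB, which is what the smaller-L2 explanation predicts.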
Quote:Original post by Adam_42
You need to give more details if you want some help speeding it up.


Agreed. This post was meant to simply ask if anybody else had seen similar end results...slower code on i7. I'm not expecting people to dig through my code and find the problem for me, I just want to see if I'm alone in this.

The scene it reads from is very small, maybe 5kb of data (it's only reflective spheres). The whole thing is very compact, which is why I'm thinking maybe everything fit into the L2 cache of the Core2, but now just barely doesn't on the i7? There are no textures or anything of that sort... literally 5kb of source data. For reference, this is the test scene (it's rendered at 720p): http://jmx.ls1howto.com/rtrt/soft_shadows_small.jpg

VTune thus far hasn't been very helpful but I've not yet tried an instrumented build. I'll be exploring that more this weekend. I might even RTFM for VTune :) As a side note, I have found that the profilers that come with video game consoles have been way more helpful, and I'm kinda shocked that VTune didn't seem as nice, but I've maybe just not given it a chance yet.

Jan W, the Core2 version is faster than the Core i7 with 1, 2, 3 or 4 threads. It doesn't matter. Only once the i7 gets to 6 or 7 threads with HT does it finally win out over the Core2 right now.


Anyway, I'll surely post my results on here when I figure out what's up, and hopefully have an example code snippet that illustrates the problem.

Quote:Original post by jmX
Anyway, I'll surely post my results on here when I figure out what's up, and hopefully have an example code snippet that illustrates the problem.


I'll be interested to see this.
Ok, so here's the profiling of the trouble code on the Core2 Q6600 (2700MHz):


And here's the profiling of the trouble code on the Core i7 (2760MHz):


And for reference, it's a 1024x512 raytrace of some spheres.


If VTune is to be trusted, there are some major issues going on inside that function with the branch prediction. Obviously there's a ton of branches, but that code really works great on the Core2. What gives?
You can eliminate that branch with some bit manipulation trickery:

// declarations
int i1;
int *distance;
int *retval;

// original
if (i1 < distance[r])
{
  distance[r] = i1;
  retval[r] = 1;
}

// branch free (assumes 32-bit int, arithmetic right shift, no overflow in the subtraction)
int test = (i1 - distance[r]) >> 31; // test = (i1 - distance[r]) < 0 ? -1 : 0
distance[r] = distance[r] + ((i1 - distance[r]) & test); // becomes i1 only when i1 is smaller
retval[r] |= test & 1; // Can drop the &1 if you're happy with -1 instead of 1
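Since sign tricks like this are easy to get backwards, it may be worth checking the branch-free form against the plain comparison before profiling it. A small sketch, with made-up function names and the same assumptions (32-bit ints, arithmetic right shift on negative values, no overflow in the subtraction); note the difference is taken as candidate minus current distance, so the mask is set exactly when the candidate is smaller:

```cpp
#include <cstdint>

// Reference version with the branch, mirroring `if (i1 < distance[r])`.
// Returns true when the stored distance was replaced.
inline bool update_branchy(std::int32_t candidate, std::int32_t& dist) {
    if (candidate < dist) {
        dist = candidate;
        return true;
    }
    return false;
}

// Branch-free version (hypothetical name). Assumes candidate - dist does not
// overflow and that >> on a negative int is an arithmetic shift, which is
// guaranteed from C++20 and is what mainstream compilers did before that.
inline bool update_branchless(std::int32_t candidate, std::int32_t& dist) {
    const std::int32_t diff = candidate - dist;  // negative when candidate < dist
    const std::int32_t mask = diff >> 31;        // -1 if diff is negative, else 0
    dist += diff & mask;                         // adds diff only when mask is -1
    return (mask & 1) != 0;
}
```

Whether this actually beats the branching version is something to measure rather than assume: compilers may already turn the simple `if` into a conditional move, and the branch-free form always performs the store.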
I've never used VTune before, but does it have an option to display assembly instructions instead of annotated source? Assigning ticks to source lines is not always accurate. And can you use the same counters for both runs?

Anyhow, if those results are accurate your problem could be branch aliasing. If your compiler supports branch hints you can try hinting the problem branches one way or another and seeing if things change on the i7. IIRC static branch hints override branch prediction so you should see different results.
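For what it's worth, here is roughly what hinting looks like with GCC or Clang, as a sketch only. `__builtin_expect` is a real builtin on those compilers; the `HINT_*` macros and `count_closer_hits` are made-up names, and other compilers fall through to a plain condition here. In practice the builtin mostly steers how the compiler lays out the code, so whether it changes anything on the i7 is exactly the thing to measure.

```cpp
#if defined(__GNUC__) || defined(__clang__)
#define HINT_LIKELY(x)   __builtin_expect(!!(x), 1)
#define HINT_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
// Fallback for compilers without the builtin: no hint at all.
#define HINT_LIKELY(x)   (x)
#define HINT_UNLIKELY(x) (x)
#endif

// Hypothetical stand-in for the hot loop: keeps the smaller of candidate and
// stored distance per slot, and counts how many slots were updated.
// The update is hinted as the rare case; flip the macro to test the opposite.
int count_closer_hits(const int* candidates, int* distance, int n) {
    int updated = 0;
    for (int i = 0; i < n; ++i) {
        if (HINT_UNLIKELY(candidates[i] < distance[i])) {
            distance[i] = candidates[i];
            ++updated;
        }
    }
    return updated;
}
```

Building once with `HINT_UNLIKELY` and once with `HINT_LIKELY` on the problem branch, then profiling both with the same counters, would show whether the hint moves the numbers at all.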
