At the moment, I'm not working on instancing, and I'd really like to know why the parallel code experiences a sudden drop in speed.
You mentioned that the code chunk you posted takes 48% of the time in one test. Do you know for sure that it is the part that grows out of proportion to the rest of the frame as x increases? That is, as you increase x, does that percentage go up, go down, or stay the same?
I would time the first parallel loop, the second parallel loop, and your actual rendering code, all separately, and see which one is growing faster than the others as you increase x. You don't even need SlimTune, just use a System.Diagnostics.Stopwatch and draw the times, or percentages of frametime, on the screen. That way you can at least verify that you are targeting exactly the right section.
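Something along these lines would do for the timing. This is only a rough sketch: `SectionTimer`, the section names, and the `Thread.Sleep` stand-ins are all made up for illustration, and you would replace the lambdas with your two parallel loops and your draw code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: accumulates wall-clock time per named section of a frame.
public sealed class SectionTimer
{
    private readonly Dictionary<string, double> _ms = new Dictionary<string, double>();

    // Runs the section and adds its elapsed milliseconds to the named bucket.
    public void Measure(string name, Action section)
    {
        Stopwatch watch = Stopwatch.StartNew();
        section();
        watch.Stop();

        double soFar;
        _ms.TryGetValue(name, out soFar); // soFar stays 0 if the name is new
        _ms[name] = soFar + watch.Elapsed.TotalMilliseconds;
    }

    // Total milliseconds recorded for the section (0 if never measured).
    public double Milliseconds(string name)
    {
        double value;
        return _ms.TryGetValue(name, out value) ? value : 0.0;
    }

    // Share of all measured time spent in the named section, 0..100.
    public double Percent(string name)
    {
        double total = 0.0;
        foreach (double v in _ms.Values) total += v;
        return total > 0.0 ? 100.0 * Milliseconds(name) / total : 0.0;
    }

    // Call at the start of each frame (or each averaging window).
    public void Reset() { _ms.Clear(); }
}

public static class TimingDemo
{
    public static void Main()
    {
        var timer = new SectionTimer();

        // Stand-ins for the real work.
        timer.Measure("loop1", () => Thread.Sleep(5));
        timer.Measure("loop2", () => Thread.Sleep(10));
        timer.Measure("render", () => Thread.Sleep(5));

        foreach (string name in new[] { "loop1", "loop2", "render" })
        {
            Console.WriteLine("{0}: {1:F2} ms ({2:F0}%)",
                name, timer.Milliseconds(name), timer.Percent(name));
        }
    }
}
```

Run it at a few values of x with the same camera setup and watch which percentage climbs. Averaging over a few dozen frames before drawing the numbers makes them much easier to read.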
Also, if you are not doing it already, I would make sure you compare tests with the same percentage of objects visible -- either all, or none, or some constant value like 50%. If you compare 60% of objects visible at x=27 to 40% visible at x=28, you will see changes in the relative timing of different sections that come only from that percentage, not from x.
Also consider memory allocation and garbage collection. As you start allocating bigger chunks of RAM, allocation gets more and more expensive, and the GC may start to thrash more and more often. One VERY SUSPICIOUS fact is that an array of 27^3 32-bit references (19,683 x 4 = 78,732 bytes) is right around the size at which objects start getting put in the separate large object heap (85,000 bytes); at 28^3 the same array is 87,808 bytes, which is over that threshold. I wrote a program to time how long it took per allocation averaged over 100,000 allocations, and I get this chart:
(EDIT: The graph is actually milliseconds for a pair of allocations -- first allocating an array of that size, and then calling ToList on it.)
Notice that the Y axis is milliseconds, though -- so if it is affecting you it is probably because the number of objects in your heap is much larger than my test program's, or you allocate many times per frame.
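If you want to try the same kind of measurement on your machine, here is a rough reconstruction. It is not the exact program behind the chart: the class and method names are invented, the iteration count is lowered so it finishes quickly, and the byte figure ignores the small array header. Note that `IntPtr.Size` is 8 in a 64-bit process, so the value of x at which the array crosses 85,000 bytes depends on how you build.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class AllocationBenchmark
{
    // Approximate payload size in bytes of a reference array with x^3 elements
    // (ignores the array header).
    public static long ReferenceArrayBytes(int x, int referenceSize)
    {
        return (long)x * x * x * referenceSize;
    }

    // Average milliseconds for one pair of allocations: an array of `length`
    // references, followed by a List copy of it made with ToList().
    public static double AveragePairMilliseconds(int length, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            object[] array = new object[length];
            List<object> list = array.ToList();
            GC.KeepAlive(list); // keep the allocations from being optimized away
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        const int iterations = 10000; // lower than the 100,000 used for the chart

        for (int x = 24; x <= 31; x++)
        {
            int length = x * x * x;
            Console.WriteLine("x={0} bytes={1} avg={2:F5} ms",
                x,
                ReferenceArrayBytes(x, IntPtr.Size),
                AveragePairMilliseconds(length, iterations));
        }
    }
}
```

With 4-byte references, `ReferenceArrayBytes(27, 4)` is 78,732 and `ReferenceArrayBytes(28, 4)` is 87,808, so the 85,000-byte line falls between those two values of x.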
Maybe try to see how frequently GCs are happening. There are performance counters that can tell you lots of details about this (the ".NET CLR Memory" category has "# Gen 0 Collections", "% Time in GC" and so on). GCs can strike anywhere in your main loop, even if the allocations that cause them are all in one place, so it would be useful to rule that out first.
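If you would rather check from inside the program than set up counters, `GC.CollectionCount` gives the number of collections per generation so far, and you can diff it once per frame. A minimal sketch, assuming you call `Sample()` once per frame; `GcCounter` is a made-up name and the garbage-generating loop in `Main` is only there so the demo has something to report.

```csharp
using System;

// Hypothetical helper: reports how many collections of each generation
// happened between consecutive calls to Sample().
public sealed class GcCounter
{
    private readonly int[] _last = new int[GC.MaxGeneration + 1];

    public GcCounter()
    {
        for (int gen = 0; gen < _last.Length; gen++)
            _last[gen] = GC.CollectionCount(gen);
    }

    // Returns the per-generation collection counts since the previous call.
    public int[] Sample()
    {
        int[] delta = new int[_last.Length];
        for (int gen = 0; gen < _last.Length; gen++)
        {
            int now = GC.CollectionCount(gen);
            delta[gen] = now - _last[gen];
            _last[gen] = now;
        }
        return delta;
    }
}

public static class GcDemo
{
    public static void Main()
    {
        var counter = new GcCounter();

        for (int frame = 0; frame < 5; frame++)
        {
            // Stand-in for one frame of work that produces a lot of garbage.
            for (int i = 0; i < 1000; i++)
            {
                object[] junk = new object[25000];
                GC.KeepAlive(junk);
            }

            int[] delta = counter.Sample();
            Console.WriteLine("frame {0}: collections per generation = {1}",
                frame, string.Join(", ", delta));
        }
    }
}
```

If the gen 2 number is nonzero on most frames once x passes the point where things slow down, that points at the large arrays rather than at the parallel loops themselves.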