Programming Challenge 1

Started by
50 comments, last by Zahlman 16 years, 8 months ago
Quote:Original post by jpetrie
Quote:
Can't you just memset that entire block of memory?

You'd want to use fill_n or similar, not memset, but yes, that also achieves the solution of improving cache coherency.


Indeed. Optimized for style:

const int size = 1024;

// Initialize iloveseven to an array of 7s
int iloveseven[size][size];
std::fill_n(&iloveseven[0][0], size * size, 7);
iloveseven.cpp :

int iloveseven[1024][1024] =
{
{ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ..., 7, 7, }
{ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ..., 7, 7, }
{ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ..., 7, 7, }
....
{ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ..., 7, 7, }
};

fin.

GENiloveseven.cpp :

// generate iloveseven.cpp
...

fin.

This is a fast way of filling iloveseven, no? May we consider that the asm code generated from iloveseven.cpp fills a 1024*1024 memory array quickly?
Quote:Original post by Zahlman
Quote:Original post by jpetrie
Quote:
Can't you just memset that entire block of memory?

You'd want to use fill_n or similar, not memset, but yes, that also achieves the solution of improving cache coherency.


Indeed. Optimized for style:

const int size = 1024;

// Initialize iloveseven to an array of 7s
int iloveseven[size][size];
std::fill_n(&iloveseven[0][0], size * size, 7);

I don't know, but fill_n could actually have been implemented as a loop (I suspect they use different strategies for different cases).
And then there are the optimizations done by the compiler. Relatively small arrays may get unrolled inline like:

array[0] = 7;
array[1] = 7;
array[2] = 7;

instead of a loop.
Quote:
Relatively small arrays may get unrolled inline like:

Perhaps, but why worry?

Optimizing for what you think the compiler will generate in terms of machine code is fruitless. Unrolling the loop may or may not be done by the compiler, so it shouldn't be considered a reasonable "optimization" to attempt in the C++ code.

The only real sane optimization you can make is the observation that the algorithm in question suffers from poor locality (it doesn't traverse the storage linearly). Worrying about unrolling the loop or other things where you try to cajole the compiler into generating better assembly is problematic because they're not general-purpose solutions, whereas improving the algorithm is.

Besides, it's also worth noting that if the array is small enough to be trivially manually inlined, it's probably small enough that the cache coherency issue isn't relevant. That applies mainly to large datasets.
Quote:Original post by jpetrie
Quote:
Relatively small arrays may get unrolled inline like:

Perhaps, but why worry?

I don't. Not in real life, at least.

It's a challenge; I have no idea how sane the solutions have to be. Nor do I know if it should be optimized for size or for speed (I suspect the latter), or whether the optimizations are allowed to be limited to a single platform. I was merely indicating that the gains in performance (if taken to the extreme) can be different from what could be expected by reading the code.
Quote:Original post by WanMaster
I don't know, but fill_n could actually have been implemented as a loop (I suspect they use different strategies for different cases).


It probably does. Do I care? Not really. Using a plain loop probably is fastest there (especially since we have to "flatten" the array to pass it to fill_n like this, so everything will necessarily be iterated over in order). But anyway, it's the compiler writer's job to fix that. I don't have any information that really helps optimize things here beyond a general-case fill_n algorithm, so I might as well use it. To optimize, you must have a specific case for which to optimize, and it has to be special enough to be optimizable.

Quote:And then there are the optimizations done by the compiler. Relatively small arrays may get unrolled inline like:

array[0] = 7;
array[1] = 7;
array[2] = 7;

instead of a loop.


Using std::fill_n doesn't prevent optimizations of this sort, because for example the compiler could inline a loop taken from the fill_n implementation, and then unroll the loop.
The obvious solution is to unroll the loop in its entirety. It'll only be a million lines, and anyway you can generate them programmatically. [nods]
To win one hundred victories in one hundred battles is not the acme of skill. To subdue the enemy without fighting is the acme of skill.
Quote:Original post by King of Men
The obvious solution is to unroll the loop in its entirety. It'll only be a million lines, and anyway you can generate them programmatically. [nods]


I realise you're joking, but that wouldn't even be the most efficient way, because it would probably cause problems with the trace cache on IA-32 and IA-64 platforms (see Section 8.7.1 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual):

Quote:8.7.1 Avoid Excessive Loop Unrolling
Unrolling loops can reduce the number of branches and improve the branch predictability of application code. Loop unrolling is discussed in detail in Chapter 3. Loop unrolling must be used judiciously. Be sure to consider the benefit of improved branch predictability and the cost of increased code size relative to the Trace Cache.

User/Source Coding Rule 36. (M impact, L generality) Avoid excessive loop unrolling to ensure the Trace cache is operating efficiently.

On HT-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s ability to deliver high bandwidth μop streams to the execution engine.
[TheUnbeliever]
Quote:Original post by TheUnbeliever
[...]
[...]
FilledContainer<int> foo(7);
Your code isn't equivalent because of the subscript behavior. You need:
FilledContainer<FilledContainer<int> > iloveseven(FilledContainer<int>(7));

iloveseven[x][y] now always returns 7 =-)
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Hm, this challenge is a bit awkward.

Optimizing for speed without knowledge about the compiler, the language and the architecture is impossible.

Assuming equal syntax for all languages:
Using array[outerloop][innerloop] is SLOWER in Fortran, because Fortran stores arrays column-major in memory; in C and similar row-major languages it is faster. Thus, depending on the language, everyone may somehow have slowed the thing down. ;)

Furthermore, if I use gcc, then the thing is as optimized as possible.
Afaik, gcc is able to recognize this situation and just flip the loop indices (not talking about loop unrolling).

After this, the real fun with architectures begins. If you unroll the loop manually, you might be doomed again, even on usual machines, because they might have asynchronous memory access chips (maybe even several of those). In that case, some fancy loop vectorisation might yield massive speedups over the unrolled version.

Thinking further, you might even express the whole thing as a multiplication of a matrix by a scalar value. This might be done by a graphics processor (that is, a processor specialized in vector and matrix operations) and finish even faster.

Or you might just store the whole thing in your binary and memcpy a huge batch of 7s to that location (yeah, hacking with microcontrollers made my dark side stronger).

Thus, if you want a real optimization challenge, you should specify the surroundings more precisely; otherwise "as fast as possible" is meaningless, because I am never sure whether I haven't actually slowed it down.

This topic is closed to new replies.
