Slow speedup of CPUs?

17 comments, last by tonemgub 10 years, 7 months ago
Considering my latest programs have all been bandwidth-bound after optimizing them to fully take advantage of the CPU, I'm more interested in bandwidth gains than in CPU performance increases.
The world needs more bandwidth now, not more computing power.

There was a good article about a decade ago, "The Free Lunch Is Over" that you should probably read.

Moore's law, the observation that the number of transistors in an integrated circuit doubles about every two years, continues to hold true.

For 30 years or so that doubling was applied to a single processor. Programmers got a "free lunch" because the processors running their code became faster. You could design software that was more advanced than your hardware and be assured that by the time it launched the hardware would be able to handle it.

Starting about 15 years ago that "free lunch" began to end. Processor speed is still increasing but it is happening as a lateral change. Instead of having a single processor with improving performance, we have a growing number of processors with similar performance.

Parallel algorithms are harder to write, but can achieve linear and even superlinear speedup thanks to cache effects, the aggregate memory across processors, and other properties of parallel hardware.

The improvements in performance are still there, we just need to work harder to get them.
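As a minimal sketch of that point (my own illustration, not from the article): the serial sum below gets nothing from extra cores, while the explicitly partitioned version scales with the number of hardware threads until memory bandwidth becomes the bottleneck. Sizes and names are illustrative.

// Parallel array sum: each worker owns a disjoint slice, so no locks are needed.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 26;          // ~64M elements
    std::vector<std::uint32_t> data(n, 1);

    // Serial version: one core does all the work.
    std::uint64_t serial = std::accumulate(data.begin(), data.end(), std::uint64_t{0});

    // Parallel version: split the range across hardware threads.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::uint64_t> partial(workers, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) {
        pool.emplace_back([&, t] {
            std::size_t begin = n * t / workers;
            std::size_t end   = n * (t + 1) / workers;
            std::uint64_t sum = 0;
            for (std::size_t i = begin; i < end; ++i) sum += data[i];
            partial[t] = sum;                              // private result, merged later
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t parallel = std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});

    std::printf("serial=%llu parallel=%llu\n",
                (unsigned long long)serial, (unsigned long long)parallel);
    return 0;
}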

Considering my latest programs have all been bandwidth-bound after optimizing them to fully take advantage of the CPU, I'm more interested in bandwidth gains than in CPU performance increases.
The world needs more bandwidth now, not more computing power.

I think we are talking here about overall CPU performance, not just 'computational without bandwidth' - I agree that bandwidth (that is, mov speed) is most important.

EDIT

After all - what is the reason that movs can't be parallelised?

for example, if I had such code:

mov mem1, 0
mov mem2, 0
mov mem3, 0
mov mem4, 0
mov mem5, 0
mov mem6, 0
mov mem7, 0
mov mem8, 0

run in 'parallel' it would be an 8x bandwidth speedup, but it does not work this way - is the trouble with the CPU, the RAM chip, or what?

There was a good article about a decade ago, "The Free Lunch Is Over" that you should probably read.

Moore's law, the observation that the number of transistors in an integrated circuit doubles about every two years, continues to hold true.

For 30 years or so that doubling was applied to a single processor. Programmers got a "free lunch" because the processors running their code became faster. You could design software that was more advanced than your hardware and be assured that by the time it launched the hardware would be able to handle it.

Starting about 15 years ago that "free lunch" began to end. Processor speed is still increasing but it is happening as a lateral change. Instead of having a single processor with improving performance, we have a growing number of processors with similar performance.

Parallel algorithms are harder to write, but can achieve linear and even superlinear speedup thanks to cache effects, the aggregate memory across processors, and other properties of parallel hardware.

The improvements in performance are still there, we just need to work harder to get them.

It ended maybe not 15 years ago, but with the Pentium 4 and the NetBurst architecture - so maybe about 10 years ago (not sure, but I think that in 1998 it was still getting faster quickly).

The trouble with many cores is that, I think, you have no general guarantee that all of your code can be rewritten to use as many cores as you wish and will gain such a linear speedup (apart from that, I still only see 4 or 6 cores through all these years, and that is not very impressive).

I am not sure about this guarantee, but maybe some code is inherently limited in such many-core rewriting and parallelisation - so there will be no speedup from cores at all beyond some limit.

The fundamental reason for CPU slowdown is interconnect speed (the delay through wires connecting two or more transistors). Two parts of a circuit path contribute to the delay of a circuit: transistor speed and wire speed. As processes scale down, the transistor continues to deliver faster and faster performance. Wires also benefit from this speed-up but at a slower rate proportional to the transistor.

Many reasons account for this difference. The distance between wires (wire pitch) must be large enough to avoid electro-magnetic cross talk -- the ability of a moving current to induce a magnetic field which modifies adjacent wires' magnetic fields and thereby their current. With enough noise, a downstream transistor might fire erratically. The common fix is to either space the wires further apart, because this effect is inversely proportional to wire separation distance, or insert a shielded wire between the two. Either way, you are now using up more space.

Even though the transistors are smaller, this extra wire spacing places them further apart (proportionally from prior technology versions) and slows their operational speed.

Also, wire pitch itself is worse off. At 20nm light diffraction patterns occur as in the classic double slit diffraction experiment. This is because device features are approaching the wavelength of light. Currently, fabrication plants use double patterning techniques to get diffraction patterns to overlap sufficiently. Double patterning places severe restrictions on VLSI design which affects transistor density.

Caching and memories have long had to deal with wire issues and it's the primary reason why they are slow -- a large capacitance is charged or drained over long wires (think RC time constants).
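(A rough back-of-the-envelope form of that RC point, my own sketch rather than the poster's: for a distributed RC wire of length $L$, with resistance $r$ and capacitance $c$ per unit length, the Elmore delay estimate is

\[ t_{\text{wire}} \approx \tfrac{1}{2}\, r\, c\, L^{2}, \]

so doubling the length of a wire roughly quadruples its delay, which is why long wires and large cache arrays stop keeping pace with the transistors.)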

This is the main reason CPU designs have shifted to more parallel based designs. They can no longer reap the benefits of simple dimensional scaling.

Several stopgaps are being investigated: X-ray lithography (using X-rays instead of light to etch features, because the wavelength is much smaller), and e-beam technology, where a stream of electrons does the etching. Most are far from production.

I predict someone will make carbon nanotubes a viable process or some such technology. When that happens you'll see one final discrete maximum frequency jump in CPU clock speed followed by very marginal gains thereafter.

I am interested in seeing how the ARM and x86 battle plays out. Will the x86 prefetch and decode hurt bandwidth too much? Is it insurmountable? Sure Intel may claim higher clock speeds, but it really doesn't matter if you take two clock cycles where a different architecture could do the same with a slightly slower clock but in one cycle.

Considering my latest programs have all been bandwidth-bound after optimizing them to fully take advantage of the CPU, I'm more interested in bandwidth gains than in CPU performance increases.
The world needs more bandwidth now, not more computing power.

I think we are talking here about overall CPU performance, not just 'computational without bandwidth'

They're not the same thing. When Intel talks about making a chip faster, it usually doesn't account for memory bandwidth, because they're not in charge of that.
Of course, they do have a division dedicated to memory, and they mandate chipset specs so that their CPUs won't be paired with motherboards & memory modules that won't deliver and make the company look bad.
But there's a limit to what they can do.
In any case, you're talking about overall PC performance, not CPU performance.

For example, Sony is pairing their AMD chips with GDDR5 memory on their upcoming PS4, with 176 GB/s of bandwidth (vs. the 25 GB/s the latest DDR3 offers, and it's common to see MUCH less than that in ordinary PCs).

So, if an algorithm of mine is severely bandwidth-limited and suddenly it runs 7 times faster on the PS4, it's not fair to say AMD did a good job optimizing their CPUs, because that's not what happened.
In fact, the same CPU (or a very similar one) is in the Xbox One, and the performance wouldn't be the same.

(Don't extrapolate this into a PS4 vs. Xbox One war, because it's not one. The PS4's GDDR5 bandwidth is shared with the GPU, whereas the Xbox's is not; furthermore, there's the possibility another algorithm is latency-bound, in which case the Xbox would outperform.)

I agree that bandwidth (that is, mov speed) is most important

EDIT

After all - what is the reason that movs can't be parallelised?
for example, if I had such code:

mov mem1, 0
mov mem2, 0
mov mem3, 0
mov mem4, 0
mov mem5, 0
mov mem6, 0
mov mem7, 0
mov mem8, 0

run in 'parallel' it would be an 8x bandwidth speedup, but it does not work this way - is the trouble with the CPU, the RAM chip, or what?

Mov speed has little to do with it. Running 8 movs on 8 cores is like moving to another house with lots of furniture and having 8 workers but only one delivery truck. Adding more workers won't speed it up. Having more trucks (or a faster one) will.

Parallelizing movs can only improve performance if the CPU can't issue enough mov instructions per second to saturate the bandwidth. This will depend on pipeline depth and CPU frequency.
Another way to optimize it is replacing 4 movs with one movaps. This can solve the issue long before having to go multicore.
Last but not least, in the x86 case there's the issue of cache thrashing and non-temporal moves, for which you have movntps (for aligned memory) and movnti (for safer, 32-bit moves).
You will find this blog post about achieving maximum memory bandwidth interesting. I reproduced the same results independently before finding that blog post, so IMHO it's quite accurate.
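As a rough sketch of the movaps / movntps idea (my own intrinsic-based illustration, not taken from that blog post, and assuming a 16-byte-aligned destination):

// Zero-fill a buffer with wide, non-temporal stores (the intrinsic forms
// of movaps/movntps). Streaming stores bypass the cache, which helps when
// the data won't be read back soon.
#include <xmmintrin.h>   // _mm_setzero_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

void zero_fill_stream(float* dst, std::size_t count) {
    const __m128 zero = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_ps(dst + i, zero);   // movntps: 16 bytes per store, no cache pollution
    for (; i < count; ++i)
        dst[i] = 0.0f;                  // scalar tail for the remainder
    _mm_sfence();                       // make the streaming stores globally visible
}

Whether this beats plain movs depends on whether the destination would otherwise evict useful cache lines; measure before committing to it.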

The days where a process node shrink would allow a processor to be clocked 2 times faster are long gone (ending probably around the 486 and Pentium era). In the intervening years, all the low-hanging fruit has already been picked and eaten by processor manufacturers. There's also the heat wall, where we can't simply clock things much faster without heat and power consumption getting out of hand. Today's meager gains come mostly through small and painstakingly-researched tweaks to various internal buffers, cleverer and cleverer execution of microcode, and just having a few percent more transistors to throw around with each new node.

Something to think about is that literally only 1% (yes, literally) of your CPU's transistor count is actually responsible for calculating things at all. The other 99% of the chip is almost entirely dedicated to hiding latency: caches, TLBs, branch prediction, instruction re-ordering, hyperthreading, register renaming, etc. Only about 9% of that 99% has to do with general chip function as you probably think about it (logical registers, x86-to-microcode decoding, I/O, memory controllers, etc). In other words, the mental model most people have of the structure of CPUs, and where most people mistakenly think performance gains come from, only represents about 10% of the transistors, and that has all been pretty well picked over by processor engineers, so there are very few attainable gains left there.
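A small sketch of why all that latency-hiding machinery matters (my own illustration; the array size and timing method are arbitrary): the sequential walk below lets the caches and prefetchers do their job, while the dependent pointer chase defeats them and exposes something much closer to raw DRAM latency.

// Sequential walk vs. dependent pointer chase over the same amount of data.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 24;      // 16M indices
    std::vector<std::size_t> next(n);

    auto walk = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        std::size_t i = 0, sum = 0;
        for (std::size_t step = 0; step < n; ++step) { i = next[i]; sum += i; }
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms (checksum %zu)\n", label, ms, sum);
    };

    // Sequential: element i points to i+1, so the hardware prefetcher wins.
    std::iota(next.begin(), next.end(), std::size_t{1});
    next.back() = 0;
    walk("sequential walk");

    // Random single cycle: every load depends on the previous one.
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::shuffle(perm.begin(), perm.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i + 1 < n; ++i) next[perm[i]] = perm[i + 1];
    next[perm[n - 1]] = perm[0];
    walk("pointer chase");
    return 0;
}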

throw table_exception("(╯°□°)╯︵ ┻━┻");

Waiting for my quantum holocore CPU....

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.
Check this out: http://msdn.microsoft.com/en-us/library/gg615082.aspx (Scroll down to the tables with test results.)

