GPU ALU (computation) speeds keep getting faster and faster -- so if a shader was ALU-bottlenecked on an old GPU, then on a newer GPU with faster ALU processing, that same shader will likely become memory-bottlenecked -- so faster GPUs need faster RAM to keep up.
Any shader that does a couple of memory fetches is potentially bottlenecked by memory.
Say for example that a memory fetch has an average latency of 1000 clock cycles, and a shader core can perform one math operation per cycle. If the shader core can juggle two thread(-groups) at once, then an optimal shader would only perform one memory fetch per 1000 math operations.
e.g. say the shader was [MATH*1000, FETCH, MATH*1000]: the core would start on thread-group #1, do 1000 cycles of ALU work, perform the fetch, and then have to wait 1000 cycles for the result before it can do its next 1000 cycles of work. While it's blocked here though, it will switch to thread-group #2 and do its first block of 1000 ALU instructions. By the time it gets to thread-group #2's FETCH instruction (which forces it to wait out a 1000-cycle memory latency), the results of thread-group #1's fetch will have arrived from memory, so the core can switch back to thread-group #1 and perform its final 1000 ALU instructions. By the time it's finished doing that, thread-group #2's memory fetch will have completed, so it can finish thread-group #2's final 1000 ALU instructions.
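To make the scheduling concrete, here's a toy Python timeline simulation of that exact scenario. It's only a sketch under the assumptions stated above (1000-cycle fetch latency, 1 ALU op per cycle, instant thread-group switching) -- real GPU schedulers are far more involved, and the simulate() function and model are mine, purely for illustration:

```python
# Toy timeline simulation of the example above (a sketch, not a real GPU scheduler):
# 1 ALU op per cycle by default, a fetch result arrives exactly fetch_latency
# cycles after it's issued, and the core switches thread-groups instantly
# whenever the current one has to wait.

PROGRAM = ["ALU"] * 1000 + ["FETCH"] + ["ALU"] * 1000   # [MATH*1000, FETCH, MATH*1000]

def simulate(num_groups, alu_ops_per_cycle=1, fetch_latency=1000):
    pc = [0] * num_groups          # next instruction index for each thread-group
    ready_at = [0] * num_groups    # cycle at which each thread-group can run again
    cycle = 0
    while any(p < len(PROGRAM) for p in pc):
        runnable = [g for g in range(num_groups)
                    if pc[g] < len(PROGRAM) and ready_at[g] <= cycle]
        if not runnable:
            # every live thread-group is waiting on memory: the core sits idle
            cycle = min(ready_at[g] for g in range(num_groups) if pc[g] < len(PROGRAM))
            continue
        g = runnable[0]
        # run this thread-group's ALU burst until it hits a FETCH (or finishes)
        ops = 0
        while pc[g] < len(PROGRAM) and PROGRAM[pc[g]] == "ALU":
            pc[g] += 1
            ops += 1
        cycle += -(-ops // alu_ops_per_cycle)   # ceil(ops / rate)
        # if the burst ended on a FETCH, issue it and note when the data will arrive
        if pc[g] < len(PROGRAM) and PROGRAM[pc[g]] == "FETCH":
            pc[g] += 1
            ready_at[g] = cycle + fetch_latency
    return cycle

print(simulate(num_groups=2))   # 4000 cycles total -- the fetch latency is fully hidden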
If a GPU vendor doubles the speed of their ALU processing unit -- say it can now do 2 ALU ops per cycle -- it doesn't really make this shader go much faster:
The core initially does thread-group #1's first block of 1000 ALU instructions in just 500 cycles, but then hits the fetch, which will take 1000 cycles. So as above, it switches over to processing thread-group #2 and performs its first block of 1000 ALU instructions in just 500 cycles... but now we're only 500 cycles into a 1000-cycle memory latency, so the core has to go idle for 500 cycles, waiting for thread-group #1's fetch to finish.
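Running the same toy simulate() sketch from above with the doubled ALU rate shows exactly that:

```python
# Reusing simulate() from the sketch above (still just a toy model):
print(simulate(num_groups=2, alu_ops_per_cycle=2))
# -> 2500 cycles instead of 4000: only ~1.6x faster, because the core
#    now sits idle for 500 cycles waiting on thread-group #1's fetch
```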
The GPU vendor would also have to halve their memory latency in order to double the speed of this particular shader.
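Again with the toy sketch above -- doubling the ALU rate *and* halving the fetch latency gets the full 2x back:

```python
# Doubled ALU rate plus halved memory latency (toy model again):
print(simulate(num_groups=2, alu_ops_per_cycle=2, fetch_latency=500))
# -> 2000 cycles: exactly half of the original 4000
```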
Increasing memory speed is hard though. The trend is that processing speed improves 2x every 2 years, but memory speed improves 2x every 10 years... and over those same 10 years processing speed has gotten 32x faster... so over a 10 year span, memory effectively becomes 16x slower relative to processing speed.
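Roughly where that 16x figure comes from -- just the compounding of the rates stated above, nothing measured:

```python
years = 10
processing_gain = 2 ** (years / 2)    # 2x every 2 years  -> 32x over 10 years
memory_gain     = 2 ** (years / 10)   # 2x every 10 years ->  2x over 10 years
print(processing_gain / memory_gain)  # 16.0: memory falls 16x behind processing
```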
Fancy new technologies like HBM aren't really bucking this trend; they're clawing to keep up with it.
So GPU vendors have other tricks up their sleeve to reduce the observed memory latency, independent of the actual memory latency. In my example above, the observed memory latency is 0 cycles on the first GPU and 500 cycles on the second GPU, despite the actual memory latency being 1000 cycles in both cases. Adding more concurrent thread-groups lets the GPU form a deep pipeline and keep the processing units busy while these high-latency memory fetches are in flight.
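One way to put a number on "deeper pipelining": here's a rough rule of thumb for how many thread-groups the core needs in flight to fully hide a fetch. This is my framing, consistent with the toy example above, not any vendor's actual occupancy formula:

```python
from math import ceil

def groups_to_hide_latency(fetch_latency_cycles, alu_ops_between_fetches, alu_ops_per_cycle):
    # while one thread-group waits on its fetch, the other groups must supply
    # enough ALU work to cover the whole latency
    useful_cycles_per_group = alu_ops_between_fetches / alu_ops_per_cycle
    return ceil(fetch_latency_cycles / useful_cycles_per_group) + 1

print(groups_to_hide_latency(1000, 1000, 1))  # 2 -> the first GPU in the example
print(groups_to_hide_latency(1000, 1000, 2))  # 3 -> the doubled-ALU GPU needs a deeper pipeline
```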
So as a GPU vendor increases their processing speed (at a rate of roughly 2x every 2 years), they also need to increase their memory speeds and/or the depth of their pipelining. As above, the industry isn't capable of improving memory at the same rate as processing speeds... so GPU vendors are forced to improve memory speed when they can (when a fancy new technology comes out every 5 years), and increase pipelining and compression when they can't.
On that last point -- yep, GPUs also implement a lot of compression on either end of a memory bus in order to decrease the required bandwidth. E.g. DXT/BC texture formats don't just reduce the memory requirements for your game; they also make your shaders run faster as they're moving less data over the bus! Or more recently: it's pretty common for neighbouring pixels on the screen to have similar colours, so AMD GPUs have a compression algorithm that exploits this fact: they buffer/cache pixel shader output values and losslessly block-compress them before they're written to GPU-RAM. Some GPUs even have hardware dedicated to implementing LZ77, JPEG, H264, etc...
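The bandwidth win from the BC formats is easy to put numbers on -- these are the standard block sizes for BC1/BC3 (a.k.a. DXT1/DXT5); the arithmetic ignores the details of how the texture units actually fetch:

```python
block_pixels = 4 * 4                 # BC formats compress 4x4 pixel blocks
rgba8_bytes  = block_pixels * 4      # 64 bytes per block uncompressed (8 bits per channel)
bc1_bytes    = 8                     # BC1/DXT1: 8 bytes per block  -> 8:1
bc3_bytes    = 16                    # BC3/DXT5: 16 bytes per block -> 4:1
print(rgba8_bytes / bc1_bytes, rgba8_bytes / bc3_bytes)   # 8.0 4.0
```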
Besides hardware-implemented compression, compressing your own data yourself has always been a big optimization win. e.g. back on PS3/Xb360 games, I shaved a good number of milliseconds off the frame-time by changing all of our vertex attributes from 32 bit floats to a mixture of 16 bit float and 16/11/10/8 bit fixed point values, cutting the vertex shader's memory bandwidth requirement by more than half.
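For flavour, here's a sketch of that kind of packing in Python. The attribute layout (positions as 16 bit floats, UVs as 16 bit unorm, normals as 10/10/10 unorm) is a made-up example to show the size win, not the exact formats we shipped:

```python
import struct

def pack_unorm(value, bits):
    """Map a float in [0, 1] to an unsigned fixed-point integer of 'bits' bits."""
    max_val = (1 << bits) - 1
    return int(round(max(0.0, min(1.0, value)) * max_val))

# Before: position (3x float32) + UV (2x float32) + normal (3x float32) = 32 bytes per vertex
pos  = (1.25, -3.5, 0.75)
uv   = (0.5, 0.25)
norm = (0.0, 1.0, 0.5)   # assumed already remapped from [-1, 1] to [0, 1]

# After: 3x float16 position, 2x unorm16 UV, 10/10/10 unorm normal = 14 bytes per vertex
packed_pos = struct.pack("<3e", *pos)
packed_uv  = struct.pack("<2H", *(pack_unorm(v, 16) for v in uv))
packed_n   = struct.pack("<I", pack_unorm(norm[0], 10)
                               | (pack_unorm(norm[1], 10) << 10)
                               | (pack_unorm(norm[2], 10) << 20))
print(len(packed_pos) + len(packed_uv) + len(packed_n), "bytes, down from", 8 * 4)
```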