A) if a texture needed for a triangle is stored in VRAM, that means that when using a tex2D(...) instruction within the shader code, the GPU stalls waiting to get the appropriate pixel from VRAM, am I right?... or does the whole texture get stored in cache?... if so, does that mean that all textures used are stored in cache (bump, diffuse, etc)?
B) when rendering, the GPU needs to write to the appropriate render target, would the whole RT also be in a local cache?... so that means that when changing RTs it needs to send the old RT to VRAM and bring the new one to cache?
C) when changing render states, I believe this would be a matter of just changing a flag in the GPU, so that wouldn't cause any performance issues, would it?... that is, I could go crazy changing states without changing RTs or textures or shader code and it would not have any relevant penalty, right?
D) if VRAM runs out of space, would the textures be stored in system RAM?
C) Pixels are batched up into "segments" on the GPU side. If multiple successive draw-calls have the same state, then their pixels will probably end up in the same "segment". Some state changes will force the end of a segment and the start of a new one, while other state changes won't. There are no rules here; each card may be different. Generally, bigger changes, like changing the shader program, will definitely end a segment, while smaller changes, like changing a texture, may not.
Also, as mentioned by AliasBinman, changing states may have a significant CPU-side overhead within the driver or API code.
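To make the "segment" idea in (C) a bit more concrete, here is a minimal C++ sketch of the usual countermeasure: sorting draw calls so that calls sharing a shader are submitted back to back. DrawCall, shaderId, textureId and the helper functions are all made-up names for illustration, not part of any real API, and the assumption that only a shader change ends a segment is just the simplified model from above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical draw-call record: the IDs stand in for real shader/texture handles.
struct DrawCall {
    int shaderId;
    int textureId;
};

// Counts how often the shader changes between consecutive draw calls.
// In the simplified model above, each such change ends a "segment".
int countShaderSwitches(const std::vector<DrawCall>& calls) {
    int switches = 0;
    for (std::size_t i = 1; i < calls.size(); ++i) {
        if (calls[i].shaderId != calls[i - 1].shaderId) ++switches;
    }
    return switches;
}

// Sorts by the "big" state first (shader), then by the "small" one (texture).
void sortByState(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(), [](const DrawCall& a, const DrawCall& b) {
        return std::tie(a.shaderId, a.textureId) < std::tie(b.shaderId, b.textureId);
    });
}

// Returns how many shader switches sorting would remove (takes a copy on purpose).
int shaderSwitchesSaved(std::vector<DrawCall> calls) {
    const int before = countShaderSwitches(calls);
    sortByState(calls);
    const int after = countShaderSwitches(calls);
    assert(after <= before);  // grouping by shader gives the minimum possible count
    return before - after;
}
```

With six calls that alternate between two shaders, countShaderSwitches reports 5 switches before sorting and 1 after. Whether that translates into a measurable gain depends on the card and driver, as noted above.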
A) As above, when processing pixels, the GPU has a whole "segment" worth of pixels that need to be processed. It can break the pixel shader up into several "passes" of several instructions each, and then perform pass 1 over all pixels in the segment, then pass 2, and so on.
For example, take this code, with comments showing how it might be broken up into passes:
    float3 albedo = tex2D( s_albedo, input.texcoord ).rgb; // pass 1
    albedo = pow( albedo, 2.2 );                           // pass 2
    return float4( albedo, 1 ) * u_multiplier;             // pass 3
Say we've got 400 pixels and 40 shader units. The GPU would be doing something like:
    for( int pass = 0; pass != 3; ++pass )
        for( int i = 0; i < 400; i += 40 )
            RunShader( /*which instructions*/ pass, /*which pixel range*/ i, i + 40 );
So to begin with, it executes pass #1, issuing all the texture fetch instructions, which will read texture data out of VRAM (or the cache) and write that data into the cache. Then after it's issued the fetch instructions for pixels #361-400, it will move on to pass #2 for pixels #1-40. Hopefully by this point in time, the fetch instructions for these pixels have completed, and there's no waiting around (if the fetches are still in progress, there will be a stall). Then, after this pass has performed all its pow calls, the next pass is run, which does some shuffling and multiplication, generating the final result. These results are then sent to the ROP stage.
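The stall condition described above can be put into a toy model. This is only a sketch under loud assumptions: time is counted in abstract "cycles", every batch of pixels costs exactly one cycle per pass, and a fetch issued in cycle b is ready in cycle b + fetchLatency. stallCycles and its parameters are invented for this illustration; no real GPU is this simple.

```cpp
#include <cassert>

// Toy model of latency hiding (all numbers are abstract cycles, not real timings).
// Pass 1 issues texture fetches for one batch of pixels per cycle.
// Pass 2 then visits the batches in the same order and has to wait
// whenever a batch's fetch has not completed yet.
// Returns the total number of cycles pass 2 spends waiting.
int stallCycles(int numPixels, int unitCount, int fetchLatency) {
    assert(numPixels >= 0 && unitCount > 0 && fetchLatency >= 0);
    const int numBatches = (numPixels + unitCount - 1) / unitCount;  // round up

    int now = numBatches;  // pass 1 took one cycle per batch
    int stalls = 0;
    for (int batch = 0; batch < numBatches; ++batch) {
        const int dataReady = batch + fetchLatency;  // fetch was issued in cycle `batch`
        if (now < dataReady) {
            stalls += dataReady - now;  // nothing else to do: the units sit idle
            now = dataReady;
        }
        now += 1;  // run pass 2 on this batch
    }
    return stalls;
}
```

In this model, stallCycles(400, 40, 100) gives 90 wasted cycles, because only 10 batches are in flight to cover a 100-cycle fetch. With stallCycles(4000, 40, 100) there are 100 batches, so the first fetch has completed by the time pass 2 starts and the result is 0.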
The bigger your "segments", the better the GPU can hide latency by working on many pixels at once. Shaders that require a lot of temporary variables will reduce the maximum segment size, because the current state of execution for every pixel needs to be saved when moving on to other pixels (and more temporary variables == bigger state). Also, certain state changes (like changing shaders) will end a segment. So if you have a shader with lots of fetches, you want to draw hundreds (or thousands) of pixels before switching to a different shader.
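As a rough illustration of why temporaries shrink the segment, here is a small sketch that divides a fixed register budget between pixels. The budget, the per-pixel register counts and the function name maxPixelsInFlight are all hypothetical; real hardware allocates registers in its own granularity and the numbers differ per GPU.

```cpp
#include <cassert>

// Hypothetical model: the GPU has a fixed pool of registers to hold the saved
// state of every pixel in flight. More temporaries per pixel means fewer pixels fit.
// The result is rounded down to a whole number of batches of `unitCount` pixels.
int maxPixelsInFlight(int registerBudget, int registersPerPixel, int unitCount) {
    assert(registerBudget >= 0 && registersPerPixel > 0 && unitCount > 0);
    const int pixels = registerBudget / registersPerPixel;
    return (pixels / unitCount) * unitCount;
}
```

With an assumed budget of 4096 registers and 40 shader units, a shader needing 8 registers per pixel could keep 480 pixels in flight, while one needing 32 would be limited to 120, leaving far fewer batches to cover fetch latency.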
B) Some GPUs work this way, especially older ones, or ones that boast having "EDRAM" -- there's a certain (small) bit of memory where render targets must exist to be written to. When setting a target, it has to be copied from VRAM into this area (unless you issue a clear command before drawing), and afterwards it has to be copied from this area back to VRAM (unless you issue a special "no resolve" request). On other GPUs, render-targets can exist anywhere in VRAM (or even main RAM) and there is no unnecessary copying. The ROP stage will perform buffering of writes to deal with the latency issues, similar to the above ideas in (A).
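For the EDRAM-style case, the cost of a render-target switch can be estimated by counting the bytes copied in each direction. The sketch below assumes exactly the two rules stated above (a clear skips the copy in, a "no resolve" request skips the copy out); RenderTarget, sizeInBytes and switchTrafficBytes are illustrative names, not a real API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical render-target description; field names are illustrative only.
struct RenderTarget {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t bytesPerPixel;
};

std::uint64_t sizeInBytes(const RenderTarget& rt) {
    assert(rt.bytesPerPixel > 0);
    return std::uint64_t{rt.width} * rt.height * rt.bytesPerPixel;
}

// Bytes copied when switching from oldRt to newRt on an EDRAM-style GPU:
//  - the old target is copied back to VRAM unless the resolve is skipped
//  - the new target is copied in from VRAM unless it is cleared before drawing
std::uint64_t switchTrafficBytes(const RenderTarget& oldRt, bool resolveOld,
                                 const RenderTarget& newRt, bool clearNew) {
    std::uint64_t bytes = 0;
    if (resolveOld) bytes += sizeInBytes(oldRt);  // EDRAM -> VRAM
    if (!clearNew) bytes += sizeInBytes(newRt);   // VRAM -> EDRAM
    return bytes;
}
```

For two 1280x720 targets at 4 bytes per pixel, a naive switch moves 7,372,800 bytes, clearing the new target first halves that, and skipping the resolve as well brings it to zero.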
D) This depends on the API, driver and GPU. On some systems, the GPU may be able to read from main RAM just like it reads from VRAM, so storing textures in main RAM is not much of a problem. On other systems, the driver will have to reserve an area of VRAM and move textures back and forth between main RAM and VRAM as required... On other systems, texture allocation may just fail when VRAM is full.
* Disclaimer -- all of this post is highly GPU dependent, and the details will be different on different systems. This is just an illustration of how things can work.