So this depends on the operating system, as well as the API being used. I'm only familiar with how things work on Windows, so that's what I'll explain. If you want to consult the actual OS documentation, you'll want to go here.
On Windows Vista through Windows 8 the OS uses WDDM 1.x, where the OS pretty much completely controls VRAM allocation. To simplify things considerably, the way it works is that the OS will keep track of both the dedicated memory on the video card (VRAM), as well as a chunk of system memory that's usually around 2x the size of VRAM. Any apps that use the GPU will tell the driver to create "resources", where a resource is generally a single allocation (typically either a texture or a buffer). The app will then issue rendering commands, which causes the driver to create command buffers that are submitted to the GPU for execution. Whenever the driver submits one of these command buffers, it also has to submit a list of resources that are referenced by that command buffer. So if you issued a draw that uses TextureA and TextureB via shader resource views, the driver will include those resources in the list. This list serves two purposes:
- It allows the kernel-mode driver to patch physical addresses into the command buffer. Older GPUs didn't support virtual addressing, and instead required raw physical addresses provided by the OS's memory manager.
- It allows the OS's memory manager to know which resources are needed for that command buffer to execute. The memory manager uses this list to shuffle resources in and out of VRAM, which allows multiple apps to share the GPU without requiring the sum of all of their resources to fit in VRAM. It also potentially allows a single app to oversubscribe VRAM. Resources that are in VRAM are considered "resident", while resources that have been paged out to system memory are considered "evicted".
In other words, the OS would try its best to make sure that the resources you're actually using stay in VRAM. In that way it's somewhat similar to CPU memory, which can get paged out to the swap file if it's not in use. However, if your app (or multiple apps collectively) tries to use too much memory simultaneously, it will typically manifest as poor performance resulting from the OS frantically moving things in and out of VRAM. In some cases it's also possible for a resource to get moved to an area of CPU memory that's still accessible by the GPU, albeit more slowly than if it were in VRAM. Either way, there are tools you can use to track this down.
On Windows 10 there's a new driver model called WDDM 2.0. Under this driver model GPUs are expected to have virtual addressing support, which simplifies things a bit. Using virtual addresses in command buffers avoids the need for patching, which saves CPU overhead. If the app is using D3D12, control over residency is also given to the app instead of being automatic: the app can use the MakeResident and Evict functions to manually move resources (or heaps) in and out of VRAM. The OS also has a mechanism for notifying apps when their amount of available GPU memory is shrinking, which usually happens because another app has requested VRAM. In this scenario the app is expected to destroy or evict resources, but in practice the OS will start evicting automatically if your app fails to do it. Either way the evicted resource will most likely be slower to access, which can cause performance to degrade. D3D11 apps see the same behavior as they did previously (residency is automatic), but under WDDM 2.0 the driver is responsible for providing it.