Mostly because even very experienced developers and engineers can't agree on whether the separation (DX11) or the merge (GL) is the better design. Arguments about being faster/more efficient, more hardware-friendly, clearer, and easier to use have been made for... both.
On current-generation hardware, when you program the GPU directly (instead of going through D3D/GL), there are structures that map to the GPU-native descriptor formats that the hardware reads when performing memory fetches. For simplicity, they can look like the following (the number 4 is made up and may differ per GPU -- though it is actually accurate for SamplerDesc on the AMD Southern Islands ISA):
typedef uint32_t u32;
struct TextureDesc { u32 registers[4]; }; // where the texels live, their format, width/height, etc.
struct SamplerDesc { u32 registers[4]; }; // filtering mode, wrap modes, LOD settings, etc.
struct BufferDesc  { u32 registers[4]; }; // where the buffer lives, its size/stride, etc.
Internally, ID3D11SamplerState would contain one of these SamplerDesc structures. CreateSamplerState converts the platform-agnostic D3D11_SAMPLER_DESC structure into this GPU-specific structure.
Likewise, an ID3D11ShaderResourceView object contains either a TextureDesc or a BufferDesc, which in turn contains a pointer to the memory allocation, the format of the data, the width/height, etc...
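To make that concrete, here is a minimal sketch of what a driver's CreateSamplerState might do internally. The bit-packing and the Encode* helpers are entirely made up (the real layouts are documented in the GPU vendor's ISA manuals); the point is only the overall shape -- platform-agnostic API struct in, packed 16-byte hardware descriptor out:

#include <cstdint>
#include <d3d11.h>
typedef uint32_t u32;

struct SamplerDesc { u32 registers[4]; }; // the same 16-byte GPU blob as above

// Made-up bit-packing -- the real encodings live in the vendor's ISA docs.
static u32 EncodeFilter(D3D11_FILTER f)             { return (u32)f & 0x1FF; }
static u32 EncodeWrap(D3D11_TEXTURE_ADDRESS_MODE m) { return (u32)m & 0x7; }

SamplerDesc MakeSamplerDesc(const D3D11_SAMPLER_DESC& api)
{
    SamplerDesc hw = {};
    hw.registers[0] = EncodeFilter(api.Filter)
                    | (EncodeWrap(api.AddressU) << 9)
                    | (EncodeWrap(api.AddressV) << 12)
                    | (EncodeWrap(api.AddressW) << 15);
    hw.registers[1] = (u32)(int)(api.MipLODBias * 256.0f); // signed fixed-point LOD bias
    hw.registers[2] = (u32)(int)(api.MinLOD * 256.0f);     // fixed-point LOD clamp
    hw.registers[3] = (u32)(int)(api.MaxLOD * 256.0f);
    return hw;
}
// The ID3D11SamplerState the app gets back is then, conceptually, just a
// CPU-side object wrapping this blob, ready to be copied into a shader's
// per-draw header later on.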
When a shader is compiled into actual GPU-native assembly, it ends up with a per-draw-call header looking something like the one below (where the comments are metadata used by the driver to match up these header entries with API slots). Say our shader has 4 HLSL texture uniforms, 1 sampler uniform and 1 cbuffer uniform --
struct MyShaderNumber42
{
    TextureDesc textures[4]; // diffuse (location: t0), specular (location: t1), normal (location: t2), lightmap (location: t7)
    SamplerDesc samplers[1]; // smp_bilinear (location: s0)
    BufferDesc  buffers[1];  // cbObject (location: b7)
};
The actual GPU-native shader assembly language has instructions to load data from textures/buffers without filtering -- these instructions take a TextureDesc or BufferDesc (and an offset/coordinate) as operands.
It also has instructions to load from textures with filtering -- these take a TextureDesc and a SamplerDesc (and an offset/coordinate).
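Written as C-style pseudo-intrinsics, those two families of instructions might look like the sketch below. The names here are invented; on AMD's Southern Islands/GCN hardware the real instructions are things like buffer_load_dword, image_load and image_sample, which receive the descriptors via scalar registers:

// Shorthand vector types, HLSL-style:
struct float2 { float x, y; };
struct int3   { int x, y, z; };
struct float4 { float x, y, z, w; };

// Unfiltered loads -- only need a TextureDesc/BufferDesc:
float4 load_texel(const TextureDesc& t, int3 texelCoord);
u32    load_dword(const BufferDesc& b, u32 byteOffset);

// Filtered sample -- needs both a TextureDesc and a SamplerDesc:
float4 sample_filtered(const TextureDesc& t, const SamplerDesc& s, float2 uv);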
Assuming you're using regular filtered texture sampling instructions, whether you choose to use one-sampler-per-texture-slot or not has a huge impact on the memory overhead incurred by each draw-call, and therefore on performance.
If you have 4 textures, but they all use the same filtering options, then one-sampler-per-texture-slot results in (sizeof(TextureDesc)+sizeof(SamplerDesc))*4 == 128 bytes of descriptors that have to be fetched per pixel-shader wavefront.
With a shared sampler, you get sizeof(TextureDesc)*4 + sizeof(SamplerDesc) == 80 bytes of descriptor data to fetch per wavefront.
In order for a pixel shader to carry out a texture fetch, its wavefront has to fetch the TextureDesc and SamplerDesc objects from memory first, so that it knows how to perform the fetch and where to fetch from.
Over a single full-screen 1080p draw-call, that's a saving of ~1.5MiB of memory bandwidth [(128-80) bytes * 1920*1080 pixels / 64 pixels-per-wavefront] (or ~89MiB/s if the game is running at 60Hz), simply by not fetching the redundant sampler descriptors. (D3D-style = ~2.5MiB/draw, GL-style = ~4MiB/draw)
That's a decent GPU-side saving... even if I do exaggerate slightly -- that number assumes there is no cache between the shader units and RAM. In practice, the actual traffic between GPU-RAM and the shader units' L2 cache is of course going to be somewhat lower.
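For anyone who wants to verify the arithmetic, here it is as a tiny self-contained program, using the same assumptions as above (16-byte descriptors, 64-pixel wavefronts, zero caching):

#include <cstdio>

int main()
{
    const double descBytes  = 16;                             // sizeof(TextureDesc) == sizeof(SamplerDesc)
    const double glBytes    = (descBytes + descBytes) * 4;    // 4 textures + 4 duplicated samplers = 128
    const double d3dBytes   = descBytes * 4 + descBytes;      // 4 textures + 1 shared sampler     = 80
    const double wavefronts = 1920.0 * 1080.0 / 64.0;         // full-screen draw, 64 pixels/wavefront
    const double MiB        = 1024.0 * 1024.0;

    printf("GL-style : %.1f MiB/draw\n", glBytes  * wavefronts / MiB); // ~4.0
    printf("D3D-style: %.1f MiB/draw\n", d3dBytes * wavefronts / MiB); // ~2.5
    printf("Saving   : %.1f MiB/draw, %.0f MiB/s at 60Hz\n",
           (glBytes - d3dBytes) * wavefronts / MiB,                    // ~1.5
           (glBytes - d3dBytes) * wavefronts * 60.0 / MiB);            // ~89
    return 0;
}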
D3D11 makes this optimization simple for the driver-authors -- the HLSL shader code makes the separation explicit, so they know at compile time how many Texture and Sampler descriptor structures are required, and which objects are the inputs for each fetch instruction. They can generate the appropriate header and fetch instructions at compile-time.
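As a sketch of why this is cheap for D3D11 drivers: the compiled HLSL already carries reflection data listing the distinct texture/sampler/buffer bindings, so the header layout falls straight out of it at compile time. The ShaderReflection type below is a hypothetical stand-in for that driver-internal data:

// Hypothetical summary of the compiled shader's resource bindings.
struct ShaderReflection
{
    int numTextures; // e.g. 4 -- t0, t1, t2, t7 in the example above
    int numSamplers; // e.g. 1 -- s0
    int numBuffers;  // e.g. 1 -- b7
};

// The per-draw header size (and every fetch instruction's descriptor index)
// is therefore known entirely at shader-compile time:
size_t HeaderSizeInBytes(const ShaderReflection& r)
{
    return r.numTextures * sizeof(TextureDesc)
         + r.numSamplers * sizeof(SamplerDesc)
         + r.numBuffers  * sizeof(BufferDesc);
}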
GL makes the drivers very complex, AFAIK: even with the separate sampler objects extension (ARB_sampler_objects), GLSL still doesn't expose the separation, and still acts as if there is one sampler per texture...
This leaves the driver authors two choices:
1) They implement the sampler objects extension to let the users (i.e. us) pretend that we're using the new DX11 style of separate samplers, but internally they still make one sampler descriptor for every texture and then just copy our shared sampler object's contents many times into duplicated descriptors (see the first sketch after this list). This option makes porting from DX11 easier, but incurs the stupid GPU-side penalties described above (and a tiny CPU-side per-draw-call overhead from duplicating the sampler objects all over the place).
2) They actually send the minimal number of SamplerDesc structures to the GPU, as DX11 can easily do. However, this is very complex, because the shaders have been written assuming one-sampler-per-texture-slot, which means that when choosing this option, the driver authors can't fully pre-compile the shaders into GPU-specific assembly ahead of time. So, in order to do this, at draw-call time they have to analyze the currently bound objects and find the unique set of samplers. They then need to potentially patch the shader ASM, generating a smaller header with the right number of SamplerDesc structures, and fix up all of the texture-fetch instructions to reference the correct SamplerDesc structure within the header. They'll then need to cache that modified permutation of the shader so that it can be quickly fetched the next time there's a draw-call with the same kind of sampler bindings (see the second sketch after this list)... This is exacerbated by the fact that if a user has an HLSL shader that uses one texture with two different samplers, then in their GLSL port they'll end up with two textures to represent it! So this advanced GL driver also needs to realize that the unique set of bound textures is smaller than the number of texture-slots declared by the shader, and take that information into account when recompiling/reoptimizing the shader as well...
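First, a sketch of option #1's duplication. The header layout and the 4-texture count mirror the example shader above; everything else is a hypothetical driver-internal detail:

#include <cstdint>
#include <cstring>

struct TextureDesc { uint32_t registers[4]; };
struct SamplerDesc { uint32_t registers[4]; };

// Option #1: keep one SamplerDesc per texture slot, exactly as if the GLSL
// one-sampler-per-texture model were true at the hardware level.
struct GLStyleShaderHeader
{
    TextureDesc textures[4];
    SamplerDesc samplers[4]; // 3 of these 4 will be identical copies
};

void BindSharedSampler(GLStyleShaderHeader& hdr, const SamplerDesc& shared)
{
    for (int slot = 0; slot < 4; ++slot) // tiny CPU cost per draw...
        std::memcpy(&hdr.samplers[slot], &shared, sizeof(SamplerDesc));
    // ...but now the GPU fetches four identical 16-byte descriptors
    // per wavefront instead of one.
}

And second, a sketch of the draw-time work that option #2 forces on the driver. All the types and names here are hypothetical, and the actual ASM patching is elided -- the point is the dedup-then-cache-then-maybe-recompile dance that now sits inside every draw-call:

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct SamplerDesc   { uint32_t registers[4]; };
struct PatchedShader { /* re-patched GPU code + shrunken header layout */ };

// One entry per GLSL sampler slot -> the sampler object currently bound there.
using Bindings = std::vector<const SamplerDesc*>;

PatchedShader& GetShaderVariantForDraw(
    const Bindings& bound,
    std::map<std::vector<int>, PatchedShader>& cache) // keyed on the remap pattern
{
    // 1) Deduplicate: several slots may be bound to the same sampler object.
    std::vector<const SamplerDesc*> unique;
    std::vector<int> slotToUnique(bound.size());
    for (size_t i = 0; i < bound.size(); ++i)
    {
        auto it = std::find(unique.begin(), unique.end(), bound[i]);
        if (it == unique.end()) { unique.push_back(bound[i]); it = unique.end() - 1; }
        slotToUnique[i] = int(it - unique.begin());
    }

    // 2) Reuse a previously patched variant for this slot->sampler pattern...
    auto cached = cache.find(slotToUnique);
    if (cached != cache.end())
        return cached->second;

    // 3) ...or patch now: emit a header with only unique.size() SamplerDescs,
    //    and rewrite every sample instruction to index the right one.
    //    (Expensive -- this is the first-time-draw-call hitch.)
    PatchedShader variant = {}; // actual ASM patching elided
    return cache.emplace(std::move(slotToUnique), variant).first->second;
}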
Needless to say, that's a huge first-time-draw-call overhead (and a moderate every-time-draw-call overhead) to perform an optimization that should be dead simple and done once at shader-load time. You should not be compiling shader code inside your draw-calls... This is another reason why it's important to 'prime' the driver by drawing every object, using its shader and all of its potential pipeline states, once at load time, to ensure the driver has actually finished generating and caching all the code it needs.
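A sketch of that priming pass, with every type here being a hypothetical stand-in for your engine's own:

#include <vector>

// Hypothetical engine types -- substitute your own.
struct Mesh {}; struct Shader {}; struct PipelineState {};
struct DrawItem
{
    Mesh* mesh;
    Shader* shader;
    std::vector<PipelineState> allPossibleStates;
};
struct Renderer
{
    void BindAndDraw(Mesh*, Shader*, const PipelineState&) {} // issues a real draw
};

// Load-time warm-up: draw everything once (e.g. into a dummy off-screen
// target) with every state combination it can ever use, so the driver
// compiles/patches/caches all shader variants now, not mid-gameplay.
void PrimeDriverCaches(Renderer& r, const std::vector<DrawItem>& items)
{
    for (const DrawItem& item : items)
        for (const PipelineState& state : item.allPossibleStates)
            r.BindAndDraw(item.mesh, item.shader, state);
}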
If your GL drivers are advanced enough to use option #2, then using the sampler objects extension may help improve GPU-side performance, but it may come at the cost of occasional CPU-side driver-time spikes due to shader patching... If your GL drivers are using option #1, then it really makes no difference whether you use the sampler objects extension or not.
So basically: separate sampler states are a great choice on modern GPUs, however, it likely doesn't actually matter whether you use them or not under GL because you're probably screwed either way.