Wow, thanks for both the lecture and the pointers.
In the meantime I read some articles, did some thinking and went through your suggestions one by one.
To begin with, I agree with your remark on micro-optimizations; I honestly don't know if I need them. I'm a bit anxious because of the specs of my own system and because I don't have reference tests on older CPUs/GPUs yet (I have an i5-2320, GTX 660 2GB, 8GB RAM, Win7).
Here's what I'm going to do, plus a few questions. If you have another minute... really appreciated.
Actions:
1 - I will give meshes an ID so I can render meshes with the same vertex/index buffer contents and material together (this will definitely save some state changes)
(although in memory they still have individual buffers... hm)
2 - Shared parameters: I believe in my situation the 'ViewProj' matrix is the only one that's shared; I will implement that (quick win)
3 - I will dig into render state setting/changes with PIX; I'm not sure what's going on. I use D3DXFX_DONOTSAVESTATE and set my default render states (six of them) after rendering with the shader. Commenting out that function (so not restoring them) gives the same end result, though (?). I'll look into the article link you posted
4 - Save lots of "if statements"/CPU load by building indexes of meshes/entities per material (I already have this; it only needs to be sorted and moved into arrays with more columns)
5 - I just 'fixed' metrics/scaling and now have a scene of 70x70 meters (a small desert village). I'll add 8 sand hill instances around it (with some trees), so I have 9 'subscenes'/partitions, or whatever you'd call them.
6 - Prefer looping through materials first and then through meshes. This will save setting parameters for materials, but increase the number of mesh setups (world matrix, stream source etc.), since one mesh might have entities with different materials. Is it correct to assume that setting a material in an effect costs less performance than setting up a mesh with its parameters?
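To make actions 1, 4 and 6 concrete for myself, here's a rough sketch in plain C++ (no D3D calls) of what I think the sorting would look like. DrawItem, sortForRendering and countSwitches are names I made up, and the integer IDs just stand in for my real materials and vertex/index buffers. It only counts how often a material or mesh would have to be set again for a given draw order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical per-entity record; the IDs stand in for the real
// effect/material and vertex/index buffer objects.
struct DrawItem {
    int materialId;
    int meshId;
};

struct SwitchCount {
    std::size_t material = 0;
    std::size_t mesh = 0;
};

// Sort by material first, then by mesh, so equal states end up adjacent.
inline void sortForRendering(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.materialId, a.meshId) <
                         std::tie(b.materialId, b.meshId);
              });
}

// Count how often the material or the mesh differs from the previous item,
// i.e. how often it would have to be set again when drawing in this order.
inline SwitchCount countSwitches(const std::vector<DrawItem>& items) {
    SwitchCount n;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].materialId != items[i - 1].materialId) ++n.material;
        if (i == 0 || items[i].meshId != items[i - 1].meshId) ++n.mesh;
    }
    // Sanity check: there can never be more switches than draws.
    assert(n.material <= items.size() && n.mesh <= items.size());
    return n;
}
```

On a small mixed list this shows the trade-off from action 6: sorting by material first cuts the material switches, but the number of mesh switches can go up.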
Questions:
1 - What's the advantage of multiplying the world matrix of each mesh with the view-projection matrix and then passing only the end result to the shader
(compared to doing the multiplication in the shader)? Does this take 'CPU' time and free up 'GPU' time?
I now do this in the shader and could change it accordingly (depending on the gain):
* float4 worldPosition = mul(input.Pos, World);
* Out.Pos = mul(worldPosition, ViewProj);
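To check my own understanding of the alternative: with row vectors, as in the two HLSL lines above, mul(mul(p, World), ViewProj) gives the same result as mul(p, World * ViewProj), so the product could be computed once per mesh on the CPU instead of once per vertex on the GPU. Below is a plain C++ sketch of that identity. Vec4, Mat4 and the helper functions are made-up stand-ins for the D3DX types, and the uniform scale is only a placeholder for a real view-projection matrix.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][col]

// Row vector times matrix, same convention as HLSL mul(vector, matrix).
inline Vec4 mul(const Vec4& v, const Mat4& m) {
    Vec4 r{};  // zero-initialized
    for (int c = 0; c < 4; ++c)
        for (int k = 0; k < 4; ++k)
            r[c] += v[k] * m[k][c];
    return r;
}

// Matrix product a * b.
inline Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

inline Mat4 identity() {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    return m;
}

// Translation lives in the last row when using row vectors.
inline Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[3][0] = x; m[3][1] = y; m[3][2] = z;
    return m;
}

inline Mat4 uniformScale(float s) {
    Mat4 m = identity();
    m[0][0] = m[1][1] = m[2][2] = s;
    return m;
}

inline bool nearlyEqual(const Vec4& a, const Vec4& b, float eps = 1e-4f) {
    assert(eps > 0.0f);
    for (int i = 0; i < 4; ++i)
        if (std::fabs(a[i] - b[i]) > eps) return false;
    return true;
}
```

If the shader also needs worldPosition on its own (for lighting, say), World would still have to be passed in next to the combined matrix.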
2 - Spatial division.
I see a few options/ideas:
* build up the 'subscenes'/areas while loading a scene, for example 100x100m is a scene
* check the camera position against the areas/spaces and cull on that VERSUS cull the areas based on the camera's look-at vector and frustum
* render only the active area VERSUS this one + the next one facing the camera
(the first option requires that, when modelling, I 'block' the views into the next areas)
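For the simplest variant (camera position only, plus the next area the camera faces), this is roughly what I have in mind. It's a sketch under my own assumptions: a 3x3 grid of 70 m areas as in action 5, coordinates measured from the minimum corner of the grid, and made-up names (Cell, cellAt, nextCellFacing, inGrid). Real frustum culling would test each area's bounding box against the frustum planes instead.

```cpp
#include <cassert>
#include <cmath>

constexpr float kAreaSize = 70.0f;  // meters per area (assumption from action 5)
constexpr int kGridDim = 3;         // 3x3 areas (assumption)

struct Cell {
    int col;
    int row;
};

inline bool inGrid(Cell c) {
    return c.col >= 0 && c.col < kGridDim && c.row >= 0 && c.row < kGridDim;
}

// Cell containing the point (x, z); the result may lie outside the grid.
inline Cell cellAt(float x, float z) {
    return {static_cast<int>(std::floor(x / kAreaSize)),
            static_cast<int>(std::floor(z / kAreaSize))};
}

// Neighbour the camera is mostly looking towards: step one cell along the
// dominant axis of the horizontal look direction (dirX, dirZ).
inline Cell nextCellFacing(Cell current, float dirX, float dirZ) {
    assert(dirX != 0.0f || dirZ != 0.0f);  // look direction must not be zero
    if (std::fabs(dirX) >= std::fabs(dirZ))
        current.col += (dirX > 0.0f) ? 1 : -1;
    else
        current.row += (dirZ > 0.0f) ? 1 : -1;
    return current;
}
```

The idea would be to render the cell from cellAt plus the one from nextCellFacing, but only if inGrid says that neighbour actually exists.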
3 - Sorting models by geometry.
The way you explain it, I could set the stream source and indices just once for multiple meshes (sharing parameters like effect, technique and texture/material).
Most meshes have their own world matrix, so I don't see how to do this, because I need to set the world matrix anyway (unless I combine the meshes' vertex and index buffers into one buffer, with one 'general' world matrix for that set of meshes? That sounds way too complex for me, given that these micro-optimizations may not even be necessary).
4 - Checking for redundant vertex buffer (/index buffer) setting: this sounds unnecessary when the meshes are sorted correctly.
Is this correct or are there other reasons to do this?
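For reference, this is the kind of check I mean, as a minimal sketch. StreamSourceCache is a made-up name and the integer ID stands in for the real buffer pointer; the actual device call is only hinted at in a comment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper that remembers the last bound buffer and skips
// repeated binds of the same one.
class StreamSourceCache {
public:
    // Returns true if the buffer actually had to be bound.
    bool bind(int bufferId) {
        assert(bufferId >= 0);  // -1 is reserved for "nothing bound yet"
        if (bufferId == current_) return false;  // redundant, skip
        current_ = bufferId;
        ++bindCount_;
        // Real code would call SetStreamSource / SetIndices on the device here.
        return true;
    }

    std::size_t bindCount() const { return bindCount_; }

private:
    int current_ = -1;
    std::size_t bindCount_ = 0;
};
```

With a correctly sorted list the check would only fire once per run of equal meshes, which is why I suspect it adds nothing beyond the sorting itself.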
5 - Batching: I'm going to check how many triangles I render per draw call, just out of curiosity. I read that draw calls should be reduced as much as possible, with more triangles per draw call (because a draw call takes roughly the same time even with more triangles, thus increasing performance). Might this also be a reason to combine meshes into combined vertex/index buffers with a shared world matrix?
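If I understand the combining idea correctly, it would look something like the sketch below: vertices are moved into a shared space up front and the indices are shifted by the number of vertices already in the batch. This is only my assumption of how it works. Mesh, Vec3, appendToBatch and triangleCount are made-up names, I only apply a translation instead of a full world matrix, and it would only make sense for static meshes that share a material.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 {
    float x, y, z;
};

struct Mesh {
    std::vector<Vec3> vertices;
    std::vector<std::uint32_t> indices;  // triangle list
};

// Append src to batch. The vertices are pre-translated by offset (a
// simplified stand-in for the mesh's world matrix) and the indices are
// shifted so they still point at the right vertices in the combined buffer.
inline void appendToBatch(Mesh& batch, const Mesh& src, Vec3 offset) {
    assert(src.indices.size() % 3 == 0);  // must be a triangle list
    const std::uint32_t base = static_cast<std::uint32_t>(batch.vertices.size());
    for (const Vec3& v : src.vertices)
        batch.vertices.push_back({v.x + offset.x, v.y + offset.y, v.z + offset.z});
    for (std::uint32_t i : src.indices)
        batch.indices.push_back(base + i);
}

inline std::size_t triangleCount(const Mesh& m) {
    return m.indices.size() / 3;
}
```

Two quads of 2 triangles each would then become one batch of 4 triangles that can go out in a single draw call with one shared (identity) world matrix.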
Looking forward to your answers and ideas.
I'm also curious what hardware/specs you have, maybe to run a reference test on after my optimizations.