# My Terrain's multi-texture pixel shader is killing me! (FPS)


## Recommended Posts

I'm using a pixel shader to do custom multi-texturing. What I'm doing is using a texture map. This texture map is then sampled 4x in the pixel shader and added to the pixel to give this result on my terrain. Each vertex on my terrain can have one texture assigned to it. If a quad of vertices each has a different texture, it will interpolate between the different textures within the texture map.

Problem: I'm getting 20 FPS in release mode. If I change the pixel shader to output a constant color, it jumps to 200 FPS. So I've found the bottleneck; I'm just not sure how to optimize it.

Here's my code for the pixel shader.
//	The terrain's Pixel shader works on sampling from a texture map. Which is 4 textures
// combined into one 512x512 texture map.
//
// The vertices look like this
// *-*
// |\|
// *-*
//
// Tex0 is the U,V's for the texture map in the format
// 0,0  1,0
// 0,1  1,1
//
float4 TerrainPS( float4 Normal : TEXCOORD0, float2 Tex0 : TEXCOORD1,
float2 Tex1 : TEXCOORD2,
float2 Tex2 : TEXCOORD3 ) : COLOR
{
// This calculates all the texture coordinates
// To find the first texture coordinate, simply divide by two.
// The others you need to offset by 0.5f to get their correct index.
float2 texCoord0 = Tex0 / 2.0f;

float2 texCoord1 = texCoord0;
texCoord1.x += 0.5f;

float2 texCoord2 = texCoord0;
texCoord2.y += 0.5f;

float2 texCoord3 = texCoord0;
texCoord3.x += 0.5f;
texCoord3.y += 0.5f;

// Sample the texture map by the correct texture coords.
float4 texColor0 = tex2D( S0, texCoord2 );
float4 texColor1 = tex2D( S0, texCoord1 );
float4 texColor2 = tex2D( S0, texCoord0 );
float4 texColor3 = tex2D( S0, texCoord3 );

// This is the multi-texturing stage process.
//
// Tex1 and Tex2 represent the four textures
// Texture 0 == Tex1.x
// Texture 1 == Tex1.y
// Texture 2 == Tex2.x
// Texture 3 == Tex2.y
//
// They are set to 1 at the vertex if you painted a texture there
// Otherwise they are set to 0. This lets the texture fade across.
texColor0 *= Tex1.x;
texColor1 *= Tex1.y;
texColor2 *= Tex2.x;
texColor3 *= Tex2.y;

float4 texColor = texColor0 + texColor1 + texColor2 + texColor3;

// Note Normal is the diffuse component calculated in the vertex shader.
return texColor * Normal; //+ ambientMtrl * texColor;
}



##### Share on other sites
BTW, I've moved the calculation of the four texture coordinates from the pixel shader up to the vertex shader, and I get the same FPS.

I tried sampling once and I got something like 80 FPS which isn't too bad. But I need to sample 4 times.

##### Share on other sites
I reduced the texture map to 128x128 and it gave a small boost in FPS, though the loss of quality really shows.

##### Share on other sites
Why can't you calculate texcoord1->texcoord3 in the vertex pipeline and interpolate them through regular texture coordinate indices? Adding the same constant for each pixel is totally redundant...

I mean, hell, it looks like what you're doing there could be achieved using the fixed function pixel pipeline. Accumulate four texture reads using coords generated in the vertex shader, followed by a multiply by the interpolated diffuse value.

##### Share on other sites
My terrain currently does a similar type of blending, only the blend factors are stored per vertex instead of in a texture map. But I get the same problem nonetheless.

The problem is you are burning up fill rate. What class of video card are you using? On my GeForce 3 (4 simultaneous texture units) I get around 80-100 FPS when rendering 65,000 triangles of terrain @ 640x480. But the bottleneck is definitely sampling all those textures. You can test this by just changing the device resolution and seeing whether it has a big impact on performance.

The only real solution that I have come across is to just use one large color texture over the whole terrain with a detail texture to reduce the blurriness. You will need to figure out how to reduce the number of textures that you are using.

Hope this helps.
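
The "one large color texture plus a detail texture" idea above could look roughly like the following pixel shader. This is only a sketch under assumptions: the sampler names, the function name `TerrainDetailPS`, the input layout, and the modulate-2x combine are illustrative and do not come from the post. It costs two texture fetches per pixel instead of four.

```hlsl
// Hypothetical samplers: names and register assignments are assumptions.
sampler2D ColorMapSampler : register(s0);  // one large color texture stretched over the terrain
sampler2D DetailSampler   : register(s1);  // small tiling detail texture, roughly mid-grey

float4 TerrainDetailPS( float4 Diffuse   : TEXCOORD0,    // per-vertex diffuse term
                        float2 TerrainUV : TEXCOORD1,    // 0..1 across the whole terrain
                        float2 DetailUV  : TEXCOORD2 )   // TerrainUV scaled up in the vertex shader so it tiles
                        : COLOR
{
    float4 baseColor = tex2D( ColorMapSampler, TerrainUV );
    float4 detail    = tex2D( DetailSampler, DetailUV );

    // Modulate-2x: a detail value of 0.5 leaves the base color unchanged,
    // brighter or darker detail texels add high-frequency variation.
    return baseColor * detail * 2.0f * Diffuse;
}
```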

##### Share on other sites
Quote:
 Original post by superpig
 Why can't you calculate texcoord1->texcoord3 in the vertex pipeline and interpolate them through regular texture coordinate indices? Adding the same constant for each pixel is totally redundant...

Yeah, I just did that, though it didn't change the frame rate by more than a few FPS.

Here's my updated code.
//
//	The terrain's Pixel shader works on sampling from a texture map. Which is 4 textures
// combined into one 512x512 texture map.
//
// The vertices look like this
// *-*
// |\|
// *-*
//
// Tex0 is the U,V's for the texture map in the format
// 0,0  1,0
// 0,1  1,1
//
float4 TerrainPS( float4 Normal : TEXCOORD0, float2 Tex0 : TEXCOORD1,
float2 Tex1 : TEXCOORD2,
float2 Tex2 : TEXCOORD3,
float2 Tex3 : TEXCOORD4, // Tex coordinate 0
float2 Tex4 : TEXCOORD5, // Tex coordinate 1
float2 Tex5 : TEXCOORD6, // Tex coordinate 2
float2 Tex6 : TEXCOORD7 ) : COLOR // Tex coordinate 3
{
// Sample the texture map by the correct texture coords.
float4 texColor0 = tex2D( S0, Tex5 );
float4 texColor1 = tex2D( S0, Tex4 );
float4 texColor2 = tex2D( S0, Tex3 );
float4 texColor3 = tex2D( S0, Tex6 );

// This is the multi-texturing stage process.
//
// Tex1 and Tex2 represent the four textures
// Texture 0 == Tex1.x
// Texture 1 == Tex1.y
// Texture 2 == Tex2.x
// Texture 3 == Tex2.y
//
// They are set to 1 at the vertex if you painted a texture there
// Otherwise they are set to 0. This lets the texture fade across.
texColor0 *= Tex1.x;
texColor1 *= Tex1.y;
texColor2 *= Tex2.x;
texColor3 *= Tex2.y;

// Add them together for additive blending.
float4 texColor = texColor0 + texColor1 + texColor2 + texColor3;

// Note Normal is the diffuse component calculated in the vertex shader.
return texColor * Normal; //+ ambientMtrl * texColor;
}

p.s. To jason, I'd love to have just one colored texture across the terrain but as per requirements I must be able to paint four different textures on to the terrain.

##### Share on other sites
What hardware are you testing it on?

It'd be worth trying to get a look at the assembly it generates, too. If I were writing this in assembly I'd write this:

ps.1.1
tex t0
tex t1
tex t2
tex t3
add r0, t0, t1
add r0, r0, t2
add r0, r0, t3
mul r0, r0, v0

That assumes the texture is bound to four separate stages, of course - as far as I know, multiple reads from the same texture stage ain't a good thing.

##### Share on other sites
That assembly shader does not perform the same operations that ph33r's HLSL does. Specifically, it does not account for the per-texture blend weights - you need mads for that.

I used fxc to generate the assembly output from ph33r's shader, and it's 4 texture instructions followed by 5 arithmetic. That's not a big requirement really. I'll have to think about it some more before I come to any conclusions.

-Mezz

##### Share on other sites
Quote:
 Original post by superpig
 What hardware are you testing it on?

Quote:
 It'd be worth trying to get a look at the assembly it generates, too.

//
// Generated by Microsoft (R) D3DX9 Shader Compiler 5.04.00.2904
//
// Parameters:
//
//   sampler2D Texture0;
//
//
// Registers:
//
//   Name         Reg   Size
//   ------------ ----- ----
//   Texture0     s0       1
//

ps_2_0
dcl t0
dcl t2.xy
dcl t3.xy
dcl t4.xy
dcl t5.xy
dcl t6.xy
dcl t7.xy
dcl_2d s0
texld r3, t5, s0
texld r2, t6, s0
texld r1, t4, s0
texld r0, t7, s0
mul r3, r3, t2.y
mad r2, r2, t2.x, r3
mad r1, r1, t3.x, r2
mad r0, r0, t3.y, r1
mul r0, r0, t0
mov oC0, r0

// approximately 10 instruction slots used (4 texture, 6 arithmetic)

Quote:
 That assumes the texture is bound to four separate stages, of course - as far as I know, multiple reads from the same texture stage ain't a good thing.

The texture is only bound to one stage. I didn't know there would be a benefit from putting one texture on multiple stages. I assumed just using one texture one stage and four reads from it would be good.

Quote:
 Original post by Mezz
 I'll have to think about it some more before I come to any conclusions.

Thanks mezz.

##### Share on other sites
Quote:
 Original post by Mezz
 That assembly shader does not perform the same operations that ph33r's HLSL does, specifically there is no account for the optional blending of textures - you need mads for that.

Oh, whoops, I missed that in the original HLSL code.

I don't have the hardware to play with ps_2_0 stuff, so I've not been able to test whether having the texture bound to four separate samplers is faster than reusing one. You'd have to profile it. Ah - if you've got NVidia hardware, have you tried using NVPerfHUD?

I don't think there's anything here that actually requires ps_2_0 though.

##### Share on other sites
You know, you're currently burning 7 texture coordinate sets to do that. You can take advantage of the fact that texcoords are float4. You can sample the first texture with (tex0.xy), the second with (tex0.zw), and so on. And you can store all the blending factors into just one texcoord. Not sure if this will increase performance, but it might. And try to bind to 4 different stages. It may enable better pipelining.
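
A rough sketch of that packing, assuming the vertex shader is changed to write two UV pairs per `float4` output and all four blend factors into one more. The names `AtlasSampler`, `TerrainPackedPS`, `UV01`, `UV23`, and `Weights` are made up for illustration, and the layer numbering is generic rather than the original post's ordering. This cuts the interpolators from seven to four; whether it is faster would need profiling, since on some hardware sampling with a `.zw` swizzle may be treated as a dependent read.

```hlsl
// Hypothetical sampler name; the atlas is the 4-in-1 texture map from the original post.
sampler2D AtlasSampler : register(s0);

float4 TerrainPackedPS( float4 Diffuse : TEXCOORD0,    // per-vertex diffuse term
                        float4 UV01    : TEXCOORD1,    // xy = atlas UV for layer 0, zw = layer 1
                        float4 UV23    : TEXCOORD2,    // xy = atlas UV for layer 2, zw = layer 3
                        float4 Weights : TEXCOORD3 )   // one blend weight per layer (x, y, z, w)
                        : COLOR
{
    // Four fetches from the same atlas, one per packed coordinate pair.
    float4 c0 = tex2D( AtlasSampler, UV01.xy );
    float4 c1 = tex2D( AtlasSampler, UV01.zw );
    float4 c2 = tex2D( AtlasSampler, UV23.xy );
    float4 c3 = tex2D( AtlasSampler, UV23.zw );

    // Weighted sum of the layers, then per-vertex lighting.
    float4 blended = c0 * Weights.x + c1 * Weights.y
                   + c2 * Weights.z + c3 * Weights.w;
    return blended * Diffuse;
}
```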

##### Share on other sites
Mikeman:

Should I bind the same texture to four different stages? Or just use 4 textures on 4 stages instead of a texture map?

##### Share on other sites
Quote:
 Original post by superpig
 Oh, whoops, I missed that in the original HLSL code.

Hey no probs, I did myself when I originally looked at it - it was only when I copy & pasted it into a file and ran fxc and I was like "why is this generating mads?" that I went back and looked at it properly.

Quote:
 Original post by superpig
 I don't think there's anything here that actually requires ps_2_0 though.

I think I'd agree, apart from possibly running out of texture stages (because you'd need to use the interpolators for those texcoords which tell you whether or not the texture is active on that vertex or not).

I still can't think why this shader would be particularly slow. You can try binding the texture to different stages... it might help it go through some other silicon. You could also try interleaving the texture/arithmetic instructions to hide latency, but I'd have thought the internal compiler would've done that for you *shrug*.

I can't really think of much else, if you're on NV hardware, you could try PerfHud as superpig suggested.

-Mezz

##### Share on other sites
If you keep the texture sampling but remove the math part of the shader, does performance increase? In that case you're probably tex bandwidth bound rather than actually fill rate bound. If that is the case, make sure mip mapping is on and try texture compression and lowering texture resolution to improve performance.
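
One way to run that experiment is a throwaway shader that keeps the four fetches but drops the blend weights and lighting. This is a sketch only: `TerrainSampleOnlyPS` and `AtlasSampler` are made-up names, and the inputs assume the updated vertex shader that passes the four atlas coordinates in TEXCOORD4 to TEXCOORD7. The samples are still summed, because a fetch whose result is unused would likely be removed by the compiler.

```hlsl
// Hypothetical sampler name for the terrain texture map.
sampler2D AtlasSampler : register(s0);

// Diagnostic only: same four fetches as the real shader, minimal arithmetic.
float4 TerrainSampleOnlyPS( float2 UV0 : TEXCOORD4,
                            float2 UV1 : TEXCOORD5,
                            float2 UV2 : TEXCOORD6,
                            float2 UV3 : TEXCOORD7 ) : COLOR
{
    float4 sum = tex2D( AtlasSampler, UV0 )
               + tex2D( AtlasSampler, UV1 )
               + tex2D( AtlasSampler, UV2 )
               + tex2D( AtlasSampler, UV3 );

    // Average so the output stays in a displayable range.
    return sum * 0.25f;
}
```

If the frame rate barely moves compared with the full shader, the cost is in the fetches rather than the arithmetic.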

##### Share on other sites
Couldn't he test that by switching temporarily to 1x1 sized textures too? I think I heard that's how NVPerfHUD measures things...

##### Share on other sites
Sure, that would work. Except if he's bound by the actual lookup instructions and not bandwidth of course.

##### Share on other sites
First of all don't use PS_2_0 for such things. PS_1_1/PS_1_3 is enough.
Second, you can put all the multiplier constants into ONE texcoord.

float4 texColor0 = tex2D( S0, Tex1 );
float4 texColor1 = tex2D( S0, Tex2 );
float4 texColor2 = tex2D( S0, Tex3 );
float4 texColor3 = tex2D( S0, Tex4 );

texColor0 *= Tex5.x;
texColor1 *= Tex5.y;
texColor2 *= Tex5.z;
texColor3 *= Tex5.w;

.. Etc.

Last but not least, you CAN write this pixel shader with the fixed function pipeline using some smart evaluation.
Since the multipliers are monochrome floats you can store them per vertex at the diffuse / specular components.

However, IMO, if the PS_1_3/PS_1_1 version is still slow, the bottleneck is just sampling the textures. If reducing the texture size does not work, and removing the blending multipliers (Tex5) does not help either, your card may be too slow, or you may just need to update your drivers.

##### Share on other sites
Basically sounds like the Far Cry method of texturing terrain. My renderer divides the terrain into chunks for LOD reasons. What I did to optimize it was: check whether a chunk really needs all 4 texture units; if it doesn't, use a simpler pixel shader that only blends 2 texture units, for example (Far Cry also does that).
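
Applied to the HLSL shader from earlier in the thread, such a reduced variant might look like the sketch below. It assumes the application detects chunks where only two layers are painted and passes just those two coordinate sets and weights; the names `TerrainTwoLayerPS`, `AtlasSampler`, `UVA`, and `UVB` are illustrative.

```hlsl
// Hypothetical sampler name for the terrain texture map.
sampler2D AtlasSampler : register(s0);

// Reduced variant for chunks that only use two of the four layers.
float4 TerrainTwoLayerPS( float4 Diffuse : TEXCOORD0,    // per-vertex diffuse term
                          float2 Weights : TEXCOORD1,    // x = weight of layer A, y = weight of layer B
                          float2 UVA     : TEXCOORD2,    // atlas UV for layer A
                          float2 UVB     : TEXCOORD3 )   // atlas UV for layer B
                          : COLOR
{
    float4 a = tex2D( AtlasSampler, UVA );
    float4 b = tex2D( AtlasSampler, UVB );

    // Two fetches and one weighted sum instead of four of each.
    return ( a * Weights.x + b * Weights.y ) * Diffuse;
}
```
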
Another thing that is a potential performance burner is the "intuitive" rendering order that would be like:

for each chunk
{
Render base texture
Render multitextured detail
}

These kinds of shader changes are extremely expensive. Try to render all base textures first, then (if you use the method I described above) render all chunks that use 4 units, then 3, 2, 1... you get the idea.

Hope it helps

Edit: Oh, by the way, what I forgot: you should also restrict the detail texturing distance in some way. It's usually not worth detail texturing anything far away, because it will eat lots of fill rate without much of a visual change.

Another thing that may help you is to keep the size of your textures constant while enabling DXT compression. Alternatively, you could play around with the filtering mode (although this one definitely worsens your visual quality, while the quality loss through compression is pretty negligible for detail textures).

This is my code if you're curious. It's GLSL (it uses the RGBA vertex color as intensities for the different textures).
[VP]
varying vec2 texCoord0;
uniform vec3 eyePos;
uniform float lodBias;

void main()
{
texCoord0=gl_MultiTexCoord0.xy*128.0;
vec3 diff=eyePos-gl_Vertex.xyz;
float inten1=1.0-clamp(dot(diff,diff)/(15000.0*lodBias),0.0,1.0);
// this serves to create a fade out effect at the lod edges
gl_FrontColor.rgba = gl_Color.rgba*inten1;
gl_Position = ftransform();
}

[FP]
uniform sampler2D detMap0;
uniform sampler2D detMap1;
uniform sampler2D detMap2;
uniform sampler2D detMap3;
varying vec2 texCoord0;

void main ()
{
vec4 tempcolor=vec4(0.5)-0.5*gl_Color ;
vec3 detColor0 = texture2D(detMap0, texCoord0).xyz;
detColor0 = detColor0*gl_Color.r + tempcolor.r;
vec3 detColor1 = texture2D(detMap1, texCoord0).xyz;
detColor1 = detColor1*gl_Color.g + tempcolor.g;
vec3 detColor2 = texture2D(detMap2, texCoord0).xyz;
detColor2 = detColor2*gl_Color.b + tempcolor.b;
vec3 detColor3 = texture2D(detMap3, texCoord0).xyz;
detColor3 = detColor3*gl_Color.a + tempcolor.a;
gl_FragColor=vec4(detColor0 * detColor1 * detColor2 * detColor3 * 8.0,1.0);
}

[Edited by - Dtag on January 10, 2005 9:23:03 AM]

##### Share on other sites
Something that hasn't been mentioned yet, are you taking any measures to reduce overdraw? Since the shader is expensive anyway, it is essential. Maybe you should work on this area too. I mean, it's not only how long the shader is, it's also how many times it gets executed. I suppose you have at least backface culling on, right?