Possible problem with ATI + OpenGL (GLSL)

(This is a long post.) I am developing a program that makes heavy use of vertex shaders and especially the programmable fragment shader. At the start of the project GLSL was chosen as the implementation language. I have, however, run into a problem: on an NVIDIA GeForce 6800 the program renders at ~15 fps, while on either a Radeon 9800 Pro or an X800 Pro it takes about three minutes per frame. I am quite baffled as to why that happens, since I can't see any reason for it. I am hesitant to post the actual shaders, as I am unsure whether I am allowed to publish them yet, but here is how the system is implemented.

Sending data is currently done through immediate-mode calls because of the layout of the data that needs to be sent. The reason: I need to send six arrays of vec3 (float) that stay constant for only ~15 vertices before they change value again (the number of vertices in the stream is on the order of 100k-1,000k), and everything is recalculated every frame, so nothing can really be assumed constant. Using immediate-mode calls to glMultiTexCoord3fvARB, sending the data down through texture units GL_TEXTURE0_ARB to GL_TEXTURE5_ARB, saves me from manually copying it around in memory; immediate-mode calls act like "state changes" in OpenGL for the texcoord streams. So what I send down per vertex is: 3 floats (vertex position) plus 6x3 floats (texcoords).

There is only one texture bound. It is a floating-point texture rendered to with the render-to-texture extension (the internal format is ATI's RGBA 16-bit float format, supported by both the GeForce 6800 series and the Radeon 9800/X800 cards). The render target of the entire operation is an 8-bit texture, again via the render-to-texture extension, combined with a simple glBlendFunc(). [The render-to-texture path, the pbuffers and the textures are all working; this has been thoroughly tested on both a GeForce 6800 and a Radeon 9800 with a slightly different data-submission mechanism and other shaders. I just want to state this up front to avoid any confusion or assumptions.]

A vec4 containing the viewport and a float containing 1.0/float(screen_resolution) are sent down as uniforms on shader creation. On a per-frame basis a light vector is also sent down as a uniform. That is how all data reaches the shaders. The vertex shader I can post, as it doesn't contain any *sensitive* algorithm or method:

uniform sampler2D DepthTexture;
uniform vec3 lightVector;
varying float fresnel;
varying vec3 Point[3];
varying vec3 Line[3];
varying vec4 TriNorm;

void main()
{
	gl_Position = ftransform();

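	// Triangle data arrives per vertex through texcoord streams 0-5 (see the description above).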
	Point[0] = gl_MultiTexCoord0.xyz;
	Point[1] = gl_MultiTexCoord1.xyz;
	Point[2] = gl_MultiTexCoord2.xyz;
	Line[0] = normalize(gl_MultiTexCoord3.xyz-Point[0]);
	Line[1] = normalize(gl_MultiTexCoord4.xyz-Point[1]);
	Line[2] = normalize(gl_MultiTexCoord5.xyz-Point[2]);

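	// xyz: unit face normal; w: length of the edge cross product (twice the triangle's area).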
	TriNorm.xyz = normalize(-cross(Point[1]-Point[0],Point[2]-Point[0]));
	TriNorm.w = length(cross(Point[1]-Point[0], Point[2]-Point[0]));
	
	vec3 lightVec = normalize(lightVector);
	/*float a = acos(dot(TriNorm.xyz,-lightVector));
	float b = acos(dot(-TriNorm.xyz,refract(lightVector, TriNorm.xyz, 1.0/1.33)));
	fresnel = ( sin(a-b)*sin(a-b) / (sin(a+b)*sin(a+b)) ) + tan(a-b)*tan(a-b)*tan(a+b)*tan(a+b); */
	fresnel = 1.0; // (1.0-fresnel);
} 

The fresnel term is turned on for the GeForce 6800 and off for all ATI cards, by the way, since in test applications the Radeon 9800 (at least on my system) dropped its framerate by roughly a factor of 100 when executing GLSL's "refract" function.

The fragment shader I don't want to paste in case it gets me into trouble, but I can describe some of its properties. Using NVIDIA's ShaderPerf tool it has been possible to see what the shader compiles down to; here is the output of "NVShaderPerf -a NV40 fragmentshader.glsl":

# 51 instructions, 7 R-regs, 1 H-regs
-------------------- NV40 --------------------
Target: GeForce 6800 Ultra (NV40) :: Unified Compiler: v61.77
Cycles: 51.36 :: R Regs Used: 10 :: R Regs Max Index (0 based): 9
Pixel throughput (assuming 1 cycle texture lookup) 125.49 MP/s

This is indeed a big shader, but from what I have been able to see it should be well within the resource limits (instruction count and register use) of a Radeon 9800 and an X800. The shader only uses a very small set of ARB_fragment_program instructions:

ADDR (addition)
MULR (multiplication)
MADR (multiply-add)
MOVR (move) (just a few)
RCPR (1.0 / x) (only 3)
RSQR (reciprocal square root) (only 1)
DP3R (3-component dot product) (only 1)
SGER (greater-or-equal comparison) (only 2)
SGTR (greater-than comparison) (only 1)
TEX (normal 2D sampler reading the floating-point texture) (only 1)

Which means the shader is basically lots of additions and multiplications. Given the information above, would anyone know whether anything I have mentioned could make an ATI card (9800 or X800) drop into software mode here? It is mildly annoying, as the project uses NO NVIDIA-specific extensions, only one ATI extension (for the floating-point texture) and otherwise ARB. The project runs like a charm on the GeForce 6800 but not on the 9800. Even just ruling out everything I posted as correctly programmed for ATI cards would be appreciated, as I have now spent well over a week's work trying to figure out what went wrong. The latest ATI drivers (Catalyst 4.12) are being used, by the way.

Do the ATI cards have problems receiving float data through texcoords? Are the ATI interpolators given too much work? Is the ATI compiler failing to produce code for the fragment shader that stays within its resource limits? (I get NO errors through OpenGL or in any shader log.) It renders correctly on the ATI cards, but at one frame every 3-4 minutes instead of the GeForce's 15 frames per second. I would really like to run this on ATI cards before I try to get it published (and after that, open up all the sources for public viewing, of course!).

On a side note, I wonder why ATI does not produce any public tools to aid developers with GLSL (or shaders in general) on their cards, the way NVIDIA does (Cg and NVShaderPerf)...

Edit: Noticed it was source and not code tags on this forum.
Edit2: The code was incorrect; a closing tag for a comment was missing.

Thanks for reading this huge post, but I have gone blind from staring at the problem for so long now.

[Edited by - todderod on January 11, 2005 7:14:30 PM]
When you say 'no error logs' from the compiler, do you mean that when you query it the compiler says it compiled OK, or do you read back the logs and they don't indicate a drop to software mode? (Frankly, at 3 minutes per frame, that sounds like exactly what has happened.)

RenderMonkey is ATI's public tool for shader development, by the way. I do hope we see something like the performance tool from them too; Cg is a different beast entirely.

As for your direct questions:
- Texture coords are naturally floating point, so I doubt that's the problem.
- I doubt the interpolators are being overworked, certainly not enough to cause the kind of slowdown you are seeing.
- See my opening comment about the drop to software rendering.

I personally think you have exceeded the fragment program instruction count (and/or are doing something the ATI cards don't support or like, maybe branching and/or looping? Without the code, that's the best guess I can give you). The X800 has a better instruction count than the 9800 series, but it's still not as good as the NV40 series cards (or possibly even the NV30/35 ones, but don't quote me on that).
Thanks for the reply, _the_phantom_.

It seems it was so long since I last looked at RenderMonkey that I must have misunderstood its support for OpenGL.


What I mean by the error logs is that I check whether or not compilation and linking were successful; if they weren't, the log is read back and printed. Both report successful compilation and linking.

I am sorry that I was vague and left out what I meant to write about the fragment shader: I have deliberately avoided all loops and if-statements, since I know Radeon-class hardware has trouble with them, and all dynamic branching has been avoided, since you need roughly ~1000 spatially coherent pixels taking the same branch to gain any speed from it (NVIDIA states that, and I believe them since it makes sense given how the hardware is implemented), and I am nowhere near that.

"branching" is acheaved through float comparison and multiply gl_FragColor with the result to turn force fragment color to vec4(0.0 ,0.0 ,0.0 ,0.0)

I do, however, have two functions, one-liners actually, that are there to make a repeated test more readable.

I have only used RenderMonkey for 15 minutes now, but I am amused that it claims the vertex and fragment shaders compile and link successfully, yet says they will run in software because of an "Invalid sampler named...". I wonder how it can consider that a successful compile :) (and it is even stranger since it does work on the GeForce). Most likely a RenderMonkey quirk, but it is something new to work on. It would also be great if there were some way to see the resources the shader uses up (with the ATI compiler).

Edit: I see now that I forgot to mention another fact. The coordinates for the texture lookup are derived within the fragment shader from the fragment's position multiplied by the inverse screen resolution, but I don't see any problem with that either, since I know it works in my test applications even on ATI cards.
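Something along these lines, i.e. a sketch with the uniform names made up (the real shader's names differ):

// Sketch of the lookup-coordinate derivation only.
uniform sampler2D DepthTexture;
uniform float invScreenRes;    // 1.0 / screen resolution, set at shader creation

void main()
{
    vec2 uv = gl_FragCoord.xy * invScreenRes;   // window position -> [0,1] texture space
    gl_FragColor = texture2D(DepthTexture, uv); // the real shader does much more with this
}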
Hi,

I can't claim to give you the definitive answer to your problem, but when I am unable to solve a problem after two days or so, and I really, really have to solve it, I fall back to stone-age debugging.

So what I would do: comment out the shader line by line and locate the line/section that makes the graphics API fall back to software, or that otherwise slows it down.

You could use a binary-search (bisection) strategy, i.e. divide the shader into two parts, comment out one part and then the other, and see which part makes the shader slow. Then take that part, divide it in two again, and proceed recursively down to the section/line causing the slowdown.

And if you end up with the entire shader code commented out and performance is still slow, then maybe it's not the shader but the API handling, i.e. wrong states, bad shader setup, or so.
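For instance, something like this skeleton (made up purely to show the idea, not anything from your actual shader):

// Bisection skeleton: keep one half active, comment the other half out.
void main()
{
    vec4 result = vec4(1.0);

    // first half of the original shader body stays active:
    result *= vec4(0.5);            // stand-in for "part A"

    // second half commented out for this test run:
    // result *= partB(result);     // stand-in for "part B"

    gl_FragColor = result;   // still slow? part A is the culprit. fast again? look at part B.
}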


Maybe a dumb proposal for my first post here at GD, and a time-consuming one, but if you *really really* have to find the bug, maybe...
todderod;

You've just discovered one of the interesting oddities of GLSL: things can compile fine but still produce log output. It's often best to dump the log anyway, just in case something has failed in a way that causes a drop to software mode.

I'm guessing that sampler is the root of the problem. As to why it works on NV and not ATI, well, I couldn't tell you without the code.
However, it could be something as simple as typing 'Sampler2D' instead of 'sampler2D', as I don't know how strict NV's parser is.

In fact, if you want to rule out parser issues, check out the 3DLabs GLSL validator (they even have a compliance tester).
The compliance tester should give you a 100% result if you are running the latest (Catalyst 4.12) drivers on the ATI card (and around 50% for the NV drivers, going by the last test results I'd seen from people).
After setting up RenderMonkey and *correctly* setting up the environment for my shaders to work in, I unfortunately receive this:

OpenGL Preview Window: Compiling fragment shader API(OpenGL) /Effect Group 1/Effect1/Pass 0/Fragment Program/ ... success
OpenGL Preview Window: Compiling vertex shader API(OpenGL) /Effect Group 1/Effect1/Pass 0/Vertex Program/ ... success
OpenGL Preview Window: Linking program ... success
Link successful. The GLSL vertex shader will run in software due to the GLSL fragment shader running in software. The GLSL fragment shader will run in software.


Which I think is pretty sad, since the NVIDIA compiler manages to compile the code into something that (seemingly) would fit EASILY into a Radeon 9800 GPU. *Tries to persuade ATI to make a better compiler, or at least one that gives you more information about what it makes of your code.*


Debugging big shaders in GLSL is unfortunately very hard because of driver optimizations. If you don't use something, the driver removes it from the code entirely, and turning off optimizations yields shaders so bloated there is no way to execute them.
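(To make that concrete with a made-up fragment, not from the project: if I comment out the second half of the shader, I also have to route the first half's intermediate result to the output, or the optimizer strips the first half as dead code and the timing tells me nothing.)

// Made-up illustration of keeping the remaining half "live" while bisecting.
void main()
{
    vec3 intermediate = gl_FragCoord.xyz * 0.001;   // stands in for the first half's result

    // ...second half commented out for the test...

    gl_FragColor = vec4(intermediate, 1.0);   // without this write, the first half is dead code
}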

That is the reason why I showed which instructions my program compiles down to with the NVIDIA compiler: instructions that should in no way be offensive to cards as capable as the 9800 and X800.

I believe, however, that you two have inspired me to try a few other things to locate the offending part of the project.
Ah, I just noticed the 51-instructions figure, so you're right, it shouldn't be a problem to run it. My guess is that the ATI compiler isn't quite up to the job at the moment and is either running out of instructions or temporaries (the temporary-register problem is an annoying one right now, as it prevents skeletal animation from being performed), thus dumping back to software mode.
I'm hoping they will get this issue fixed soon, but then I'm also hoping for the framebuffer_object extension soon [grin]

I assume you are working for a company; if so, I'd look into becoming an ATI registered developer (if you aren't already) and then picking their brains about the problem.
The 3DLabs parser tests come out as "success" for both the fragment and the vertex program.

And I wish I were working for a company! This is merely an academic research project.
Ah, well, still bother dev rel about it; they might be able to help all the same (drop in the name of the place you are at, it might help).
You can simplify your shader a bit. Take, for example, the following lines:

fresnel = ( sin(a-b)*sin(a-b) / (sin(a+b)*sin(a+b)) ) + tan(a-b)*tan(a-b)*tan(a+b)*tan(a+b);
fresnel = 1.0; // (1.0-fresnel);

On NVIDIA the first line won't even get compiled (i.e. it is ignored completely), since you've overwritten it with the next line (I have no idea what the ATI compiler does; perhaps it's not that smart?).
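If the full Fresnel term ever goes back in, a cheaper stand-in worth trying is Schlick's approximation -- sketched below using the 1.0/1.33 air-to-water ratio from the commented-out code; this is not something from the thread's actual shaders:

// Schlick's approximation of the Fresnel term -- a sketch, not project code.
// cosTheta would be something like dot(TriNorm.xyz, -normalize(lightVector)).
float fresnelSchlick(float cosTheta)
{
    float r0 = (1.0 - 1.33) / (1.0 + 1.33);              // reflectance at normal incidence (air -> water)
    r0 = r0 * r0;                                         // ~0.02
    return r0 + (1.0 - r0) * pow(1.0 - cosTheta, 5.0);    // no acos/sin/tan/refract needed
}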
