Vertex shader slower than fixed function pipeline?


Hi, I've read that shaders are just as fast as the fixed-function pipeline, but I decided to run my own tests to be sure, and I found some interesting results. In short, fragment shaders are just as fast as the fixed-function pipeline, if not a little faster. However, vertex shaders seem significantly slower. I tested with the following code, using a display list to take upload bottlenecks out of the equation.
        #define ISURFACES 6000

        // DL, xg, basetex, glslshader, and getTime() are defined outside this snippet.
        glFinish();     // let pending GL work complete before starting the timer
        double t1 = getTime();

        glslshader->begin();

        static int first=1;
        glBindTexture ( GL_TEXTURE_2D, basetex );
        for (int i=0;i<ISURFACES;i++)
        {
                float quadwidth  = 20.0;
                float tx=0.0;
                float ty=0.0;
                float h = -10.0;
                glPushMatrix();
                glTranslatef(xg,0.0,0.0);

                if (first)
                {
                        // First iteration only: compile a 5x5 grid of textured quads into the display list.
                        glNewList(DL,GL_COMPILE);
                        glPushMatrix();
                        glBegin(GL_QUADS);
                        for (float x = -50.0 ; x < 50.0 ; x+= quadwidth)
                        {
                                tx=0.0;
                                for (float y = -50.0 ; y < 50.0 ; y+= quadwidth)
                                {
                                        glTexCoord2f(tx,ty);
                                        glVertex3f(x,h,y);
                                        glTexCoord2f(tx,ty+0.2);
                                        glVertex3f(x,h,y+quadwidth);
                                        glTexCoord2f(tx+0.2,ty+0.2);
                                        glVertex3f(x+quadwidth,h,y+quadwidth);
                                        glTexCoord2f(tx+0.2,ty);
                                        glVertex3f(x+quadwidth,h,y);
                                        tx+=0.2;
                                }
                                ty+=0.2;
                        }
                        glEnd();
                        glPopMatrix();
                        glEndList();
                        first=0;
                }
                else
                {
                        // Every later iteration replays the compiled list.
                        glCallList(DL);
                }
                glPopMatrix();
                xg+=100.0;      // place the next surface 100 units further along x
        }

        glslshader->end();
        glFinish();     // wait for the GPU to finish so the timing covers all rendering
        fprintf(stderr,"Time elapsed=%0.2f\n",(getTime() - t1) * 1000.0);



On my GF 6800 Ultra, the above 6000 primitives took 4 milliseconds to render with the fixed-function pipeline. With a vertex shader enabled, on the other hand, rendering took over 10 milliseconds! The GLSL shader looks like this (pretty simple):
void main()
{
       gl_TexCoord[0] = gl_MultiTexCoord0;    
       gl_Position = ftransform();

}

I have also tried Cg and ARB assembly shaders, all with the same results. Is there an optimization I am missing here?



Frankly, I don't care about the ATI vs. Nvidia argument. Can anyone shed some light on the original question?



The glFinish() is only there to take meaningful post-fill performance timing figures.

However, in this case using glFlush() instead of glFinish() makes no difference to the outcome (I have tried).

Quote:
Original post by lynedavid


The glFinish() is only there to take meaningful post-fill performance timing figures.

However, in this case using glFlush() instead of glFinish() makes no difference to the outcome (I have tried).


Why don't you try running it for a few seconds and measuring the average ms/frame?
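
A minimal sketch of that suggestion in C++, assuming the whole frame (drawing plus buffer swap) can be wrapped in one callable. The names FrameStats and measure are hypothetical and are not part of the original test program.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Accumulates per-frame durations and reports the mean frame time and frame rate.
struct FrameStats {
    double totalSeconds = 0.0;
    long   frames = 0;

    void add(double seconds) { totalSeconds += seconds; ++frames; }
    double averageMs() const { return frames ? totalSeconds * 1000.0 / frames : 0.0; }
    double fps() const { return totalSeconds > 0.0 ? frames / totalSeconds : 0.0; }
};

// Calls drawFrame repeatedly until at least runSeconds have elapsed.
// drawFrame is assumed to render one frame and swap buffers.
template <class DrawFrame>
FrameStats measure(DrawFrame drawFrame, double runSeconds) {
    using Clock = std::chrono::steady_clock;
    FrameStats stats;
    const auto start = Clock::now();
    auto prev = start;
    do {
        drawFrame();
        const auto now = Clock::now();
        stats.add(std::chrono::duration<double>(now - prev).count());
        prev = now;
    } while (std::chrono::duration<double>(prev - start).count() < runSeconds);
    return stats;
}
```

Averaging over a few seconds smooths out one-off costs such as the first-frame display list compilation, which a single timed frame would include.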


OK, I removed the glFinish() calls and measured the frame rate. Here are the results:

Vertex shader enabled: 65 fps
Vertex shader disabled: 170 fps

The fixed-function pipeline's advantage actually appears to increase in this test!



The test shown in the source code is very vertex-intensive; upload and fill rate have very little influence on the results.

Approximately 300,000 polygons are drawn per frame.
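
As a rough check on that figure, the sketch below counts the quads using the same loop bounds as the test code and assumes each quad is rendered as two triangles. The function names are hypothetical.

```cpp
#include <cassert>

// Counts the quads in one display list, using the same float loops as the test code
// (x and y run from -50 up to, but not including, 50 in steps of 20).
int quadsPerList(float extent = 50.0f, float quadWidth = 20.0f) {
    int quads = 0;
    for (float x = -extent; x < extent; x += quadWidth)
        for (float y = -extent; y < extent; y += quadWidth)
            ++quads;
    return quads;
}

// Assumes every quad is split into two triangles by the hardware.
long trianglesPerFrame(int surfaces) {
    return 2L * quadsPerList() * surfaces;
}
```

With 6000 surfaces this gives 25 quads per list, 150,000 quads, and 300,000 triangles per frame.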

Does performance increase if you just bind the shader once instead of every frame (assuming your shader's begin() and end() functions bind and unbind the shader)?

OK, the real answer here is: it depends on the shader. While it's possible to write an efficient shader, it's also possible to write one that is horribly inefficient. The driver internally builds shaders to implement fixed-function T&L (this is true for both NVIDIA and ATI for the last two generations of chips), so the comparison is really between driver-built shaders and user-built shaders.

The driver writers get paid to make sure the fixed function emulation shaders are very fast!

Quote:
Original post by Kalidor
Does performance increase if you just bind the shader once instead of every frame (assuming your shader's begin() and end() functions bind and unbind the shader)?


Not noticeably.

Quote:
Original post by gold
OK, the real answer here is: it depends on the shader. While it's possible to write an efficient shader, it's also possible to write one that is horribly inefficient. The driver internally builds shaders to implement fixed-function T&L (this is true for both NVIDIA and ATI for the last two generations of chips), so the comparison is really between driver-built shaders and user-built shaders.

The driver writers get paid to make sure the fixed function emulation shaders are very fast!


As I understand it, the NV40 does not have a separate implementation for the fixed-function pipeline, though this used to be the case. Although, no one really knows!

The vertex shader I am using in this test is very simple. I am even using the special ftransform macro, which is supposed to accelerate the modelview*projection matrix vertex transformation (somehow):


void main()
{
gl_TexCoord[0] = gl_MultiTexCoord0;
gl_Position = ftransform();

}

Quote:
Original post by lynedavid
I am even using the special ftransform macro, which is supposed to accelerate the modelview*projection matrix vertex transformation (somehow)


The ftransform() function is not meant to accelerate the transform. It is there to ensure that if, for example, you're performing a multipass algorithm (with one pass using the FFP and one using the PP), both passes arrive at the same transformed vertex.

From the GLSL specification:

This function will ensure that the incoming vertex
value will be transformed in a way that produces
exactly the same result as would be produced by
OpenGL's fixed functionality transform.

Quote:
Original post by Myopic Rhino
Quote:
Original post by lynedavid
Although, no one really knows!
gold knows.

This is pretty perplexing. Your vertex shader is about as simple as it gets, so it shouldn't be slower than fixed function. Have you tried out your test program on other hardware?


A GeForce 7800 GTX 512 MB card (at work). It is a little faster in both cases, but the same relationship still holds.

Haven't tried an ATI card, though. I've just given my Radeon 9800 Pro to one of my friends.

Hi,

FFP will always be faster. That's because it's FIXED, so it's optimized for certain processes but won't give you the flexibility of a VS.

On the other hand, if you write a shader and compile it as a VS 1.1 shader, as a VS 2.0 shader, and as a VS 3.0 shader, the 1.1 version will run faster. That's because each newer shader version allows for more flexible execution; e.g., VS 1.1 has no loops or conditionals, while VS 2.0 and 3.0 accept them.

Maybe you should check the nVidia and ATI programmer sites for more info about this. Anyway, I wouldn't bother. VS and PS are far superior to the FFP, and the FFP will be dropped in future hardware, so do yourself a favor and drop it from your game or engine.

Luck!
Guimo





Quote:
Original post by Guimo
FFP will always be faster. That's because it's FIXED, so it's optimized for certain processes but won't give you the flexibility of a VS.
The fixed-function pipeline hasn't existed in hardware for at least the last two generations. It's implemented using shaders in the programmable pipeline.
Quote:
Original post by Guimo
Maybe you should check the nVidia and ATI programmer sites for more info about this.
He's already getting support from ATI and nVidia engineers in this thread; why go elsewhere? [grin]

Quote:
Original post by Myopic Rhino
He's already getting support from ATI and nVidia engineers in this thread; why go elsewhere? [grin]

I don't mean to digress from the purpose of the thread, but I (and, I'm sure, a lot of other people) am quite in the dark about who you are referring to...

If this is classified information [grin], or this post isn't appropriate here, please PM me or say so and I'll delete it at once.

gold, I believe, is Michael Gold, who has probably been involved with GL for more than 10 years (with SGI and now NVIDIA). Not implying he's an old fart or anything.

On a GeForce 6800/7800, your vertex shader generates the following *native* code for the GPU:


401F9C6C 01CD400D 8106C0C3 60411F80 DP4 o[HPOS].x, v[OPOS], c[212];
401F9C6C 01CD500D 8106C0C3 60409F80 DP4 o[HPOS].y, v[OPOS], c[213];
401F9C6C 01CD600D 8106C0C3 60405F80 DP4 o[HPOS].z, v[OPOS], c[214];
401F9C6C 01CD700D 8106C0C3 60403F80 DP4 o[HPOS].w, v[OPOS], c[215];
401F9C6C 00400808 0106C083 60419F9D MOV o[TEX0].xy, v[TEX0].xyxx;


(The hexcodes are the 128-bit instruction words.) The "fixed-function" path produces exactly the same code, but with different constant register indexes.
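
To make the listing concrete, here is a CPU-side sketch of what the four DP4 instructions compute: each one takes the dot product of the object-space position with one row of the combined modelview-projection matrix (held in four consecutive constant registers) and writes one component of the output position. The types and function names are hypothetical, and the row-per-register layout is an assumption read off the listing.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;   // four rows, one per constant register

// One DP4 instruction: a 4-component dot product.
float dp4(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// The four DP4s together: output component i is dot(position, row i).
Vec4 transformPosition(const Mat4& mvpRows, const Vec4& objectPos) {
    return { dp4(objectPos, mvpRows[0]),
             dp4(objectPos, mvpRows[1]),
             dp4(objectPos, mvpRows[2]),
             dp4(objectPos, mvpRows[3]) };
}
```

The remaining MOV simply copies the incoming texture coordinate to the output, so there is no per-vertex work left for a hand-written shader to save.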

If you want to do very accurate timing on the GPU, you can use the EXT_timer_query extension. It defines the following enums and functions:


#define GL_TIME_ELAPSED_EXT 0x88BF

typedef __int64 GLint64EXT;
typedef unsigned __int64 GLuint64EXT;

void glGetQueryObjecti64vEXT(GLuint id, GLenum pname, GLint64EXT *params);
void glGetQueryObjectui64vEXT(GLuint id, GLenum pname, GLuint64EXT *params);


Use the glBeginQuery/glEndQuery mechanism with the GL_TIME_ELAPSED_EXT target to specify a timing interval. A call to glGetQueryObjectui64vEXT with <pname> GL_QUERY_RESULT returns the elapsed time in nanoseconds.
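
The sequence described above might be sketched as follows. So that the sketch compiles without GL headers, the entry points are passed in as a table of function pointers; TimerQueryApi and gpuMilliseconds are hypothetical names, and the plain C++ types stand in for GLuint, GLenum, and GLuint64EXT. In a real program the pointers would be the actual glGenQueries, glBeginQuery, glEndQuery, glGetQueryObjectui64vEXT, and glDeleteQueries entry points.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned int kTimeElapsedExt = 0x88BF;  // GL_TIME_ELAPSED_EXT (from the post)
constexpr unsigned int kQueryResult    = 0x8866;  // GL_QUERY_RESULT

// Hypothetical table of the GL entry points this sketch needs.
struct TimerQueryApi {
    void (*genQueries)(int n, unsigned int* ids);                      // glGenQueries
    void (*beginQuery)(unsigned int target, unsigned int id);          // glBeginQuery
    void (*endQuery)(unsigned int target);                             // glEndQuery
    void (*getQueryObjectui64v)(unsigned int id, unsigned int pname,
                                std::uint64_t* out);                   // glGetQueryObjectui64vEXT
    void (*deleteQueries)(int n, const unsigned int* ids);             // glDeleteQueries
};

// Brackets draw() with a time-elapsed query and returns the GPU time in milliseconds.
template <class Draw>
double gpuMilliseconds(const TimerQueryApi& gl, Draw draw) {
    unsigned int id = 0;
    gl.genQueries(1, &id);
    gl.beginQuery(kTimeElapsedExt, id);
    draw();                                  // the GL commands being timed
    gl.endQuery(kTimeElapsedExt);
    std::uint64_t nanoseconds = 0;
    // Asking for the query result waits until the GPU has finished the interval.
    gl.getQueryObjectui64v(id, kQueryResult, &nanoseconds);
    gl.deleteQueries(1, &id);
    return nanoseconds / 1.0e6;
}
```

Because the result is measured on the GPU itself, this avoids needing glFinish() around the timed section.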

BTW, in case you're curious, your vertex shader produces the following native code on Radeon X800/X1800.


00100201 00D10002 00D10001 00D10005 DP4 o[0].x, c[0], v[0];
00200201 00D10022 00D10001 00D10005 DP4 o[0].y, c[1], v[0];
00400201 00D10042 00D10001 00D10005 DP4 o[0].z, c[2], v[0];
00800201 00D10062 00D10001 00D10005 DP4 o[0].w, c[3], v[0];
00F02203 01648000 01248000 01248005 MOV o[1], R0.0001;
00304203 00D10041 01248041 01248045 MOV o[2].xy, v[2];


The ATI driver inserts an extra instruction to move (0,0,0,1) into the primary color interpolant, but otherwise, it's the same native instruction sequence that Nvidia hardware uses.
