OpenGL ASM Experiment


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

13 replies to this topic

#1 dxCUDA   Members   -  Reputation: 141

Posted 27 May 2012 - 03:08 PM

I'm experimenting with x64 asm to learn how code operates at a low level, and trying to work out where appropriately used asm actually speeds up a project.

My project is x64, using MASM in Visual Studio 2010.

Currently I am trying to replace the basic glClear, glLoadIdentity, etc. calls with asm to understand how to interface between C++ code and asm.

I'm following the asm examples over at NeHe, but I keep getting an access violation as soon as I introduce 'push' into the code.

This is the code asm-side:
include gl.inc
include glu.inc

.data
_45d0 equ 40468000h ;45.0
_45d1   equ 0
_01d0 equ 1069128089  
_01d1   equ -1717986918 ;0.1
_100d0 equ 1079574528
_100d1  equ 0  ;100.0
_1d0 equ 1072693248
_1d1 equ 0   ;1.0
_05 equ 1056964608  ; 0.5
_1 equ 1065353216  ; 1.0
_m1 equ -1082130432 ;-1.0
_3 equ 1077936128  ; 3.0
_m15 equ -1077936128 ;-1.5
_m6 equ -1061158912 ;-6.0

.code

ASMrender proc
display:

push   GL_COLOR_BUFFER_BIT
call glClear
call glLoadIdentity
call glEnd

push _m15
push 0
push _m6
call glTranslatef

xor eax,eax

ret

ASMrender endp
end


Over in C++ I am just declaring the standard extern "C" void ASMrender();
and then calling ASMrender() in the main render loop.

Here's a screenshot of some of the action where it all goes wrong.
[Screenshot not preserved]

#2 Martins Mozeiko   Crossbones+   -  Reputation: 1413

Posted 27 May 2012 - 03:38 PM

Replacing OpenGL calls from C with OpenGL calls from assembler will gain you nothing in terms of performance.

I believe the 64-bit Windows calling convention requires the first four floating-point arguments to go in SSE2 (XMM) registers, not on the stack: http://msdn.microsof...y/zthk2dkh.aspx
The same goes for integer arguments: you should pass the argument to glClear in the RCX register, not on the stack.
You would be better off writing the code in C and checking how it looks in the disassembler (put a breakpoint on the function call and select the Disassembly view from the Debug menu, Alt+8). You'll see that Visual Studio generates pretty optimal code in this case.

Also, you cannot call glEnd without glBegin.

Edited by Martins Mozeiko, 27 May 2012 - 03:38 PM.


#3 Erik Rufelt   Crossbones+   -  Reputation: 3139

Posted 27 May 2012 - 04:27 PM

Also, from http://msdn.microsof...y/ms235286.aspx:

The caller is responsible for allocating space for parameters to the callee, and must always allocate sufficient space for the 4 register parameters, even if the callee doesn’t have that many parameters.


So something like this:
mov rcx, GL_COLOR_BUFFER_BIT ; parameter
sub rsp, 32 ; shadow space for 4 registers
call glClear
add rsp, 32 ; pop register shadows
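
A small sketch of the arithmetic behind that sub rsp value, written in C++ since that is the calling side in this thread. The helper name and its parameters are made up for illustration. It assumes the Windows x64 convention quoted above: at least four 8-byte parameter slots, and rsp 16-byte aligned at every call instruction. The value in the snippet is decimal 32 (20h); inside a function that was itself entered with call and has pushed nothing, the return address leaves rsp 8 bytes off alignment, so 40 (28h) is the amount that keeps the stack aligned.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): how many bytes a Win64 function
// should subtract from rsp before calling other functions.
//   max_call_args - largest argument count among the functions it calls
//   pushed_regs   - number of 8-byte registers it pushed in its prologue
std::size_t win64_frame_bytes(std::size_t max_call_args,
                              std::size_t pushed_regs = 0) {
    // The caller always reserves at least four 8-byte parameter slots.
    const std::size_t slots = max_call_args < 4 ? 4 : max_call_args;
    std::size_t bytes = slots * 8;

    // On entry rsp is 8 bytes past a 16-byte boundary (the return address);
    // every push moves it another 8 bytes.
    const std::size_t misalign = (8 + pushed_regs * 8 + bytes) % 16;
    if (misalign != 0) {
        bytes += 16 - misalign;  // pad so rsp is 16-byte aligned at the call
    }
    return bytes;
}
```

For a leaf-style wrapper that only calls one-argument functions such as glClear, win64_frame_bytes(1) gives 40, i.e. sub rsp, 28h.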

Edited by Erik Rufelt, 27 May 2012 - 04:29 PM.


#4 dxCUDA   Members   -  Reputation: 141

Posted 27 May 2012 - 04:49 PM

Thanks Rufelt,

I realize this is a somewhat pointless effort in terms of optimization and yadda yadda. I'd rather just learn how asm works, and once I am familiar with it I can focus on some decent optimizations with SSE (or so I've heard).

So the asm now looks something like this, however the screen is black, no white triangle:
sub rsp, 32h
mov ecx, GL_COLOR_BUFFER_BIT or GL_DEPTH_BUFFER_BIT
call glClear
add rsp, 32h

call glLoadIdentity

sub rsp, 20h
mov rdx, _m15
mov rcx, 0
mov r8d, _m6
call glTranslatef
add rsp, 20h

sub rsp, 18h
mov edx, GL_TRIANGLES
call glBegin
add rsp, 18h

sub rsp, 20h
mov edx, 0
mov ecx, _1
mov r8d, 0
call glVertex3f
add rsp, 20h

sub rsp, 20h
mov edx, _m1
mov ecx, _m1
mov r8d, 0
call glVertex3f
add rsp, 20h

sub rsp, 20h
mov edx, _1
mov ecx, _m1
mov r8d, 0
call glVertex3f
add rsp, 20h

call glEnd

ret

Edited by dxCUDA, 27 May 2012 - 04:50 PM.


#5 Erik Rufelt   Crossbones+   -  Reputation: 3139

Posted 27 May 2012 - 05:10 PM

mov edx, GL_TRIANGLES
That should be ecx.

#6 dxCUDA   Members   -  Reputation: 141

Posted 27 May 2012 - 06:57 PM

Cheers for that. It still doesn't seem to be working; all I get is a black screen. I have a feeling it might be the data in _m1, _m15, etc. I'll keep digging.

#7 dxCUDA   Members   -  Reputation: 141

Posted 27 May 2012 - 07:21 PM

Well, I checked the value of _m15, which equates to 0BFC00000h, so I created _neg15 dd -1.5f to make the value clearer, and this was also 0BFC00000h.

I'm entirely certain that everything else is working, because if I keep just glClear and glLoadIdentity in the asm function and draw the triangle on the C++ side, it works. So it must be incorrect data passed from asm to the gl procedures.
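
For anyone checking these constants by hand, here is a minimal C++ sketch that returns the raw bit pattern of a float, both as an unsigned value (the 0BFC00000h form) and as the signed decimal used by the equ constants in the first post. The helper names are made up, and it assumes float is a 32-bit IEEE-754 single.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the thread): raw IEEE-754 bits of a float.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the bytes, no conversion
    return bits;
}

// The same 32 bits read as a signed integer, which is how the decimal
// equ constants such as _m15 are written.
std::int32_t float_bits_signed(float value) {
    std::int32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

float_bits(-1.5f) is 0xBFC00000 and float_bits_signed(-1.5f) is -1077936128, matching _m15, so the constants themselves are fine.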

#8 Martins Mozeiko   Crossbones+   -  Reputation: 1413

Posted 27 May 2012 - 11:32 PM

glTranslatef's arguments are floats. The first four float arguments of a function are passed in SSE2 (XMM) registers (not rdx/rcx/r8d), as described in the links Erik and I posted above.

And if all you want is to optimize using SSE instructions, there is no need to write assembly. You can simply use intrinsic functions. That will greatly simplify your life and give the compiler more chance to optimize the code (inlining and other stuff). Another advantage is that the same code will work for a 32-bit target, so there is no need to write the assembly twice (for 32-bit and 64-bit).
http://msdn.microsof...y/y0dh78ez.aspx
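
As a rough illustration of what intrinsics look like in practice, here is a minimal C++ sketch that adds four floats at once. The function name add4 is made up for the example; _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps are standard SSE intrinsics from <xmmintrin.h>. It assumes an x86 or x64 compiler, and the same source builds for both targets.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 compilers)

// Hypothetical example function: out[i] = a[i] + b[i] for i = 0..3,
// done with a single packed SSE addition.
void add4(const float* a, const float* b, float* out) {
    const __m128 va = _mm_loadu_ps(a);       // unaligned load of 4 floats
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // packed add, unaligned store
}
```

The compiler picks the XMM registers and can inline add4 into the caller, which hand-written assembly in a separate .asm file does not allow.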

And I'll repeat myself: to spot mistakes in your assembly more easily for such simple code, write the same code in C, inspect the generated assembly (press Alt+8 while debugging), and compare it with the assembly you wrote.

Edited by Martins Mozeiko, 27 May 2012 - 11:34 PM.


#9 dxCUDA   Members   -  Reputation: 141

Posted 28 May 2012 - 05:03 AM

Thanks Mozeiko

That's a good direction to take after this learning exercise. I didn't quite understand the earlier post about the SSE2 registers at first, but I do now, and it works.

Would asm be viable in a runtime situation where we have an if-else statement? We could do the branch in asm and avoid the doubled call each frame, as it would remove half the calling. I could see it being better than C++ in such a situation, or would I still be wrong?

#10 dxCUDA   Members   -  Reputation: 141

Posted 28 May 2012 - 05:56 AM

It works now, thanks for the help, even though the triangle is a little bit weird. Here's the code for any future searches:

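include gl.inc
include glu.inc


.data

_neg15 dd -1.5f
_neg6  dd -6.0f
_pos1  dd 1.0f
_neg1  dd -1.0f
_20    dd 20.0f


.code

ASMrender proc

; Reserve the 32 bytes (20h) of shadow space every callee expects, plus
; 8 bytes so rsp is 16-byte aligned at each call (on entry the return
; address leaves rsp 8 bytes off alignment). One adjustment covers all
; the calls below, including glLoadIdentity and glEnd.
sub rsp, 28h

mov ecx, GL_COLOR_BUFFER_BIT or GL_DEPTH_BUFFER_BIT
call glClear

call glLoadIdentity

; Float arguments go in xmm0, xmm1, xmm2 in argument order (x, y, z),
; so as written this translates by (-1.0, 0.0, -1.5).
movss xmm2, dword ptr [_neg15]   ; z = -1.5
xorps xmm1, xmm1                 ; y =  0.0
movss xmm0, dword ptr [_neg1]    ; x = -1.0
call glTranslatef

mov ecx, GL_TRIANGLES            ; integer argument goes in ecx
call glBegin

xorps xmm2, xmm2                 ; z =  0.0
movss xmm1, dword ptr [_pos1]    ; y =  1.0
xorps xmm0, xmm0                 ; x =  0.0
call glVertex3f

movss xmm2, dword ptr [_neg1]    ; z = -1.0
movss xmm1, dword ptr [_neg1]    ; y = -1.0
xorps xmm0, xmm0                 ; x =  0.0
call glVertex3f

movss xmm2, dword ptr [_pos1]    ; z =  1.0
movss xmm1, dword ptr [_neg1]    ; y = -1.0
xorps xmm0, xmm0                 ; x =  0.0
call glVertex3f

call glEnd

add rsp, 28h                     ; release the shadow space
ret

ASMrender endp
end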

Edited by dxCUDA, 28 May 2012 - 05:57 AM.


#11 Martins Mozeiko   Crossbones+   -  Reputation: 1413

Posted 28 May 2012 - 02:10 PM

Would asm be viable in a runtime situation where we have an if-else statement? We could do the branch in asm and avoid the doubled call each frame, as it would remove half the calling. I could see it being better than C++ in such a situation, or would I still be wrong?

It doesn't matter. Any reasonable optimizing C/C++ compiler (including MSVC and GCC) will generate good code for one simple if-else branch.

Edited by Martins Mozeiko, 28 May 2012 - 02:11 PM.


#12 BornToCode   Members   -  Reputation: 912

Posted 05 June 2012 - 03:24 PM

I will never understand this. People wasting time writing things like that in asm. The compiler will do a way better job than you.

#13 larspensjo   Members   -  Reputation: 1526

Posted 06 June 2012 - 03:45 AM

It is interesting to have some understanding of assembler. It's like studying spoken languages, where learning Latin can be interesting. But most people don't have the time for it.

To optimize, you can almost always get better results spending the effort on algorithms instead.

In the case of OpenGL, trying to optimize while using legacy OpenGL (as OP is doing) is like taking a modern car, replacing the engine with a steam engine, and trying to optimize the steam engine using different qualities of coal.
Current project: Ephenation.
Sharing OpenGL experiences: http://ephenationopengl.blogspot.com/

#14 maxgpgpu   Crossbones+   -  Reputation: 279

Posted 08 June 2012 - 06:32 PM

Hello dxCUDA. I just noticed your thread. Hey, I have two 64-bit assembly language files in my 3D engine. Most important for you and me, one of them contains a 4x4 matrix multiply function (of f64 AKA double elements), plus a function that transforms my vertices from local to world coordinates. My vertices contain:

1 position
1 zenith vector (normal vector)
1 north vector (tangent vector)
1 east vector (bi-tangent vector)
1 texture-coordinate (just moved from local-vertex to world-vertex)
1 16-bit field of option bits (ditto)
1 texture-ID field (ditto)
1 matrix-ID field (ditto)
1 RGBA color (ditto)

So the function multiplies the input transformation matrix times each position, zenith-vector, north-vector and east-vector in the local-coordinate vertex structure, then stores the result in two world-coordinate vertex structures (one contains f64 == double-precision position/vectors, and the other contains f32 == single-precision position/vectors). After the transformation my engine transfers the 32-bit structure to the GPU, then calls glDrawElements().
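
To make the description above concrete, here is a plain C++ sketch of that kind of local-to-world transform. It is not the engine's code: the type names, the row-major matrix layout and the function names are all assumptions for illustration. It transforms a position (with translation) and a direction vector (without), then narrows the f64 result to f32.

```cpp
#include <array>

// Assumed layout: row-major 4x4, element (row, col) at index row * 4 + col.
using Mat4 = std::array<double, 16>;

struct Vec3d { double x, y, z; };  // f64 world-space result
struct Vec3f { float x, y, z; };   // f32 copy, the kind sent to the GPU

// Position: treated as (x, y, z, 1), so the translation column applies.
Vec3d transform_point(const Mat4& m, const Vec3d& p) {
    return {
        m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3],
        m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7],
        m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11],
    };
}

// Direction (zenith/north/east vectors): treated as (x, y, z, 0), so no
// translation. Using the same matrix for normals assumes it holds only
// rotation and uniform scale.
Vec3d transform_vector(const Mat4& m, const Vec3d& v) {
    return {
        m[0] * v.x + m[1] * v.y + m[2]  * v.z,
        m[4] * v.x + m[5] * v.y + m[6]  * v.z,
        m[8] * v.x + m[9] * v.y + m[10] * v.z,
    };
}

// Narrow the double-precision result to single precision.
Vec3f to_f32(const Vec3d& v) {
    return { static_cast<float>(v.x),
             static_cast<float>(v.y),
             static_cast<float>(v.z) };
}
```

A scalar version like this is also a handy baseline to benchmark any hand-written SIMD version against.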

Anyway, I have not decided whether to make my code open-source yet, but I'm willing to send it to you for education purposes. As I recall, it has quite a few comments at the top about how 64-bit function calls work (where the arguments are, what needs to be preserved, etc). I also have 32-bit versions of the same functions in another file (since I can compile both 32-bit and 64-bit versions of my engine). I also have C code for the matrix multiply, and somewhere in my engine is equivalent C code for the vertex transformation function, so you can compare if you wish.

I was absolutely blown away when I benchmarked these routines. It takes only a few nanoseconds to transform the position and three vectors in each vertex from local to world coordinates, and save it in both 64-bit and 32-bit form (the local-coords input is 64-bit form)... plus transfer the other fields too. And that is only running on one of my 8 cores so far!

PS: I don't think I have a 64-bit version in MASM yet, because my windoze computer is still windoze XP 64-bit edition, which DOES NOT support 16 SIMD registers and DOES NOT support the wider AVX/ymm registers (which hold and process four f64 values at once). What I definitely do have is 64-bit version in GAS (linux syntax), as well as 32-bit versions in GAS and MASM.

If you're interested, let me know. And we can chat on skype if you wish to pick my brain about this topic.



