
#4956141 Is rendering to a texture faster than rendering to the screen?

Posted by clb on 05 July 2012 - 05:16 PM

Render operations targeting an off-screen texture are not any different (on any sane platform) from rendering to the backbuffer that's going to be shown on the screen. However, there are a few things to consider:
- the backbuffer is sometimes constrained to be the size of the native display resolution. Rendering to a smaller offscreen surface first and then blitting that to the main backbuffer can reduce fillrate requirements. This approach was used e.g. in the nVidia Vulcan demo, which rendered the particles to a smaller offscreen render target to reduce the fillrate & overdraw cost of the particles.
- for UI, rendering to an offscreen surface first can give performance improvements, as it can function as a form of batching & caching. Since an offscreen render target's contents persist between frames, you can draw multiple UI elements into it once and reuse the texture as long as the image doesn't change. One approach is to store an offscreen cache texture for each individual window in your UI system and re-render a window's contents into its cache only when the window content changes; each frame, you then only need to draw the UI cache textures to the main backbuffer. This reduces render call counts and avoids repeatedly rendering identical content every frame.
- rendering to a screen-sized offscreen texture first and then blitting that texture to the main backbuffer can also dramatically hurt performance. E.g. Tegra 2 Android devices only have fillrate for about 2.5 fullscreen blits per frame when targeting 60fps. That is not much, and rendering algorithms that require temporary offscreen surfaces (glow/bloom/HDR) can often be a no-go due to such device fillrate restrictions.
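The per-window caching idea in the second bullet can be sketched as below. This is a minimal, GL-free sketch of the bookkeeping only; `RenderTarget` and the function names are hypothetical stand-ins, and the actual render-to-texture and blit calls are stubbed out as comments.

```cpp
#include <cassert>
#include <vector>

// Hypothetical render-target handle; a real engine would wrap an FBO + texture.
struct RenderTarget { unsigned textureId = 0; };

struct Window
{
    RenderTarget cache;   // offscreen surface holding this window's last-drawn contents
    bool dirty = true;    // set whenever the window's content changes
    int timesRedrawn = 0; // for illustration only

    // Expensive path: re-render all of the window's widgets into the cache texture.
    void RedrawToCache()
    {
        // ...bind cache as render target, draw the widgets...
        ++timesRedrawn;
        dirty = false;
    }
};

// Per frame: re-render only the dirty windows, then composite every cache
// texture to the main backbuffer (one cheap draw per window).
void RenderUI(std::vector<Window>& windows)
{
    for (Window& w : windows)
        if (w.dirty)
            w.RedrawToCache();
    // ...then blit each w.cache to the backbuffer...
}
```

An unchanged window then costs only one textured quad per frame, no matter how complex its contents are.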

#4955082 Using both TCP and UDP?

Posted by clb on 02 July 2012 - 04:24 PM

Sure, it's possible to simultaneously use both UDP and TCP.

We use that technique, but mostly for compatibility fallback purposes: clients primarily connect through UDP, but if firewalls/NATs don't allow it, they fall back to TCP (and as a last resort, to TCP port 80). Our networking uses the kNet library, which abstracts the underlying TCP/UDP choice and exposes a message-based API.

To send messages larger than the IP MTU, we manually fragment the messages into several UDP datagrams and reassemble them on the receiving end. It's doable, and an alternative to using TCP.

If you're looking to use two socket connections simultaneously, one TCP and one UDP (the same applies if you use two TCP ports), perhaps the first problem is that you won't have ordering (or, when TCP is involved, atomicity) guarantees across the two ports. E.g. sending to port A, then to port B, might be seen on the other end as receiving (even partially!) first from port B, then (again partially!) from port A, then from port B, port A, and so on.

#4954857 Optimal representation of direction in 3D?

Posted by clb on 02 July 2012 - 04:35 AM

The problems with using normalized 3D vectors to represent direction are:
- takes 3xfloats when only two are needed.
- after manipulating the direction, the result can drift to be non-normalized.

A more compact representation of directions that doesn't suffer from the above problems is spherical coordinates: https://github.com/juj/MathGeoLib/blob/master/src/Math/float3.h#L284 . These are a pair of scalars (azimuth, inclination), rotation angles that represent the direction, analogous to polar coordinates in the 2D case. This reduces the storage to 2xfloats, and the values cannot drift into invalid non-directions. Spherical coordinates are very effective e.g. when streaming data through a network, to minimize bandwidth requirements (e.g. kNet uses these).

The problems with spherical coordinates are:
- you can't do arithmetic directly on them without converting them to vector form first. E.g. rotating a direction represented by spherical coordinates by a quaternion/matrix is convoluted.
- conversion between spherical<->euclidean is much slower than re-normalizing a direction vector.

Because you'd have to do the costly spherical<->euclidean conversion at every point where you do math on them, it's far easier and faster to just use normalized euclidean direction vectors, and re-normalize at key points to avoid drift.
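A minimal sketch of the two conversions, under one possible angle convention (mine, not necessarily the one MathGeoLib uses): azimuth is measured in the XZ plane from +Z, inclination is the angle towards +Y.

```cpp
#include <cassert>
#include <cmath>

struct Dir { float x, y, z; };                    // assumed normalized
struct Spherical { float azimuth, inclination; }; // two scalars instead of three

// Direction -> (azimuth, inclination).
Spherical ToSpherical(Dir d)
{
    return { std::atan2(d.x, d.z), std::asin(d.y) };
}

// (azimuth, inclination) -> direction; always yields a unit vector, so this
// representation cannot drift into a non-direction.
Dir ToDirection(Spherical s)
{
    float c = std::cos(s.inclination);
    return { c * std::sin(s.azimuth), std::sin(s.inclination), c * std::cos(s.azimuth) };
}
```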

#4954480 Software renderer: write 4 colors at once without reading old pixels, how?

Posted by clb on 01 July 2012 - 04:25 AM

In AVX, there's the _mm_maskstore_ps/vmaskmovps instruction. In SSE2, there's the _mm_maskmoveu_si128/maskmovdqu instruction, but note that it belongs to the class of byte-wide integer instructions, so it can cause a few cycles of stall in the pipeline (profile!) if a transition from float mode to int mode occurs.

If you are doing a manual load-blend-store, there are the _mm_blend_ps/blendps and _mm_blendv_ps/blendvps instructions in SSE4.1, which can aid the process, although that kind of load followed by a store can have a large performance impact. For hardware earlier than SSE4.1, the same blend between registers can be achieved with a sequence of and+andnot+or instructions.
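For illustration, the pre-SSE4.1 and+andnot+or blend mentioned above could look like this (a sketch; the function names are mine):

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2

// result = (mask & a) | (~mask & b): each 32-bit lane of the mask must be
// all-ones (take the lane from a) or all-zeros (take the lane from b).
static inline __m128 BlendSSE2(__m128 a, __m128 b, __m128 mask)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Helper (my own, for illustration) to build a per-lane mask from booleans.
static inline __m128 LaneMask(bool m0, bool m1, bool m2, bool m3)
{
    return _mm_castsi128_ps(
        _mm_set_epi32(m3 ? -1 : 0, m2 ? -1 : 0, m1 ? -1 : 0, m0 ? -1 : 0));
}
```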

I recommend the Intel Intrinsic Guide, which has the instructions in an easily searchable format.

#4954115 Cross-Platform and OpenGL ES

Posted by clb on 29 June 2012 - 04:11 PM

I am developing a cross-platform graphics API abstraction that runs against D3D11, OpenGL3, GLES2 and WebGL (gfxapi; see its platform coverage matrix). OpenGL3, GLES2 and WebGL are so close to each other that one can comfortably share the codebase for all of them. In my codebase there are minimal #ifdef cases for GLES2, mostly related to added checks for potentially unsupported features (e.g. non-pow2 mipmapping).

NaCl, iOS and Android all use GLES2, so you can get to a lot of platforms with the same API.

#4953470 VBO what does GPU prefers?

Posted by clb on 27 June 2012 - 04:03 PM

The first example is called an 'interleaved' format, and the second example is called a 'planar' format.

I am under the belief that interleaved formats are always faster, and I can't remember a single source that recommends a planar vertex buffer layout for GPU performance reasons. There are tons of sources that recommend interleaved data, e.g. Apple's OpenGL ES Best Practices documentation. On every platform with a GPU chip I have programmed for (PC, PSP, Nintendo DS, Android, iOS ..), interleaved data has been preferred.
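As an illustration, an interleaved layout is just a plain array-of-structs; the particular attribute set here is an example, not a recommendation:

```cpp
#include <cassert>
#include <cstddef>

// Interleaved layout: one array of structs, so all attributes of a vertex sit
// next to each other in memory and the GPU fetches a vertex in one contiguous
// read. With glVertexAttribPointer, the stride would be sizeof(Vertex) and the
// per-attribute offsets come from offsetof().
struct Vertex
{
    float pos[3];    // offset 0
    float normal[3]; // offset 12
    float uv[2];     // offset 24
};
// A planar layout would instead keep separate arrays:
// float positions[N*3]; float normals[N*3]; float uvs[N*2];
```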

There is one (possibly slight) benefit to planar data: it compresses better on disk than interleaved data. It is a common data compression technique to group similar data together, since that lets compressors find redundancy more easily. E.g. the crunch library takes advantage of this effect in the context of textures, and reorders its internal on-disk memory layout to planar before compression.

#4953078 Test point inside triangle, 3D space

Posted by clb on 26 June 2012 - 11:47 AM

MathGeoLib contains an implementation of a test for whether a 3D triangle contains a given point. See Triangle::Contains. To get to the implementation, click on the small [x lines of code] link at the top of the page. The code is adapted from Christer Ericson's Real-Time Collision Detection.
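For reference, the standard barycentric-coordinate approach (the same family of techniques as in Ericson's book) can be sketched as follows. This is my own condensed version, not MathGeoLib's actual code, and it assumes the point already lies on the triangle's plane; a full test would first check the point's distance to the plane.

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };
static Vec3 Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static double Dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Computes the barycentric coordinates of p with respect to triangle abc and
// checks that they are all non-negative.
bool TriangleContains(Vec3 a, Vec3 b, Vec3 c, Vec3 p)
{
    Vec3 v0 = Sub(b, a), v1 = Sub(c, a), v2 = Sub(p, a);
    double d00 = Dot(v0, v0), d01 = Dot(v0, v1), d11 = Dot(v1, v1);
    double d20 = Dot(v2, v0), d21 = Dot(v2, v1);
    double denom = d00 * d11 - d01 * d01; // zero only for a degenerate triangle
    double v = (d11 * d20 - d01 * d21) / denom;
    double w = (d00 * d21 - d01 * d20) / denom;
    return v >= 0.0 && w >= 0.0 && v + w <= 1.0;
}
```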

#4950526 Strange artefacts when rendering texture fonts from freetype

Posted by clb on 19 June 2012 - 03:31 AM

I think you have an off-by-one error in the code. The line
  char p = ((char*)bmp.buffer)[x+((bmp.rows-y)*bmp.width)];

looks like it should instead be

  char p = ((char*)bmp.buffer)[x+((bmp.rows-1-y)*bmp.width)];

#4950324 Strange artefacts when rendering texture fonts from freetype

Posted by clb on 18 June 2012 - 12:19 PM

Try using gDEBugger to take a snapshot of the texture in GPU memory, and see what the pixel contents are. I'm using FreeType2 for my fonts, and haven't observed such an artifact. Perhaps there's an off-by-one copying error occurring somewhere in the code, and the texture actually does contain that row of pixels in GPU memory, or you're addressing one line too low?

#4950252 High performance texture splatting?

Posted by clb on 18 June 2012 - 08:10 AM

Perhaps try avoiding mix() and optimizing the code manually:

void main()
{
    lowp vec4 alpha = texture2D(texture4, v_texcoord);

    lowp vec4 color0 = texture2D(texture0, v_texcoord);
    lowp vec4 color1 = texture2D(texture1, v_texcoord);
    lowp vec4 color2 = texture2D(texture2, v_texcoord);
    lowp vec4 color3 = texture2D(texture3, v_texcoord);

    gl_FragColor = v_color * (alpha[0] * color0 + alpha[1] * color1 + alpha[2] * color2 + alpha[3] * color3);
}

(I reindexed how the components of the alpha vector map to the color textures, for clarity.) The idea is that alpha[0] is already precomputed in the texture to be 1.0 - alpha[1] - alpha[2] - alpha[3], so the shader doesn't need to compute it. I feel this would be faster than using mix(), but can't be sure without profiling. Let me know how it compares.

Another potential optimization is to drop one or two splat texture channels and subdivide your mesh by which splat textures each triangle uses. Also, if the splat map is low-frequency, try storing the splat weights as vertex attributes and passing them through to the pixel shader, which saves you one texture read.

Finally, if the splat map is very low-frequency, you can try just decaling the contents, i.e. manually generating geometry planes that you alpha-blend on top of the terrain.

#4950215 Future-proof technologies to start learning now

Posted by clb on 18 June 2012 - 05:57 AM

C#, Java, C/C++, Objective-C/C++, HTML(5), CSS, XML, Sockets, JavaScript, Python, OpenGL3, GLES2, Direct3D11 are all keywords you can see desired in games-related job postings today. Android and iOS experience is very hot for several games companies.

Off the top of my head, some technologies I can think of that are phasing out are D3D9, MDX, OpenGL2, GLES1, XNA and Symbian.

Qt is a bit of an interesting case - Qt for mobile is pretty dead with Nokia, but for desktop and non-games/non-3D it's still strongly alive.

#4950080 Which OpenGL version?

Posted by clb on 17 June 2012 - 02:18 PM

I've done interviews for programmers, and my take is that I don't care whether you've done OGL2 or OGL3. The more important GPU-related things I test are whether they can do shaders, and whether they understand the concept of writing code that's executed by two separate processors (CPU & GPU) in parallel, and what implications that has for performance and code structure.

As a programmer, I don't touch OpenGL 2 at all. In my hobby engine where I don't care about legacy compatibility, it's just Direct3D11 and OpenGL3. That takes a lot of headache away, and keeps things simpler.

#4950064 Struct Initialization Within Struct

Posted by clb on 17 June 2012 - 01:22 PM

Yes, but unfortunately C++ doesn't allow you to do the initialization as conveniently as you do on line 7. You'll have to do it in a constructor, as follows:

[source lang="cpp"]
struct A
{
    int iSize;
    A(int x = 10) { iSize = x; }
};

struct B
{
    A myA;
    B() : myA(2) {}
};
[/source]

#4950050 Trouble understanding Gimbal Lock from a mathematical perspective

Posted by clb on 17 June 2012 - 11:24 AM

In that example, the writer describes gimbal lock in the context of representing rotations using Euler angles. With Euler angles, one specifies rotations as a sequence of three successive rotations around predetermined axes. The convention of which axes to use, and in which order, is chosen arbitrarily or depending on the application; e.g. one might represent orientation as a rotation about X, then Y, then Z.

Since you have three scalars to specify rotations with, you have 3 degrees of freedom. If you fix any one of them to a constant value (say y = 25deg), you'd expect to still be able to rotate in 2 degrees of freedom, because you can still freely manipulate two rotation angles (and likewise, if you fix two of the angles to constant values, you'd expect to be left with 1 DOF). But due to the gimbal lock effect, this is not always the case. It is possible to fix just one of the angles to a specific constant value in a way that constrains the system to only 1 degree of freedom, instead of the 2 DOF you'd expect.

The problem is that whichever order we pick for the Euler angles, there exists an angle value for the middle rotation that causes the first and the third axes to line up (+/-90 degrees, depending on the choice of rotation order), so that varying the angles of the first and the third rotations both produce a rotation about the same axis. If we fix the middle rotation to this value, we freeze not one but two degrees of freedom, since we are left with the ability to rotate the whole object around only one axis: our expected 2-DOF system is now only a 1-DOF system. Altering the angle of the middle rotation immediately breaks the gimbal lock, giving back the full 3-DOF rotations.

Mathematically, we can see this as follows. Represent the rotation using Euler angles as R = Rx(a) * Ry(b) * Rz(c), where Ri(v) is a rotation matrix about axis i by angle v, using the Matrix*vector convention. Fix b to 90 degrees (or -90, if I have the sign convention flipped). Then it can be seen that Ry(90deg) * Rz(c) == Rx(c) * Ry(90deg), so the full rotation becomes R = Rx(a) * Rx(c) * Ry(90deg) = Rx(a+c) * Ry(90deg): a constant rotation preceded by a single x-axis rotation that both a and c feed into.

What this equation means is that we still have two free scalars, a and c, to specify the orientation, but they both produce rotation about the x axis: altering a and altering c are interchangeable and have the same end effect. Effectively, we fixed only one rotation axis to a special value, but managed to kill two degrees of freedom with that move.
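This collapse is easy to verify numerically. The sketch below uses standard right-handed rotation matrices in the column-vector (Matrix*vector) convention and fixes the middle angle to +90 degrees, under which the composed rotation depends only on a+c; with -90 degrees the dependence would be on a-c instead.

```cpp
#include <cassert>
#include <cmath>

struct Mat3 { double m[3][3]; };

Mat3 Mul(const Mat3& a, const Mat3& b)
{
    Mat3 r{}; // zero-initialized
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Standard right-handed rotation matrices (column-vector convention).
Mat3 Rx(double t) { double c = std::cos(t), s = std::sin(t); return {{{1,0,0},{0,c,-s},{0,s,c}}}; }
Mat3 Ry(double t) { double c = std::cos(t), s = std::sin(t); return {{{c,0,s},{0,1,0},{-s,0,c}}}; }
Mat3 Rz(double t) { double c = std::cos(t), s = std::sin(t); return {{{c,-s,0},{s,c,0},{0,0,1}}}; }

// R = Rx(a) * Ry(b) * Rz(c)
Mat3 Euler(double a, double b, double c) { return Mul(Rx(a), Mul(Ry(b), Rz(c))); }
```

With b fixed at 90 degrees, Euler(30deg, 90deg, 20deg) and Euler(10deg, 90deg, 40deg) produce identical matrices, since a+c is 50 degrees in both; with b at some other value they differ.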

It should be remembered that there's nothing intrinsically wrong with using Euler angles; it's just that having that middle value be +/-90 can be problematic. If you don't need those kinds of angles, e.g. in an FPS you can't tilt your head above 90 degrees to look backwards, then you can pick an Euler convention (XYZ/XZY/ZYX/etc., depending on your coordinate system) for the camera that has the constrained rotation axis in the middle, and you are not going to have any problems.

The other solution (in the game development field) is to not do your rotation logic as sequences of rotations about fixed axes. Using quaternions implicitly avoids this, not because of some special mathematical property they have, but because with quaternions you don't logically use sequences of fixed axes. If you do, and just have QuatYaw, QuatPitch and QuatRoll, you're no better off than when you were using Euler angles and are still susceptible to gimbal lock.

#4949980 OpenGL performance question

Posted by clb on 17 June 2012 - 03:26 AM

I use the second approach (although with glDrawElements). Performance is not a problem at the moment (I can do hundreds of UI windows), and if it gets too slow, I'll investigate whether batching manually might help.

In the first approach, it might not be necessary to update the whole VB when one rectangle changes: you could update a sub-range of the vertex buffer if you keep track of which UI element is at which index. Although, I've got to say that in my codebase this might get a bit trickier than it sounds, since I'm manually double-buffering my dynamically updated VBs (which I have observed to give a performance benefit on GLES2 even when GL_STREAM_DRAW is used), so the sub-updates would need to be made aware of the double-buffering.
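The double-buffering mentioned above amounts to very little code; a sketch with hypothetical handles standing in for GL buffer names:

```cpp
#include <cassert>

// Two buffer objects are alternated each frame, so the buffer being written
// this frame is never the one the GPU may still be reading from last frame's
// draw calls. The unsigned handles stand in for names from glGenBuffers.
struct DoubleBufferedVB
{
    unsigned vbo[2];
    int frameParity = 0;

    // The buffer to upload into (glBufferData/glBufferSubData) and draw from
    // this frame.
    unsigned Current() const { return vbo[frameParity]; }

    // Call once per frame after submitting the draw calls.
    void Swap() { frameParity ^= 1; }
};
```

This is exactly why sub-range updates need extra care: the other buffer is one frame stale, so a changed rectangle has to be re-uploaded into both buffers over two consecutive frames.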