Jump to content

  • Log In with Google      Sign In   
  • Create Account

Banner advertising on our site currently available from just $5!

1. Learn about the promo. 2. Sign up for GDNet+. 3. Set up your advert!


Member Since 30 Aug 2006
Offline Last Active Jun 17 2015 12:10 AM

#5235077 Slow down when using OpenMP with stl vector per thread

Posted by JoeJ on 16 June 2015 - 04:05 AM

I think, if you want to stick to the STL, you should replace the set with a priority queue and see what happens


That's it! biggrin.png

I'm not yet 100% sure i did that correctly, but results match and it's much faster already single threaded and also gives speed up using OMP.

On a six core, priority_queue method takes 0.4 seconds ST and 0.07 seconds MT - just perfect smile.png

(set method timings have been 1.4 / 2 seconds)





but rather than using an all-pairs algorithm, you are using a single-source algorithm and repeating it for every pair.

Just to try out if the idea works and Dijkstra was already implemted. There was a note above the code that i already replaced this with a fast method.


I've tried the other minor things you have suggested, but they did not differ measureable in runtime.

I'ts a result of being lazy, paranoid and thinking "compiler should be clever enough to detect this" ;D






I really appreciate your help guys!


In the future I'll look more closely what happens under the hood of STL (use it very rarely).




Here the changes:


EDIT: Fixed a bug - Just saw Pink Horror has already noticed when posting :)

    struct FirstFar
        int operator() (const std::pair<int,float> &first,
                        const std::pair<int,float> &second)
            return first.second > second.second;

    static void DijkstraComputePaths2 (int source,
                              const adjacency_list_t &adjacency_list,
                              std::vector<float> &min_distance,
                              std::vector<int> &previous)
        int n = adjacency_list.size();
        min_distance.resize(n, max_weight);
        min_distance[source] = 0;
        previous.resize(n, -1);
        std::priority_queue< std::pair<int,float>, std::vector<std::pair<int,float> >, FirstFar> vertex_queue;
        vertex_queue.push (std::make_pair(source, min_distance[source]));

        while (!vertex_queue.empty())
            int u = vertex_queue.top().first;
            float dist = vertex_queue.top().second;
            // Visit each edge exiting u
            const std::vector<Neighbour> &neighbors = adjacency_list[u];
            for (std::vector<Neighbour>::const_iterator neighbor_iter = neighbors.begin();
                 neighbor_iter != neighbors.end();
                int v = neighbor_iter->target;
                float weight = neighbor_iter->weight;
                float distance_through_u = dist + weight;
                if (distance_through_u < min_distance[v])
                    min_distance[v] = distance_through_u;
                    previous[v] = u;
                    vertex_queue.push(std::make_pair(v, min_distance[v]));

#5231845 Automatic UV map generation: OpenNL LSCM doesnt work - why ?

Posted by JoeJ on 30 May 2015 - 09:26 AM

It looks like it does not work on the bunny because it's a closed mesh without boundary - maybe you first need to provide or create seems somehow?


However, there is another library called OpenGI you might wanna check: http://opengi.sourceforge.net/


I have not tried any af those two yet.

I work on something similar (segment a mesh to convex parts) for more than a month now and it's very hard.

Pls let me know if you find a useful library...

#5225229 Better way for geometry problem

Posted by JoeJ on 24 April 2015 - 09:33 AM

I think it should work without intersection tests.

You can get the solid angle from a sphere to camera by: solidAngle = asin (sphereRadius / distanceToCamera)

You also need the angle between both sphere centers in relation to camera: angleAtoD = acos((posE-cam).Unit().Dot((posD-cam).Unit()))),

then it should be possible to do this:


vector diff = posD - posA.

posE = posA + diff * solidAngleA / angleAtoD

posF = posD - diff * solidAngleD / angleAtoD


EDIT: Removed all those divides by 2 from pseudo code - that was wrong thinking.

EDIT2: Fixed another bug.


In case there are still bugs, here is how it is supposed to work:


All the rays from the camera are straight lines.

Also the line between the spheres is straight.

Thus their interesctions are linear related to the angles, and the angles are easy to calculate here.

#5207452 Using torque to rotate an object into a desired orientation

Posted by JoeJ on 29 January 2015 - 10:14 AM

Using PID should not be necessary for a simple problem like that -
they are hard to tune right and mostly there's a better analytical solution.
Torques ar harder than forces so first a simple example,
Calculate force to reach a target:

Vec3 targetLinvel = (targetPos - body.com) / timestep; // desired linear velocity to move target distance in one step
targetLinvel -= body.linearVelocity; // subtract current velocity
Vec3 force = 0.3 * targetLinvel * (body.mass / timestep); // soften by some factor <1 to avoid oszillation

This should work and you have only a single tuning factor, which is easy to understand and predict.

Now the same for orientation:

Vec3 axis; float angle;
ToAxisAndAngle (&axis, &angle, object_forward, desired_forward); // normalized axis and radians

Vec3 targetAngvel = axis * angle / timestep;
targetAngvel -= body.angularVelocity;

Vec3 torque = targetAngvel / timestep;
torque = body.WorldSpaceMatrix.Unrotate (torque); // rotate to body local space...
torque.x *= body.InertiaXX; //... so we can apply 3D inertia in the space where it is defined
torque.y *= body.InertiaYY;
torque.z *= body.InertiaZZ;
torque = body.WorldSpaceMatrix.Rotate (torque); // rotate back to world space
torque *= 0.3; // avoid oszillation 

#5201483 Using a skydome as an irradiance map

Posted by JoeJ on 03 January 2015 - 02:55 AM

But I am thinking about getting the vertex in the skydome that comes closest to the normal ray


That would not be correct, because the entire hemisphere around the normal contributes to the lighting of diffuse materials.

If you use a cube map for the skydome, then one half of all its texels would contribute (assuming no occlusion).


The factor for each texel would be cos(angle) * texelArea / halfSphereSurfaceArea.


Where angle is the normal angle,

halfSphereSurfaceArea is 2*PI for a unit sphere,

and texelArea bepends on what mapping you use (Enviroment map, cube map, etc...) and might be stored in alpha.


For sure you can precompute the whole map and then bilinear sampling with normal would be fine.

It's similar to a blured version, so spherical harmonics could store it as well and you could replace a texture fetch with shader constants / parameters.





#5201061 Need help understanding how to offset rays by specific angles

Posted by JoeJ on 31 December 2014 - 04:51 PM

You could use spherical coords to make a random unit vector, using two random angles:


unitDir.z = -sin(theta) * cos(phi);
unitDir.x = -sin(theta) * sin(phi);
unitDir.y = cos(theta);


To keep those random sample directions in the hemisphere, simply invert if dot product with normal is negative.

To sample a spherical area light you get the enclosing solid angle by asin (sphereRad / sphereDistance).

With this you can get random directions inside the cone that points to the sphere if it would be along the normal.


To rotate this 'cone of sample directions' to the sphere position, you get the rotaion axis by:

normalize(cross(normal, sphereDir))


... and the rotation angle by:

dot(cross(normal, sphereDir))


(sphereDir is unit vector from sample position to sphere position).


Looking at the Rodrigues Formula it makes me think it's the same as rotating a vector by a quaternion, but i did not look close.

Depending on how much sample directions you use, a 3x3 rotation matrix might be faster.

All in all a lot of trig, maybe some one knows faster methods...




Recently i made this simple random function and it performed surprisingly well - may be faster than precomputed poisson samples.

It does not need to store a seed so its very GPU friendly.



inline unsigned int randI (unsigned int time)
        time *= 1664525;
        time ^= (time << 16);
        time *= 16807;
        return time;
    inline float randF (unsigned int time)
        unsigned int s = randI(time);
        return float(s) / float(0x100000000LL);

#5200293 Compute shader: varying the number of threads

Posted by JoeJ on 27 December 2014 - 01:35 PM

Tools like AMD CodeXL helped me to understand why choosing another thread size can affect performance.

For Nvidia there is something similar - Nsight, which i have not tried yet.

The bad thing is that this depends on chip type, so it might make sense to compile the shader with the size dependent on that.


There are however cases, where you don't have a choice anyways.

F. ex. You have nodes with varying number of neighbours - after the neighbour number has been found, you would bin them into groups of <=64, <=128, <=256.

Then you launch the same shader compiled three times with thread sizes of 64, 128, 256, which does something like:


if (threadID < numNeighbours) LDS[threadID] = neighbourDataFromGlobalMemory[binGroupIndicesBuffer[globalID]]; // move all data to LDS memory so all threads have fast access to it

// do some work like sorting by distance...


Downsides of this are:

You have only 3/4 of threads busy on average.

You need to invoke 3 shaders instead one (and your API / Hardware may not allow to execute them simultaneously).


But because you can solve the problem in fast LDS it will be the fastest way to do it, if you have enough nodes to process.

#5191669 Is there a known bug with glMemoryBarrier() when using the latest AMD Catalys...

Posted by JoeJ on 07 November 2014 - 08:45 AM

Same happens to me here on 280x sad.png

It compiles when using Intel CPU driver, but also does not when using AMD CPU driver.


Now you have a nice collection of driver bugs to submit.


I hope AMD soon releases a new driver with generic 2.0 support and fixing this.

Also when both AMD and Intel have 2.0 then, maybe NV is willing to update their OpenCL too.


Edit: Same for cl_khr_spir, cl_khr_image2d_from_buffer, cl_khr_image2d_from_buffer, cl_khr_dx9_media_sharing, ... ?


What a mess



See http://devgurus.amd.com/thread/155539

Very similar problem, so maybe you can just remove the pragma line and use the extension?

#5191533 Is there a known bug with glMemoryBarrier() when using the latest AMD Catalys...

Posted by JoeJ on 06 November 2014 - 10:02 AM

Maybe this one:




I'm not sure because i also dwnloaded OpenCL 2.0 beta drivers (which did not work for me, because they need win 8).

But i believe this is the right file - it may be a beta driver, but i don't think so.


I'm using R9 280x

#5191529 Is there a known bug with glMemoryBarrier() when using the latest AMD Catalys...

Posted by JoeJ on 06 November 2014 - 09:35 AM

What i thought is using ping pong image, meaning CL writes to imgB while GL reads from imgA, swapping pointers at the start of each frame.

The image should contain only the stuff that has been updated, so i still need a compute shader to copy this data to the very large full scene lightmap.

But there should be no sync issues, do you agree? Tha lag of one frame should not matter for me.


Edit: cl_khr_gl_event is supported on AMD according to my log:


Selected Platform Vendor: Advanced Micro Devices, Inc.
Version: OpenCL 1.2 AMD-APP (1573.4)

Device: Tahiti
Version: OpenCL 1.2 AMD-APP (1573.4)
Adress Bits: 32
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
Global Cache Size: 16384
Cache Line: 64
Local Memory Size: 32768
Max Work Group Size: 256
Max Work Item Sizes: [256,256,256]
Max Image Width: 16384

#5191521 Is there a known bug with glMemoryBarrier() when using the latest AMD Catalys...

Posted by JoeJ on 06 November 2014 - 08:27 AM

I remember better now - in deed i needed to add the fence when changing from NV to AMD. For NV the glMemoryBarrier() alone has worked.

Using the fence then everything worked as expected, no need for coherent.


I hope moving data to GL will not expensive for me too. I plan to share an image between CL and GL and assumed there should be no slow down.

#5191395 Is there a known bug with glMemoryBarrier() when using the latest AMD Catalys...

Posted by JoeJ on 05 November 2014 - 01:55 PM

I've had similar issues on NV cards too, code below is that i ended up using to call after glDispatchCompute.

Just to let you know, i gave up on compute shaders for now and use OpenCL because it's much faster.
I do realtime GI and my stuff covers a lot of different algorithms from complex tree traversal to simple brute force stuff.
On NV cards OpenCL is about twice as fast than Compute shader (!).
On ATI OpenCL is about 20% faster.

And... ATI R9 280X is 2.3x faster than Geforce Titan.

void GPUwait ()
	GLsync syncObject = glFenceSync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
	GLenum ret = glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
	if (ret == GL_WAIT_FAILED || ret == GL_TIMEOUT_EXPIRED)
		SystemTools::Log ("glClientWaitSync failed./n");
	glMemoryBarrier (GL_ALL_BARRIER_BITS);
	glDeleteSync (syncObject);

#5182229 Compute shader runs more than once?!

Posted by JoeJ on 22 September 2014 - 02:59 PM

At least this problem helped me to solve mine :)


I've not read much yet about sync objects, but adding it to my code fixed my synchrinisation issues:


void GPUwait ()
    GLsync syncObject = glFenceSync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    GLenum ret = glClientWaitSync(syncObject, GL_SYNC_FLUSH_COMMANDS_BIT, 1000*1000*1000);
    if (ret == GL_WAIT_FAILED || ret == GL_TIMEOUT_EXPIRED)
        SystemTools::Log ("glClientWaitSync failed./n");
    glMemoryBarrier (GL_ALL_BARRIER_BITS);
    glDeleteSync (syncObject);


Maybe you should try to add the glMemoryBarrier (i assumed this alone should block until shader is finished, but seems i was wrong)

And in my paranoia i've made all my data coherent in shader:


layout(std430, binding = 0) coherent buffer


Just try, i don't know what i'm talking about :)

My confusion raised too high the last time and i went to OpenCL... seems much easier to learn GPGPU.

#5162992 Can't use imageBuffer with integer data (Compute Shader)?

Posted by JoeJ on 26 June 2014 - 05:47 AM

Ooops, solved:


layout (binding = 4, rgba32ui) uniform uimageBuffer TpackInt;


...i just forgot to add the u rolleyes.gif

#5065328 How do I get the euler angle from a matrix?

Posted by JoeJ on 27 May 2013 - 01:53 PM

I once extended that so it can handle various orders. I did not check if it works with orders like XYX, but XYZ is ok.

Note that it's for OpenGL matrices, so you need to swap matrix indices.


static v3f ToEulerAngles (matrix4f &m, int order = 0x012) // 0x012 = xyz, 0x201 = zxy and so on...

  int a0 = (order>>8)&3;
  int a1 = (order>>4)&3;
  int a2 = (order>>0)&3;

  v3f euler;
  // Assuming the angles are in radians.
  if (m.m[a0][a2] > 0.9999)
  { // singularity at north pole
   euler[a0] = -atan2(m.m[a2][a1], m.m[a1][a1]);
   euler[a1] = -PI/2;
   euler[a2] = 0;
   return euler;
  if (m.m[a0][a2] < -0.9999)
  { // singularity at south pole
   euler[a0] = -atan2(m.m[a2][a1], m.m[a1][a1]);
   euler[a1] = PI/2;
   euler[a2] = 0;
   return euler;
  euler[a0] = -atan2(-m.m[a1][a2], m.m[a2][a2]);
  euler[a1] = -asin ( m.m[a0][a2]);
  euler[a2] = -atan2(-m.m[a0][a1], m.m[a0][a0]);
  return euler;