I'm writing a basic OpenCL path tracer but I'm having problems with the dispatched kernel only being partially completed depending on how many times a loop iterates in the kernel. I'm dispatching 1024x1024 work items, with each work group having 8x8 work items.
I've reduced the kernel code to the following to still reproduce the issue.
[source=c++]kernel void ray_trace(read_only global SceneTriangle *triangles, int numTriangles,
read_only image2d_t rays, write_only image2d_t out)
{
size_t x = get_global_id(0);
size_t y = get_global_id(1);
float3 ray = read_imagef(rays, (int2)(x, y)).xyz;
const float3 cameraPosition = (float3)(0.0f);
const float3 lightPosition = cameraPosition + (float3)(0.1f, 0.0f, 0.0f);
float4 result = (float4)(0.0f, 0.0f, 0.0f, 1.0f);
#define MAX_PATHS 4
for ( int path = 0; path < MAX_PATHS; path++ )
{
RayTriangleTestResult closestResult;
closestResult.intersects = true;
closestResult.t = 999999.9f;
for ( int i = 0; i < numTriangles; i++ )
{
RayTriangleTestResult testResult = test_ray_triangle(triangles.v[0].position.xyz, triangles.v[1].position.xyz, triangles.v[2].position.xyz, cameraPosition, ray);
}
result.xyz += 0.1f;
}
result.xyz /= MAX_PATHS;
result.xyz = powr(result.xyz, (float3)(1.0f / 2.2f));
write_imagef(out, (int2)(x, y), result);
}[/source]
The host-side code is as follows:
[source=c++]const size_t globalWorkSize[] = { imageDesc.image_width, imageDesc.image_height };
const size_t localWorkSize[] = { 8, 8 };
error = clEnqueueNDRangeKernel(cmdQueue, rayTrace, 2, NULL, globalWorkSize, localWorkSize, (int)ArraySize(waitOn), waitOn, &traceEvent);
if ( error == CL_SUCCESS )
{
size_t imagePitch;
size_t imageSlice;
void *imageData = clEnqueueMapImage(cmdQueue, outImage, CL_TRUE, CL_MAP_READ, origin, region, &imagePitch, &imageSlice, 1, &traceEvent, NULL, &error);
if ( imageData && error == CL_SUCCESS )
{
stbi_write_png("output.png", (int)imageDesc.image_width, (int)imageDesc.image_height, 4, imageData, 0);
}
}[/source]
The image being written to is 1024x1024 in size (the same as the number of work items) so the expected result is that the image is filled with a flat grey colour. However, if I increase MAX_PATHS then the kernel doesn't fully complete. The image is only partially filled with the grey colour I expect, and I can double check this by adding a printf to the kernel to print out the work item that's being executed. If I increase MAX_PATHS but reduce numTriangles then I can get a full dispatch. Also, if I remove the inner loop then I also get a full dispatch.
Here's what the partially filled image looks like:
I'm developing this path tracer on OS X, with the following limits:
Device 0 (HD Graphics 4000):
Global mem size: 1073741824 bytes (1024.000 MB)
Local mem size: 65536 bytes (0.062 MB)
Max clock frequency: 1150 MHz
Max compute units: 16
Max alloc size: 268435456 bytes (256.000 MB)
Max workgroup size: 512
Max work item size: (512 512 512)
It doesn't look like I'm running into any of the limits listed above.
Does anybody have any idea what could be causing the problem?