Problems with partial OpenCL kernel dispatch

Started by
4 comments, last by Xycaleth 9 years, 1 month ago

I'm writing a basic OpenCL path tracer but I'm having problems with the dispatched kernel only being partially completed depending on how many times a loop iterates in the kernel. I'm dispatching 1024x1024 work items, with each work group having 8x8 work items.

I've reduced the kernel code to the following to still reproduce the issue.
[source=c++]kernel void ray_trace(read_only global SceneTriangle *triangles, int numTriangles,
read_only image2d_t rays, write_only image2d_t out)
{
size_t x = get_global_id(0);
size_t y = get_global_id(1);

float3 ray = read_imagef(rays, (int2)(x, y)).xyz;

const float3 cameraPosition = (float3)(0.0f);
const float3 lightPosition = cameraPosition + (float3)(0.1f, 0.0f, 0.0f);

float4 result = (float4)(0.0f, 0.0f, 0.0f, 1.0f);

#define MAX_PATHS 4
for ( int path = 0; path < MAX_PATHS; path++ )
{
RayTriangleTestResult closestResult;
closestResult.intersects = true;
closestResult.t = 999999.9f;

for ( int i = 0; i < numTriangles; i++ )
{
RayTriangleTestResult testResult = test_ray_triangle(triangles.v[0].position.xyz, triangles.v[1].position.xyz, triangles.v[2].position.xyz, cameraPosition, ray);
}

result.xyz += 0.1f;
}

result.xyz /= MAX_PATHS;
result.xyz = powr(result.xyz, (float3)(1.0f / 2.2f));

write_imagef(out, (int2)(x, y), result);
}[/source]

The host-side code is as follows:
[source=c++]const size_t globalWorkSize[] = { imageDesc.image_width, imageDesc.image_height };
const size_t localWorkSize[] = { 8, 8 };
error = clEnqueueNDRangeKernel(cmdQueue, rayTrace, 2, NULL, globalWorkSize, localWorkSize, (int)ArraySize(waitOn), waitOn, &traceEvent);
if ( error == CL_SUCCESS )
{
size_t imagePitch;
size_t imageSlice;
void *imageData = clEnqueueMapImage(cmdQueue, outImage, CL_TRUE, CL_MAP_READ, origin, region, &imagePitch, &imageSlice, 1, &traceEvent, NULL, &error);
if ( imageData && error == CL_SUCCESS )
{
stbi_write_png("output.png", (int)imageDesc.image_width, (int)imageDesc.image_height, 4, imageData, 0);
}
}[/source]

The image being written to is 1024x1024 in size (the same as the number of work items) so the expected result is that the image is filled with a flat grey colour. However, if I increase MAX_PATHS then the kernel doesn't fully complete. The image is only partially filled with the grey colour I expect, and I can double check this by adding a printf to the kernel to print out the work item that's being executed. If I increase MAX_PATHS but reduce numTriangles then I can get a full dispatch. Also, if I remove the inner loop then I also get a full dispatch.

Here's what the partially filled image looks like:
dhfiop.jpg

I'm developing this path tracer on OS X, with the following limits:


Device 0 (HD Graphics 4000):
  Global mem size:       1073741824 bytes (1024.000 MB)
  Local mem size:        65536 bytes (0.062 MB)
  Max clock frequency:   1150 MHz
  Max compute units:     16
  Max alloc size:        268435456 bytes (256.000 MB)
  Max workgroup size:    512
  Max work item size:    (512 512 512)

It doesn't look like I'm running into any of the limits listed above.

Does anybody have any idea what could be causing the problem?

Advertisement

Wait, you can specify read_only on globals? It was my understanding it was for image objects only.

Parameter 7 to clEnqueueNDRangeKernel is currently (int)ArraySize(waitOn). Leaving aside it is a cl_uint, the pointed events must complete and I have no idea what is going on with them.

Considering png typically goes with integers I would also check out the way you mangle the resulting data.

Ultimately, some drivers have watchdogs and will kill dispatches if they take too much time to run. Considering the inner loop seems to be doing nothing (the value is trashed right away) I think that's fairly indicative. Am I missing some side effect?

Previously "Krohm"

Wait, you can specify read_only on globals? It was my understanding it was for image objects only.

You're right, it's only for images. I'm still new to OpenCL so there's bound to be mistakes!

Parameter 7 to clEnqueueNDRangeKernel is currently (int)ArraySize(waitOn). Leaving aside it is a cl_uint, the pointed events must complete and I have no idea what is going on with them.

Thanks, fixed this now as well. The events I'm waiting on are generated by previous enqueued commands (clearing the image to black, and generating the primary rays).

Considering png typically goes with integers I would also check out the way you mangle the resulting data.

I don't think there's any problems with writing the PNG image. I did have the cornell box rendered with just direct lighting and correctly written to the png image. It's just when I started increasing the complexity that I hit this problem.

Ultimately, some drivers have watchdogs and will kill dispatches if they take too much time to run. Considering the inner loop seems to be doing nothing (the value is trashed right away) I think that's fairly indicative. Am I missing some side effect?

It could be that the work items are being killed if that's the case. The loop doesn't have any side effects, but I needed to keep it in there so that it would still reproduce the problem. The loop did actually do something with the test result smile.png

I had a look in the OSX system info log and it looks like my kernel was being killed for taking too long. Thanks for pointing me in the right direction, Krohm!

Thank you for the feedback. Good to know those get logged somewhere. Is that an OSX thing or something in the driver?

Previously "Krohm"

I think it's a driver thing but it gets logged to a central system log. You can find them in the "Console" application. It shows a GPU restart occurred and a load of information about why it was restarted and a load of diagnostic info too.

This topic is closed to new replies.

Advertisement