Community Reputation

189 Neutral

About tmarques4

  1. Thanks for the tips, Ohforf sake and jms bc. After some research I've learned more about textures: there's a CUDA feature called surfaces that allows write access to texture memory, and it seems like a good solution to my problem.

     My only issue right now is that I've already started implementing the shared-memory version of my code, so changing yet again to another approach would cost me a lot of time (I'd have to implement a workaround for the 130MB limit on a single 1D texture). So, for now, I'll test shared memory and check the speed-up; if it's not good enough, I'll switch to texture memory.

     Also, I didn't know CUDA supported half-precision floats (I thought that was only an OpenCL feature), so I'll switch to half, since it could effectively double the number of mesh values a single CUDA block can reach.

     Thanks again, I learned a lot.
  2. Thanks for the suggestion but, correct me if I'm wrong, texture memory is a read-only, special type of global memory (my mesh has to change its values at each iteration). So shared memory access is still going to be faster than texture memory.

     Also, from looking at the simpleCubemapTexture sample, texture memory seems better suited to surfaces than to volumes like the one in my case. There are also size limits on a single texture that I'd have to consider: the meshes in my code can easily reach 100MB.
  3. Good, I didn't know that. If many blocks are allowed to run in parallel, does that mean each block has its own shared memory?

     The program output says it's a Tesla 2090; its compute capability is 2.0. The kernel parameters are float *mesh3D and int time, and there are actually two "factor functions", Laplace and Chemotaxis; their parameters are the surrounding values (top, bottom, left, right, front, back and middle).

     Your guess is correct, except that there are 7 samplings. Also, each point that belongs to the grid has 7 cells (the grid is actually a microscopic portion of human skin tissue), so it's 7*7*N redundant accesses.

     Laplace and Chemotaxis are calculated for each cell at a mesh position, so there are lots of accesses to global values. I want to move these values to faster memory before Laplace and Chemotaxis start accessing them.
  4. Frustum Culling

    Assuming you know the point of intersection (the corner) between PB and PL (let's call it PI), you can calculate a vector V = (P - PI). If length(V) < sphereRadius, your point is inside. Otherwise, check the dot product of normalized(V) with the PB and PL plane normals: if and only if both values are negative, the point is outside; if the comparison fails, it's inside.

    Take this with a grain of salt, since I've never tested or even considered this case when doing frustum culling, but this would be my first guess at solving this issue.

    Hope it helps.
  5. That's what I'm doing: before the program starts executing, I load the static settings (e.g. mesh size, mesh length...) as constants using cudaMemcpyToSymbol and send only the mesh and time parameters with the kernel. I'm trying to minimize transfers from host to device as much as possible.

     I didn't know about shared memory (I'm basically a beginner in CUDA) and, after some research on the internet, I learned that the maximum cache size for each thread block is around 16-32KB. My kernels run 1024 threads per block, needing 60KB; therefore, to make it work with my kernel, I'd have to reduce the number of threads per block from 1024 to 512. Is this generally a good tradeoff?
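To put rough numbers on that tradeoff, here is a minimal host-side C++ sketch. It assumes the footprint scales linearly with the block size, so the 60KB quoted for 1024 threads works out to 60 bytes per thread, and it treats the 16KB and 32KB figures from the post as budgets rather than as verified hardware limits. The helper name maxThreadsForBudget is made up for illustration and is not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): halve the block size until the
// block's shared-memory footprint fits the given budget.
// Returns 1 if even a single thread does not fit.
inline std::size_t maxThreadsForBudget(std::size_t bytesPerThread,
                                       std::size_t budgetBytes,
                                       std::size_t startThreads = 1024) {
    assert(bytesPerThread > 0 && startThreads > 0);
    std::size_t threads = startThreads;
    while (threads > 1 && threads * bytesPerThread > budgetBytes) {
        threads /= 2;  // keep the block size a power of two
    }
    return threads;
}
```

Under those assumptions, maxThreadsForBudget(60, 32 * 1024) gives 512 and a 16KB budget gives 256. Whether the smaller block is actually faster depends on how many blocks end up resident at once, so it still has to be measured.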
  6. Those values aren't actually global memory, they are private. I'm very sorry for the confusion and will try to provide some context:

     I have a 3D mesh with each position representing a float value, and I use GPGPU to update this mesh T times.

         // This is the function called from host to device.
         __global__ void Update3DMesh(float *mesh, int time)
         {
             // Retrieve the index from the thread ID that represents the position being updated.
             // Retrieve the center, top, bottom, left, right, front and back position values relative to the index.
             // Compute N factors using these surrounding position values according to time.
             float factor1 = ComputeFactor1(parameters);
             float factor2 = ComputeFactor2(parameters);
             ...
             float factorN = ComputeFactorN(parameters);

             // Update the position.
             mesh[index] = factor1 + factor2 ... + factorN;
         }

     I'm struggling to figure out the best way for ComputeFactor to receive the parameters. I thought that maybe packing all these surrounding positions into a (float *) vector and passing the pointer (sizeof(float *) bytes), instead of passing all positions separately (sizeof(float)*7 bytes), would result in a faster program. I'll just have to compare both methods.

     I appreciate the responses, Khatharr and Bacterius.
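As a plain C++ (CPU-side) sketch of the "retrieve the surrounding values, then pass them on" step: the seven values are gathered once into a small struct that the factor function receives by const reference. The names flatIndex, Stencil7, gather and laplace are made up for illustration, the x-fastest memory layout and unit grid spacing are assumptions, and boundary positions are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flatten (x, y, z) into a 1D index, x varying fastest.
inline std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t z,
                             std::size_t nx, std::size_t ny) {
    return x + nx * (y + ny * z);
}

// The seven values read for one mesh position.
struct Stencil7 {
    float center, left, right, top, bottom, front, back;
};

// Gather the neighbourhood of an interior position (no boundary handling).
inline Stencil7 gather(const std::vector<float>& mesh,
                       std::size_t x, std::size_t y, std::size_t z,
                       std::size_t nx, std::size_t ny) {
    assert(x > 0 && y > 0 && z > 0);  // interior positions only
    assert(flatIndex(x + 1, y + 1, z + 1, nx, ny) < mesh.size());
    return Stencil7{
        mesh[flatIndex(x, y, z, nx, ny)],
        mesh[flatIndex(x - 1, y, z, nx, ny)],
        mesh[flatIndex(x + 1, y, z, nx, ny)],
        mesh[flatIndex(x, y + 1, z, nx, ny)],
        mesh[flatIndex(x, y - 1, z, nx, ny)],
        mesh[flatIndex(x, y, z + 1, nx, ny)],
        mesh[flatIndex(x, y, z - 1, nx, ny)]};
}

// Discrete Laplacian with unit grid spacing (an assumed form of the factor).
inline float laplace(const Stencil7& s) {
    return s.left + s.right + s.top + s.bottom + s.front + s.back
           - 6.0f * s.center;
}
```

A call then looks like laplace(gather(mesh, x, y, z, nx, ny)). Every factor function reads the same seven values, so they only need to be fetched from the mesh once per position.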
  7. Hello,

     I'm working on a CUDA kernel and something interesting, I guess, crossed my mind. Maybe you could help me.

     Say I have these two kernels:

         __device__ float SimpleKernel1(float value1, float value2, float value3 ... float valueN)
         {
             return value1 + value2 + value3 ... + valueN;
         }

         __device__ float SimpleKernel2(float *values)
         {
             return values[0] + values[1] + values[2] ... + values[N-1];
         }

     Would SimpleKernel2 run faster? I know there are lots of factors in play (e.g. memory interface, clock, number of threads) but, thinking as generically as possible, a call to SimpleKernel1 sends sizeof(float)*N bytes while a call to SimpleKernel2 only sends sizeof(float *) bytes, so maybe this would result in great speedups. Is this right or wrong? Does it really matter?

     Thanks!
  8. Ok, so I was able to solve the issue myself by searching the AMD forums.

     Here's the solution in case anyone else stumbles upon this:

     My program was linking to OpenCL from "/opt/AMDAPP/lib/x86_64/". I tried linking to "usr/lib/fglrx/" instead and it worked. Now I can build the context and access OpenGL VBOs.
  9. Hello,

     I'm trying to establish interop between CL and GL. The tutorials are very straightforward and everything looks to be set up correctly; however, when I try to get a cl_mem from a VBO index, the error CL_INVALID_GL_OBJECT appears and I don't know what I'm doing wrong.

         #include <SDL/SDL.h>
         #include <GL/glew.h>
         #include <GL/glx.h>
         #include <stdio.h>

         #ifdef __APPLE__
         #include <OpenCL/opencl.h>
         #else
         #include <CL/cl.h>
         #include <CL/cl_gl.h>
         #endif

         //#define FULLSCREEN

         int main(int argc, char *argv[])
         {
             //******************
             //Setting up window.
             //******************
             int videoFlags;
             SDL_VideoInfo *videoInfo;

             SDL_Init(SDL_INIT_VIDEO);
             videoInfo = (SDL_VideoInfo *)SDL_GetVideoInfo();
             if(!videoInfo)
             {
                 printf("Error: SDL failed to determine video info.");
                 return false;
             }

             videoFlags = SDL_OPENGL | SDL_GL_DOUBLEBUFFER | SDL_HWSURFACE;
         #ifdef FULLSCREEN
             videoFlags += SDL_FULLSCREEN;
         #endif
             SDL_SetVideoMode(videoInfo->current_w, videoInfo->current_h, videoInfo->vfmt->BitsPerPixel, videoFlags);

             //************************
             //Setting up openGL state.
             //************************
             glewInit();

             //******************************************************
             //Setting up openCL state and share context with openGL.
             //******************************************************
             cl_int state;
             cl_platform_id platform[100];
             cl_context context;
             cl_device_id devices[100];

             //Get platform.
             cl_uint numberOfPlatforms = 0;
             if(clGetPlatformIDs(100, platform, &numberOfPlatforms) != CL_SUCCESS)
             {
                 printf("OpenCL Error: Platform couldn't be found.\n");
                 return 0;
             }
             else
             {
                 printf("%i platforms found.\n", numberOfPlatforms);
             }

             //Parameters needed to bind OpenGL's context to OpenCL's.
             cl_context_properties properties[] = {
                 CL_GL_CONTEXT_KHR, (cl_context_properties) glXGetCurrentContext(),
                 CL_GLX_DISPLAY_KHR, (cl_context_properties) glXGetCurrentDisplay(),
                 CL_CONTEXT_PLATFORM, (cl_context_properties) platform[0],
                 0};

             //Find openGL devices.
             typedef CL_API_ENTRY cl_int (CL_API_CALL *CLpointer)(const cl_context_properties *properties, cl_gl_context_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret);
             CL_API_ENTRY cl_int (CL_API_CALL *myCLGetGLContextInfoKHR)(const cl_context_properties *properties, cl_gl_context_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret) =
                 (CLpointer)clGetExtensionFunctionAddressForPlatform(platform[0], "clGetGLContextInfoKHR");

             size_t size;
             state = myCLGetGLContextInfoKHR(properties, CL_DEVICES_FOR_GL_CONTEXT_KHR, 100*sizeof(cl_device_id), devices, &size);
             if(state != CL_SUCCESS)
             {
                 printf("OpenCL Error: Devices couldn't be resolved.\n");
                 return 0;
             }
             else
             {
                 printf("%i devices support OpenGL/OpenCL interoperability.\n", (int)(size/sizeof(cl_device_id)));
             }

             //Create context.
             context = clCreateContext(properties, 1, &devices[0], NULL, NULL, &state);
             if(state != CL_SUCCESS)
             {
                 printf("OpenCL Error: Context couldn't be created.\n");
                 return 0;
             }
             else
             {
                 printf("Context successfully created.\n");
             }

             //Create VBO and grant read and write access to openCL.
             GLuint vbo;
             glGenBuffers(1, &vbo);
             glBindBuffer(GL_ARRAY_BUFFER, vbo);
             glBufferData(GL_ARRAY_BUFFER, 400, NULL, GL_STATIC_DRAW);

             cl_mem mem = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, vbo, &state);
             if(state != CL_SUCCESS)
             {
                 printf("Error creating memory object from VBO! (CL_INVALID_GL_OBJECT = -60) and (state = %i)\n", state);
             }
             else
             {
                 printf("Memory object created successfully from VBO!\n");
             }

             printf("GLCL finished. Bye!");
             return 0;
         }

     Has anyone ever encountered this issue and dealt with it successfully? I'm searching the AMD forums for a solution, but so far I have found nothing useful.

     I'm running this program on Ubuntu and the programming language is C++. I've managed to get an OpenCL program working; the problem is getting OpenCL to access OpenGL objects. Also, I'm using SDL to construct the OpenGL window.

     Thanks and best regards!
  10. Basic question

    It's not the easiest way; however, using a radix tree gives you a great deal of latitude in handling string searches (e.g. finding the position of each word, how many times each word appears, how many words have a given prefix, etc.).

    If you are managing very large documents with thousands of words, radix trees will greatly reduce the time your string search operations take compared to vectors.

    I don't know if there's a radix tree library for C, but it's worth taking a look at. It's hard to understand but very easy to implement.
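To make the idea concrete, here is a minimal C++ sketch of a plain, uncompressed trie, which is the simpler relative of a radix tree (a radix tree additionally merges chains of single-child nodes). It answers two of the queries mentioned above: how many times a word appears and how many words share a prefix. The class and member names are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Minimal uncompressed trie over chars; not a full radix tree.
class WordTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        int passCount = 0;  // inserted words whose path goes through this node
        int wordCount = 0;  // inserted words that end exactly at this node
    };

    Node root_;

    // Walk the trie along key; returns nullptr if the path does not exist.
    const Node* find(const std::string& key) const {
        const Node* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

public:
    void insert(const std::string& word) {
        assert(!word.empty());  // empty words are not meaningful here
        Node* node = &root_;
        ++node->passCount;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
            ++node->passCount;
        }
        ++node->wordCount;
    }

    // How many times this exact word was inserted.
    int countWord(const std::string& word) const {
        const Node* node = find(word);
        return node ? node->wordCount : 0;
    }

    // How many inserted words start with this prefix (duplicates included).
    int countPrefix(const std::string& prefix) const {
        const Node* node = find(prefix);
        return node ? node->passCount : 0;
    }
};
```

Both lookups take time proportional to the length of the key, not to the number of words stored, which is where the advantage over scanning a vector comes from.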
  11. Parallel programming question.

      Ok, I understand now.   Actually I wanted processes to have write access to memory space but I realised data can end up being corrupted even in this case if there's no synchronization.   Thanks for the help, Frob, Tribad!
  12. recv() questions

    It doesn't have anything to do with your question, but since you're practicing, you might want to use memcpy() instead of the for loop, as it may be faster depending on the amount of data you're copying.

        // copy to big buffer
        memcpy(buffer + total, inbuffer, sizeof(char) * result);

    Even better, you could try:

        total += recv(hSocket, buffer + total, bufferSize - total, 0);

    which removes the need to copy from inbuffer to buffer at each iteration in the first place. Just check the return value before adding it to total, since recv() returns 0 when the peer closes the connection and a negative value on error.
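A minimal sketch of that second approach as a complete loop, assuming POSIX sockets (on Windows the socket type and error constant differ). The function name receiveAll is made up for illustration, and interrupted calls (EINTR) are simply reported as errors here.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Receive straight into the big buffer until it is full or the peer closes
// the connection. Returns the number of bytes stored, or -1 on a recv() error.
inline long receiveAll(int socketFd, char* buffer, std::size_t bufferSize) {
    assert(buffer != nullptr);
    std::size_t total = 0;
    while (total < bufferSize) {
        ssize_t result = recv(socketFd, buffer + total, bufferSize - total, 0);
        if (result < 0) return -1;  // error: never add a negative value to total
        if (result == 0) break;     // orderly shutdown by the peer
        total += static_cast<std::size_t>(result);
    }
    return static_cast<long>(total);
}
```

Passing bufferSize - total as the length keeps recv() from writing past the end of the buffer, so no intermediate inbuffer or memcpy() is needed.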
  13. Hello,

      My question is related to OpenCL, but I guess it could apply to any parallel API (CUDA, MPI, pthreads...).

      I have a kernel that's executed thousands of times, and each parallel kernel instance holds a reference to a single memory space called "__global char *comp". Before executing the kernels, the host initializes this "comp" memory space to "true" and, during execution, a kernel only accesses "comp" if it needs to set its value to "false".

      Do I have to worry about memory access synchronization in this particular case? I'm thinking it wouldn't matter if two kernels accessed the memory space concurrently, as the only possible outcome would be for it to have a "false" value assigned.
  14. stack vs heap

      OO not only makes it much easier to represent complex systems coherently and understandably in a programming environment, it also makes code maintenance a trivial task. What if you're months into development and decide you need to change something you did at the very beginning? How are you going to find that portion of code, and how will you make sure the change doesn't affect the rest of the code?

      Also, OO is essential to any kind of collaborative programming because of how it modularizes code, allowing each programmer in a group to focus on developing their part of the project without having to worry about the rest of the code. There can't be a better example of how important OO is than game programming, as it's a many-person project, unless you're making Pong.
  15. stack vs heap

    You can test it yourself: start by declaring a small array and keep increasing its size.

        float hum[100000];
        float hum[1000000];
        float hum[10000000];

    Eventually you'll get a "segmentation fault" error.

    I can't guarantee this, since I could have misconfigured my compiler, or my OS (Ubuntu) may handle memory management differently, so I'll leave it to more experienced programmers to elaborate. I had this issue once and it was very hard to figure out.

    Let me know what you find.
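For contrast, here is a minimal C++ sketch of the same 10,000,000-element array placed on the heap instead. The likely explanation for the crash is that the largest declaration above needs about 40MB, while the default stack limit on Linux is typically around 8MB (an assumption about the setup, not something verified for that machine). The function name sumOnHeap is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill and sum n floats stored on the heap. std::vector keeps only a small
// handle on the stack, so n is limited by available memory, not stack size.
inline double sumOnHeap(std::size_t n) {
    std::vector<float> hum(n, 1.0f);  // throws std::bad_alloc if allocation fails
    assert(hum.size() == n);
    double sum = 0.0;
    for (float value : hum) sum += value;
    return sum;
}
```

Calling sumOnHeap(10000000) allocates the full array without touching the stack limit, and a failed allocation shows up as an exception rather than a segmentation fault.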