
JoeJ

Members
  • Content count

    584
  • Joined

  • Last visited

Community Reputation

2540 Excellent

About JoeJ

  • Rank
    Advanced Member
  1. You can reproject the depth buffer from the previous frame. (There should be an article here on gamedev.net. Quite a lot of games use GPU occlusion queries - IIRC CryEngine uses them and there should be a paper.) You can also render occluders for the current frame first and do frustum and occlusion culling on the GPU to avoid a read back. Personally I implemented a software approach using low poly occluders: Put occluders, batches of world geometry, character bounding boxes etc. in an octree and render coarsely front to back. Raster the occluders to a framebuffer made of span lists (so no heavy per pixel processing). Test each geometry bounding box against the framebuffer and append it to the drawlist if it passes - a sketch of this traversal follows below. Advantage: If you are in a room, not only geometry but also occluders behind walls are rejected quickly, because the octree bounding box already fails the visibility test and the whole branch gets terminated. Very work efficient. It can cover dynamic scenes or open / closed doors. Disadvantage: Because the whole system relies on early termination, parallelization makes no sense. Unfortunately I don't know in which situations my method can beat others because I did no comparisons, but you can think about it. I also remember a paper or article about how an older version of Umbra worked, but can't give any link. They managed to use something like a low resolution framebuffer by rastering portals instead of occluders to ensure correctness. The difficulty is extracting those portals as a preprocess... IIRC. For line of sight tests you should use simple raytracing. Full blown visibility determination is demanding even for a single camera and no good option for NPC vs. player, even in first person. I don't think any recent game still uses the same system for collision detection and visibility - Quake was really a special case here.
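     A minimal sketch of that traversal, just to make the structure concrete. All types here (OcclusionBuffer, OctreeNode etc.) are hypothetical stand-ins, not code from my actual implementation - the real framebuffer is span lists per scanline, not pixels:

         #include <vector>

         struct AABB { float min[3], max[3]; };

         struct OcclusionBuffer               // stand-in for the span list framebuffer
         {
             bool TestBBox (const AABB&) const;            // conservative visibility test
             void RasterOccluder (const std::vector<float>& occluderTris);
         };

         struct OctreeNode
         {
             AABB bounds;
             std::vector<std::vector<float>> occluders;    // low poly occluder triangles
             std::vector<AABB> batchBounds;                // world geometry batches
             std::vector<int> batchIds;
             OctreeNode* children[8] = {};                 // assumed pre-sorted front to back
         };

         // Front to back traversal with early termination: if the node bounds are
         // already occluded, the geometry AND the occluders behind a wall are
         // rejected with a single test and the whole branch is skipped.
         void Cull (const OctreeNode& node, OcclusionBuffer& fb, std::vector<int>& drawList)
         {
             if (!fb.TestBBox (node.bounds)) return;       // whole branch terminated

             for (auto& occ : node.occluders)              // raster occluders first so
                 fb.RasterOccluder (occ);                  // they hide geometry visited later

             for (size_t i = 0; i < node.batchIds.size(); i++)
                 if (fb.TestBBox (node.batchBounds[i]))
                     drawList.push_back (node.batchIds[i]);

             for (OctreeNode* child : node.children)
                 if (child) Cull (*child, fb, drawList);
         }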
  2. Occlusion culling or visibility determination might be good terms to search for. Hardware occlusion queries on the GPU have existed since the old times; what is newer: you can render low poly occluders to a low resolution z-buffer, build a mip-map pyramid from it and test bounding boxes against that (sketched below). This approach can work for dynamic stuff, and of course you can do it in software on the CPU as well. Probably artists need to make the occluders by hand.
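     A rough CPU-side sketch of the pyramid test, assuming each mip level stores the MAX depth of the 2x2 texels below it (so the test stays conservative) and smaller depth means nearer; all names are placeholders:

         #include <vector>
         #include <algorithm>
         #include <cmath>

         struct DepthPyramid
         {
             std::vector<std::vector<float>> level;  // level[0] = occluder z-buffer
             std::vector<int> width, height;         // dimensions per level
         };

         // minZ = nearest depth of the object's projected bounding box.
         // The screen rect x0..x1, y0..y1 is assumed clamped to level 0.
         bool MayBeVisible (const DepthPyramid& p, int x0, int y0, int x1, int y1, float minZ)
         {
             // Pick the mip level where the rect covers at most ~2x2 texels.
             int size = std::max (x1 - x0, y1 - y0);
             int mip = (int)std::ceil (std::log2 ((float)std::max (size, 1)));
             mip = std::min (mip, (int)p.level.size() - 1);

             for (int y = y0 >> mip; y <= (y1 >> mip); y++)
                 for (int x = x0 >> mip; x <= (x1 >> mip); x++)
                     if (minZ < p.level[mip][y * p.width[mip] + x])
                         return true;   // may be in front of the farthest occluder here
             return false;              // behind all occluders on every texel -> culled
         }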
  3. OpenGL

    Here are some options for storing directional ambient light; any of them can be called a 'probe':
     • One color for each side of a cube, so 6 directions, and you interpolate the 3 values fitting a given normal. (Known as Valve's ambient cube, used in HL2 - see the sketch after this list.)
     • Spherical harmonics. The number of bands you use defines the detail; 2 or 3 bands (12 / 27 floats) is enough for ambient diffuse.
     • Cube maps. Enough detail for reflections (used a lot for IBL / PBR today). Lowest LOD == ambient cube.
     • Dual paraboloid maps (or two sphere maps). Same as cube maps, but needs only 2 textures instead of 6.
    Independent of the data format you choose, there remains the question of how to apply it to a given sample position. Some options:
     • Bounding volumes set manually by an artist, e.g. a sphere or box with a soft border: you find all affecting volumes per sample and accumulate their contributions.
     • Uniform grid / multiresolution grids: you interpolate the 8 closest probes. E.g. UE4's light cache.
     • Voronoi tetrahedralization: you interpolate the closest 4 probes (similar to interpolating 3 triangle vertices by barycentric coords in the 2D case). AFAIK Unity uses this.
    The automatic methods often require manual tuning too, e.g. a cut off plane to prevent light leaking through a wall into a neighbouring room. Notice that the directionless ambient light used in Quake does not support bump mapping, and the 'get ambient from the floor' trick works well only for objects near the ground. There are probably hundreds of papers covering the details.
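    For illustration, evaluating the ambient cube for a given normal could look like this (a sketch; the face order +X,-X,+Y,-Y,+Z,-Z is my assumption). The squared normal components sum to 1 for a unit normal, so the three picked sides blend with properly normalized weights without an explicit divide:

        struct float3 { float x, y, z; };

        // cube[6]: one color per axis direction, assumed order +X,-X,+Y,-Y,+Z,-Z.
        // n must be normalized.
        float3 SampleAmbientCube (const float3 cube[6], const float3& n)
        {
            float wx = n.x * n.x, wy = n.y * n.y, wz = n.z * n.z;   // weights sum to 1
            const float3& cx = cube[n.x >= 0.0f ? 0 : 1];           // pick facing sides
            const float3& cy = cube[n.y >= 0.0f ? 2 : 3];
            const float3& cz = cube[n.z >= 0.0f ? 4 : 5];
            return { wx * cx.x + wy * cy.x + wz * cz.x,
                     wx * cx.y + wy * cy.y + wz * cz.y,
                     wx * cx.z + wy * cy.z + wz * cz.z };
        }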
  4. Maybe these: https://github.com/derkreature/IBLBaker https://github.com/dariomanesku/cmftStudio
  5. Oh sorry, yes. I confused this with the dstStageMask from vkCmdPipelineBarrier().
  6. Copy that - it's good to compare your own guessing against the guessing of others. (I spot you don't cover the case of a compute shader writing an indirect dispatch count.) I wonder if events could help here: http://vulkan-spec-chunked.ahcox.com/ch06s03.html I have not used them yet. Could I do something like triggering a memory barrier, processing some other work, and then waiting on the barrier with a high chance it has already completed? Something like the sketch below. I really need a 'Vulkan for dummies' that tells me some use cases for such things...
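     The split barrier pattern I mean would look roughly like this (untested sketch; the event, buffer handle and the Record* helpers are placeholders, and the stage/access masks assume compute-to-compute):

         // Signal after the producer, record independent work, wait as late as possible.
         VkBufferMemoryBarrier barrier = {};
         barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
         barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
         barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
         barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
         barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
         barrier.buffer = producedBuffer;                       // placeholder handle
         barrier.offset = 0;
         barrier.size = VK_WHOLE_SIZE;

         RecordProducerDispatch (commandBuffer);                // writes producedBuffer

         vkCmdSetEvent (commandBuffer, event,                   // signal: producer done
                        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

         RecordIndependentWork (commandBuffer);                 // may overlap the producer

         vkCmdWaitEvents (commandBuffer, 1, &event,
                          VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // srcStageMask
                          VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // dstStageMask
                          0, NULL, 1, &barrier, 0, NULL);

         RecordConsumerDispatch (commandBuffer);                // reads producedBuffer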
  7. Ok, so finally, and as expected, it makes no difference with these options: use barriers on unique buffers per task, or use barriers on non-overlapping memory regions per task but the same buffer for all. The driver could figure out that it can still run things async on 1 queue in both cases, but it does not - just like the specs say. I hope I've set up everything correctly (still unsure about the difference between VK_ACCESS_MEMORY_WRITE_BIT and VK_ACCESS_SHADER_WRITE_BIT, but this did not matter). So the conclusion is: we have to use multiple queues to keep busy across pipeline barriers, and we should reduce sync between queues to a minimum. A bit more challenging than initially thought, and I hope 2 saturating tasks in 2 queues don't slow each other down too much. If so, we need more sync to prevent this and it becomes a hardware dependent act of balancing. But I'm optimistic and it all makes sense now.
  8. BufferMemoryBarriers - here's the code. I left the commented-out variants in to illustrate how much the specs leave us to trial and error - or would you have guessed you need to set VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT for an indirect compute dispatch? (Of course I could remove it here, as I'm only writing some prefix sum results and no dispatch count, but offset and size become interesting now...)

     void MemoryBarriers (VkCommandBuffer commandBuffer, int *bufferList, const int numBarriers)
     {
         int const maxBarriers = 16;
         assert (numBarriers <= maxBarriers);

         VkBufferMemoryBarrier bufferMemoryBarriers[maxBarriers] = {};
         //VkMemoryBarrier memoryBarriers[maxBarriers] = {};

         for (int i=0; i<numBarriers; i++)
         {
             bufferMemoryBarriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
             //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
             //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
             bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
             bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
             //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
             //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
             bufferMemoryBarriers[i].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
             bufferMemoryBarriers[i].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
             bufferMemoryBarriers[i].buffer = buffers[bufferList[i]].deviceBuffer;
             bufferMemoryBarriers[i].offset = 0;
             bufferMemoryBarriers[i].size = VK_WHOLE_SIZE;

             //memoryBarriers[i].sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
             //memoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;// | VK_ACCESS_SHADER_WRITE_BIT;
             //memoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;// | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
         }

         vkCmdPipelineBarrier (
             commandBuffer,
             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
             0,                                  // VkDependencyFlags
             0, NULL,                            //numBarriers, memoryBarriers,
             numBarriers, bufferMemoryBarriers,
             0, NULL);
     }

     void Record (VkCommandBuffer commandBuffer, const uint32_t taskFlags,
         int profilerStartID, int profilerStopID,
         bool profilePerTask = true, bool use_barriers = true)
     {
         VkCommandBufferBeginInfo commandBufferBeginInfo = {};
         commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
         commandBufferBeginInfo.flags = 0;//VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
         vkBeginCommandBuffer (commandBuffer, &commandBufferBeginInfo);

     #ifdef USE_GPU_PROFILER
         if (profilerStartID >= 0) profiler.Start (profilerStartID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif

         if (taskFlags & (1<<tTEST0))
         {
             vkCmdBindDescriptorSets (commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST0], 0, 1, &descriptorSets[tTEST0], 0, nullptr);
             vkCmdBindPipeline (commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST0]]);
     #ifdef PROFILE_TASKS
             if (profilePerTask) profiler.Start (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
             int barrierBuffers[] = {bTEST0};
             for (int i=0; i<TASK_COUNT_0; i++)
             {
                 vkCmdDispatchIndirect (commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (0 + i));
                 if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
             }
     #ifdef PROFILE_TASKS
             if (profilePerTask) profiler.Stop (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
         }

         if (taskFlags & (1<<tTEST1))
         {
             vkCmdBindDescriptorSets (commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST1], 0, 1, &descriptorSets[tTEST1], 0, nullptr);
             vkCmdBindPipeline (commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST1]]);
     #ifdef PROFILE_TASKS
             if (profilePerTask) profiler.Start (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
             int barrierBuffers[] = {bTEST1};
             for (int i=0; i<TASK_COUNT_1; i++)
             {
                 vkCmdDispatchIndirect (commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (200 + i));
                 if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
             }
     #ifdef PROFILE_TASKS
             if (profilePerTask) profiler.Stop (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
         }

         if (taskFlags & (1<<tTEST2))
         {
             vkCmdBindDescriptorSets (commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST2], 0, 1, &descriptorSets[tTEST2], 0, nullptr);
             vkCmdBindPipeline (commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST2]]);
     #ifdef PROFILE_TASKS
             if (profilePerTask) profiler.Start (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
             int barrierBuffers[] = {bTEST2};
             for (int i=0; i<TASK_COUNT_2; i++)
             {
                 vkCmdDispatchIndirect (commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (400 + i));
                 if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
             }
     #ifdef PROFILE_TASKS
             if (profilePerTask) profiler.Stop (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
         }

     #ifdef USE_GPU_PROFILER
         if (profilerStopID >= 0) profiler.Stop (profilerStopID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
     #endif
         vkEndCommandBuffer (commandBuffer);
     }
  9. Some interesting points; I made this test now:
     shader 1: 10 dispatches of 50 wavefronts
     shader 2: 50 dispatches of 50 wavefronts
     shader 3: 50 dispatches of 50 wavefronts
     With a memory barrier after each dispatch and 1 queue: 0.46 ms
     With a memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms
     So we can use multiple queues to keep working while another queue is stalled. I'll modify my test to see if I could still use one queue for the same purpose, by setting the memory ranges within the same buffer per shader, or by using multiple buffers per shader... EDIT1: ...but first I tried giving the first shader 5 times more work than shaders 2 & 3. Previously all shaders did the same calculations, so I couldn't be sure a barrier on queue 1 does not stall queue 0 as well, because the barriers happen at the same time. Now I see shader 1 still completes first and is slightly faster than the other two, so it is not affected by their barriers. Runtime with 3 queues: 0.18 ms, 1 queue: 0.44 ms (not the first time I see that doing more work is faster on small loads).
  10. I think using a plain enum and not an enum class is the better option for bitfields, sadly. But I'm curious what comes up here... (illustration below)
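      For illustration, the difference in day-to-day bitfield use (my own example, not from this thread):

          #include <cstdint>

          enum Flags { FlagA = 1 << 0, FlagB = 1 << 1 };            // plain enum
          enum class FlagsC : uint32_t { A = 1 << 0, B = 1 << 1 };  // enum class

          uint32_t a = FlagA | FlagB;                // implicit int conversion: just works
          uint32_t b = (uint32_t)FlagsC::A           // enum class needs casts everywhere,
                     | (uint32_t)FlagsC::B;          // or hand written operator| overloads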
  11. If the work is truly independent, you won't have any barriers and could use one queue just fine. Yes - using one queue is faster even if there are no memory barriers / semaphores. Submitting a command buffer has a noticeable cost, so putting all work into one command buffer on one queue is the fastest way. I also tested using queues from different families vs. all from one family, which had no effect on performance. All tests with only compute shaders. For now I don't see a need to use multiple queues other than for up / downloading data. Maybe using 2 queues makes sense if we want to do compute and graphics, but I guess 1 queue is better here too. Edit: Maybe using multiple queues results in dividing work strictly between CUs, while one queue can distribute multiple dispatches onto the same CU. If so, maybe we could avoid some cache thrashing by grouping work with similar memory access together. But I guess cases where this wins would be extremely rare.
  12. I tried to verify with my test... I have 3 shaders:
      shader 1: 10 dispatches of 50 wavefronts
      shader 2: 50 dispatches of 50 wavefronts
      shader 3: 50 dispatches of 50 wavefronts
      With a memory barrier after each dispatch I get 0.462 ms. Without: 0.012 ms (a speed up of 38.5). To verify, I used 1 dispatch of 5500 wavefronts (same work): 0.013 ms. So yes - not only is the GPU capable of doing async compute perfectly with a single queue, we also see the API overhead of multiple dispatches is zero. Finally I understand why memory barriers appeared so expensive to me. Shame on me, and all disappointment gone.
  13. Whooo! - I already thought the driver could figure out a dependency graph and do things async automatically, but I also thought this being reality would be wishful thinking. This is too good to be true, so I'm still not ready to believe it. (Actually I have too many barriers, but soon I'll be able to push more independent work to the queue, and I'm curious if I'll get a lot of it for free...) Awesome! Thanks, Hodgman
  14. Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engines, which receive draws / dispatches and hand them over to an internal fixed function scheduler. I would have nothing against the queue concept, if only it worked. You can look at the test project I have submitted to AMD: https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report/blob/master/async_test_project.rar ...if you are bored, but here is what I found:
      You can run 3 low work tasks without synchronization perfectly parallel - yeah, awesome. As soon as you add sync, which is only possible by using semaphores, the advantage gets lost due to bubbles. (Maybe semaphores sync with the CPU as well? If so we have a terrible situation here! We need GPU only sync between queues - see the sketch below.) And here comes the best: if you try larger workloads, e.g. 3 tasks with runtimes of 0.2 ms, 1 ms, 1 ms without async, then going async the first and second task run parallel as expected, although 1 ms becomes 2 ms, so there is no win. But the third task rises to 2 ms as well, even though it runs alone with nothing else - its runtime is doubled for nothing. It seems there is no dynamic work balancing happening here - it looks like the GPU gets divided somehow and refuses to merge back when possible.
      Guess not, the numbers don't match. A Fiji has 8 ACEs (if that's the correct name), but I see only 4 compute queues (1 gfx/CS + 3 CS). Nobody knows what happens under the hood, but it needs more work, at least on the drivers. Access to unique CUs should not be necessary, you're right guys. But I would be willing to tackle this if it were an improvement.
      There are two situations where async compute makes sense: 1. Doing compute while doing ALU light rendering work (not yet tried - all my hope goes into this, but not everyone has rendering work). 2. Parallelizing and synchronizing low work compute tasks - extremely important if we look towards more complex algorithms that reduce work instead of brute forcing everything. And sadly this fails yet.
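      For reference, a semaphore wait / signal pair at vkQueueSubmit should already be GPU only sync - the CPU just records the dependency and both submits return immediately. A minimal sketch of chaining two queues this way (producerCmdBuf, consumerCmdBuf, queueA, queueB and the computeDone semaphore are placeholder handles):

          // computeDone is a VkSemaphore created earlier with vkCreateSemaphore.
          VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

          VkSubmitInfo producerSubmit = {};
          producerSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
          producerSubmit.commandBufferCount = 1;
          producerSubmit.pCommandBuffers = &producerCmdBuf;
          producerSubmit.signalSemaphoreCount = 1;
          producerSubmit.pSignalSemaphores = &computeDone;   // signaled on the GPU

          VkSubmitInfo consumerSubmit = {};
          consumerSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
          consumerSubmit.waitSemaphoreCount = 1;
          consumerSubmit.pWaitSemaphores = &computeDone;     // waited on by the GPU
          consumerSubmit.pWaitDstStageMask = &waitStage;
          consumerSubmit.commandBufferCount = 1;
          consumerSubmit.pCommandBuffers = &consumerCmdBuf;

          // No fence, no CPU wait: the dependency is resolved entirely on the device.
          vkQueueSubmit (queueA, 1, &producerSubmit, VK_NULL_HANDLE);
          vkQueueSubmit (queueB, 1, &consumerSubmit, VK_NULL_HANDLE);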
  15. Probably. I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues makes sense (interested, if anybody else is). It also makes sense to use 1 graphics queue, 1 upload queue and 1 download queue on each GPU to communicate (although you don't have this option, because you have only one separate transfer queue). And it makes sense to use multiple compute queues on some hardware. I proved that GCN can perfectly overlap low work compute workloads, but the need to use multiple queues, and thus multiple command buffers with sync between them, destroyed the advantage in my case. Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage unique CUs at a much lower level. The hardware can do it, but we have no access - VK/DX12 is just a start...