HateWork

Members
  • Content count

    27

Community Reputation

171 Neutral

About HateWork

  • Rank
    Member

Personal Information

  • Interests
    Programming
  1. Vulkan Vulkan Resources

    I think it is very relevant to index the article from Matthew Wellings about "The new Vulkan Coordinate System": https://matthewwellings.com/blog/the-new-vulkan-coordinate-system/
  2. DX12 PSO Management in practice

    I've been there myself. I'm creating a library that is an abstraction layer for graphics APIs. So when I started the project I sat D3D11, D3D12 and Vulkan down at the same table (each API in its own chair, that is ;)) and asked them: "What are your common factors?" The first answer was much like what you are trying to do now: a D3D11-like pipeline state manager. I actually did it, BUT it was very inefficient. There were so many PSOs to be created internally and so many state changes to be tracked, and the lookup was adding latency. I worked with this model for some time, but as the project grew I had to call the APIs back and ask the same question again. For the new answer I had to do as others suggested: I redesigned my PSO common factor and went for a D3D12-like PSO manager. I literally had to create an "IPipelineStateObject" interface. This new interface is natural for D3D12 and Vulkan to implement and is very easy for D3D11 to "emulate". This resulted in a much more native and efficient implementation. Not only for PSOs but in general, if you try to find a common factor between all available APIs, you'll notice that the model leans towards a Vulkan-like design.
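A minimal sketch of what such an interface could look like, assuming nothing about the poster's actual library. Only the name "IPipelineStateObject" comes from the post; `PipelineStateDesc`, `Backend`, `CreatePipelineState` and the two implementing classes are hypothetical, and no real graphics API is called. `Bind()` simply reports how many native state-setting calls a backend would issue, to show why the D3D12/Vulkan side is a single call while a D3D11-style backend replays several.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, API-neutral description of a complete pipeline state.
struct PipelineStateDesc {
    std::string vertex_shader;
    std::string pixel_shader;
    bool blend_enabled = false;
    bool depth_test = false;
};

// The common factor: one immutable object per complete pipeline state.
class IPipelineStateObject {
public:
    virtual ~IPipelineStateObject() = default;
    // Makes the state current; returns the number of native calls it stands for.
    virtual int Bind() const = 0;
};

// Stand-in for a D3D12 or Vulkan backend: the whole state is one native object.
class NativePipelineState final : public IPipelineStateObject {
public:
    explicit NativePipelineState(PipelineStateDesc desc) : m_desc(std::move(desc)) {}
    int Bind() const override { return 1; }  // one "set pipeline state" call
private:
    PipelineStateDesc m_desc;
};

// Stand-in for a D3D11 backend: the PSO is emulated by replaying separate states.
class EmulatedPipelineState final : public IPipelineStateObject {
public:
    explicit EmulatedPipelineState(PipelineStateDesc desc) : m_desc(std::move(desc)) {}
    int Bind() const override {
        int calls = 2;  // blend state + depth-stencil state are always set
        if (!m_desc.vertex_shader.empty()) ++calls;
        if (!m_desc.pixel_shader.empty()) ++calls;
        return calls;
    }
private:
    PipelineStateDesc m_desc;
};

enum class Backend { D3D11, D3D12, Vulkan };

// Hypothetical factory: the caller never sees which implementation it gets.
std::unique_ptr<IPipelineStateObject> CreatePipelineState(Backend backend,
                                                          const PipelineStateDesc& desc) {
    assert(!desc.vertex_shader.empty());  // this sketch requires a vertex shader
    if (backend == Backend::D3D11) {
        return std::make_unique<EmulatedPipelineState>(desc);
    }
    return std::make_unique<NativePipelineState>(desc);
}
```

The design point is that user code only ever holds an `IPipelineStateObject`, so the per-state lookup and tracking described in the post disappears from the draw path.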
  3. Hello guys, I have a very simple D3D12 program that renders 2 textured quads and applies some saturation based on values passed to the shaders through constant buffers. The textures (SRVs) and the constant buffers (CBVs) share a single large descriptor heap: SRVs are stored in the first half of the heap and CBVs in the second half. This program works just fine, but then I extended it by adding more constant buffers to do some other work and noticed that all constant buffers created after the textures were created were getting corrupted in the descriptor heap and were not usable. The curious thing is that if I create the extra constant buffers before the textures then it works, but this is not a definitive solution because the code is part of a library and the users should be able to create resources in whatever order they want. It doesn't matter if I move the textures to the second half of the descriptor heap and the buffers to the first half, it always fails, hence I think it is something about corruption of the heap or bad alignment. I'm aware of the alignment requirements for constant buffers: just as shown in the SDK samples, I'm creating 64K sized buffers just to be sure and aligning them using "(data_size + 255) & ~255". But I have no clue how to align textures. Do you guys have a suggestion about this topic or about how to properly align resource views in a single heap? Thank you guys for your time. PS: This is what the graphics debugger reports, just in case:
  4. Hello guys,

    I know that in Vulkan a renderpass is responsible for shaping a frame's rendering flow. Whenever we change a renderpass we need to recreate framebuffers and pipeline state objects, re-record secondary command buffers, etc.

    Is there a method for not having to re-record secondary command buffers when changing a renderpass? It seems impractical to record a secondary command buffer (which is supposed to be a long-lived object) just to have it invalidated after a renderpass recreation, which forces us to re-record the same commands for the new renderpass.

    Thank you in advance.
  5. Well, after a couple of days I can confirm that there's no workaround for this, it simply can't be done in Vulkan.   I was trying to make Vulkan behave like D3D12, that's impossible. So I had to redesign my GFX interface and make D3D12 behave like Vulkan.   "Don't try to bend the spoon, that's impossible...instead try to bend yourself".
  6. cygan, wow, thank you very, very, very much for your time and your wise answer. I was pretty explicit in my post and you were able to understand my situation. I'm kind of shocked because I thought more people were going to jump onto this sinking boat (a trivial Vulkan topic that almost everyone must come across)...but even more shocked by the fact that I think I got to a dead end with Vulkan...and I was just starting.

    Actually I could try to create a new pipeline state object at every subpass with the corresponding subpass index...let's see if the validation layer complains. This is my last hope. I just wish Khronos, AMD and Nvidia could send their engineers to my home just like they do with big game studios ;)

    Yep, different shaders and textures. I guess I must try this as well. But the problem is that in my library I'm using a "Graphics Driver Interface". It is an interface that all graphics APIs must follow. So far I've implemented D3D11, D3D12 and part of OpenGL 4 successfully. So the user just draws using my library and the interface takes care of the internal stuff no matter the selected API. The problem is that Vulkan will not behave according to the general interface, thus dragging down my gfx driver interface model. This is just a personal drawback, but maybe with a Vulkan-only design it could work by merging shaders and parameterizing.

    In D3D12 this is nothing. I did this on day one; in fact it is designed to work in this flexible model. Vulkan, on the contrary, has to insist on the so-called "predictability", introduced as a feature when in fact it is a drawback to flexibility. I understand that Vulkan is an API that has to live for another 20 years like OpenGL, and this must include support for tiler devices and even devices that haven't been invented yet, but support for some devices will exclude full optimization for other kinds of devices. Some graphics chip manufacturers keep saying that a forward renderer could be optimized with render passes by letting the driver guess what you're doing and then do it for you in the background, while in my personal opinion I prefer the driver to do exactly what I say and when I say it. That's basic optimization of resources per se...and I thought Vulkan's slogan was "full control of resources". Enough rant guys hehe. I just have to cool down and take it easy. I might drop Vulkan support for my library, but then again I might get back in a few years to check if this is supported in forward renderers (past, current and modern). In the meanwhile take care guys and keep developing games and your mind.
  7. Hello,

    My question sounds simple: how do you execute pre-recorded command buffers in Vulkan?

    But we'll see that the answer is not as simple as "use vkQueueSubmit for primary command buffers or vkCmdExecuteCommands for secondary command buffers".

    I'm just starting with Vulkan and I'm creating a library for the user to issue drawing commands on a 2D canvas. The context here is that the user is able to create many command buffers. The key condition is that the user is able to execute the recorded command buffers in any order, and there might be pipeline state changes between the execution of each of these command buffers.

    So, let's start with the analysis:

    As mentioned in the VK specs, all drawing commands must live within a render pass instance. But all the work in the render pass must be done in only one command buffer, because we cannot end the command buffer without ending the render pass first. So this means that we cannot have a command buffer to begin the render pass, put our pre-recorded command buffers in the middle, have another command buffer to end the render pass and finally send these 3 (or more) command buffers to the queue using vkQueueSubmit. The only alternative left is to use secondary command buffers. Secondary command buffers allow us to execute command buffers within a primary command buffer. This sounds convenient because we already have a primary command buffer, the one that holds the render pass, and the secondary command buffers will be the ones recorded by the user. The logical thing to do then is:

      1. Begin the primary command buffer using vkBeginCommandBuffer.
      2. Put a memory barrier to set the framebuffer as render target (required for drawing) in the primary command buffer using vkCmdPipelineBarrier.
      3. Begin the render pass in the primary command buffer using vkCmdBeginRenderPass.
      4. Create the initial pipeline state object using vkCreateGraphicsPipelines.
      5. Bind the initial pipeline state object in the primary command buffer using vkCmdBindPipeline.
      6. Call vkCmdExecuteCommands to execute the pre-recorded secondary command buffers inside the primary command buffer.
      7. End the render pass in the primary command buffer using vkCmdEndRenderPass.
      8. Put a memory barrier to set the framebuffer back to its original state (required for presenting) in the primary command buffer using vkCmdPipelineBarrier.
      9. End the primary command buffer using vkEndCommandBuffer.
      10. Submit the primary command buffer to the graphics queue using vkQueueSubmit.
      11. Present the frame using vkQueuePresentKHR.

    Unfortunately this won't work, because when you begin a render pass in a command buffer, all the other commands in the command buffer between the vkCmdBeginRenderPass and vkCmdEndRenderPass calls must either be inlined (VK_SUBPASS_CONTENTS_INLINE), which allows commands such as vkCmdBindPipeline to be recorded directly in the primary command buffer, or be grouped into secondary command buffers (VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS). The problem is that in the example above we have both inlined commands and secondary commands, which is out of spec: if commands are inlined (first flag) then secondary command buffers cannot be executed, but if we specify the secondary flag then no inlined commands can be executed (all inlined commands are discarded except for vkCmdExecuteCommands, which is used for the execution of secondary command buffers).

    How do I deal with this?

    Remember that we cannot move the inlined commands to the secondary command buffers because these are pre-recorded by the user, so when a pipeline state change occurs between two secondary command buffers we cannot inject vkCmdBindPipeline at the beginning of the secondary buffer.

    The exclusion between inlined and secondary commands in a render pass is lifted when either vkCmdEndRenderPass or vkCmdNextSubpass is called. Paying attention to the second call gives me the idea to use subpasses. I've never used subpasses and do not know yet how they work, but if I think about it, I'll have to use the first subpass in the render pass to set up the initial pipeline state, then use another subpass to execute my secondary command buffers, then if a pipeline state change occurs use another subpass to update the pipeline state, and finally use yet another subpass to continue executing my secondary command buffers.

    Is this the way pre-recorded command buffers are supposed to be executed in Vulkan?

    All the examples I've come across record a single primary command buffer every frame and do not expose this situation. This is a trivial subject, but Vulkan has made it all different with the introduction of render passes.

    Thank you guys for your time. Any help for all the Vulkan beginners will be much appreciated by all of us.
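The inline versus secondary exclusion described in this post can be illustrated with a small self-contained model. This is only a sketch of the rule, not Vulkan code: `SubpassContents`, `Command`, `IsAllowedInSubpass` and `FirstInvalidCommand` are hypothetical names standing in for VK_SUBPASS_CONTENTS_INLINE, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS and the commands recorded between beginning a subpass and the next vkCmdNextSubpass or vkCmdEndRenderPass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models the two subpass contents modes discussed in the post.
enum class SubpassContents { Inline, SecondaryCommandBuffers };

// A few representative commands recorded in the primary command buffer.
enum class Command { BindPipeline, Draw, ExecuteCommands };

// Inline subpasses accept everything except executing secondary buffers;
// secondary subpasses accept only the execution of secondary buffers.
bool IsAllowedInSubpass(SubpassContents contents, Command cmd) {
    const bool is_execute = (cmd == Command::ExecuteCommands);
    return (contents == SubpassContents::Inline) ? !is_execute : is_execute;
}

// Returns the index of the first command that breaks the rule, or -1 if none does.
int FirstInvalidCommand(SubpassContents contents, const std::vector<Command>& cmds) {
    for (std::size_t i = 0; i < cmds.size(); ++i) {
        if (!IsAllowedInSubpass(contents, cmds[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Feeding it the sequence from the step list (bind a pipeline, then execute secondary buffers) fails in both modes, at index 1 for the inline mode and at index 0 for the secondary mode, which is the conflict the post describes.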
  8. Ok, so I'm back here to report my progress. I finished the implementation of my "command serializer" concept and it ended up pretty damn good! I did some basic testing and here are the results:

    NVIDIA GTX 750 Ti (v-sync off):
      [Default MS Implementation] 3317 fps average, 30.6 MiB (RAM)
      [Command Serializer] 3616 fps average with spikes up to 3850 fps, 30.2 MiB (RAM)

    I ran the tests many times and the results were the same. The "Default MS Implementation" means that commands that reference the backbuffer are recorded every frame in a command list dedicated to this purpose, and my normal commands are recorded once in their own command lists. The serializer method needs more testing under different scenarios to see how it behaves, but so far it has been doing well for command lists that are prerecorded once. It works perfectly for every type of command: I can reference any backbuffer at any moment and mix those commands with normal commands. What's next? I want to code two more solutions and publish the results: the "intermediate RTV" and also the more common "one command list per backbuffer". I thought the latter would use too much memory because I assumed vertex buffer data and other resource data were cached by command lists, but I now think they aren't. This should make it the preferred solution because it would be standard, lightweight and faster. Let's wait for the results.
  9. Hello guys, thank you for your interest in this topic.

    To begin with, I must say that if any of you don't understand the problem, it is very easy to reproduce. Simply grab the most basic example in the D3D12 SDK, the "HelloWindow" example, and move line 162 (a call to the function that populates the command list) to line 151 (at the end of the function that loads assets). What you're doing here is recording the command list once at initialization and then executing it every frame. If you compile and execute the program it is going to run and clear the first frame correctly, but then in the next frame the command list will reference the previous backbuffer and it will crash. I've attached to this reply a ZIP file with the C++ source file and the compiled program, try it.

    Now I'll answer some fragments of this topic:

    I'm not resetting my lists because I have too many commands, and by using the prerecording model I'm saving CPU time. This may not gain performance on the GPU side as you say, but it will compensate when doing heavy work on the CPU.

    Bundles behave no differently from direct command lists regarding the backbuffer index issue. The problem persists even with bundles.

    This is a good idea. Create an "ID3D12Resource" and a handle to it, use it as a render target for all my commands and then copy the whole region to the current backbuffer. It sounds great. Sure, it will require memory for the frame buffer, but it's a routine worth the sacrifice (and not that much memory anyway, depending on the resolution, 4k omg). Entire frame buffer copies are expensive but again depend on resolution. I wonder how the performance will be affected and how it will scale with resolution. I'll have to elaborate more on the subject once I've made an implementation for it. Thanks for the advice, I'll have to try this.

    I also thought about creating a command list for each backbuffer, but that would be 100+ commands per list for each buffer. This would completely solve the execution problem and it would allow me to write directly to the backbuffer, but it would increase memory usage by a lot (seriously, I'm precaching too many commands across many lists). To counter the memory usage I was thinking about branching my command lists using linked lists. The structure used for the linked list can specify whether my command lists are of the "normal" type or the "backbuffer reference" type. The normal type would only use one command list and the other type would use FRAME_COUNT command lists (which can be optimized by creating them as bundles). This way, when composing the final array of command lists that are going to be submitted to the command queue, I can create an arbitrarily long branch of mixed normal and backbuffer reference types. This is my concept:

        struct CommandLink
        {
            uint8_t type = 0; // 0 = normal (use m_command_list[0]), 1 = backbuffer reference (use m_command_list[0 to FRAME_COUNT - 1]).
            ComPtr<ID3D12GraphicsCommandList> m_command_list[FRAME_COUNT];
            CommandLink* next = nullptr;

            // Note that this structure can be extended or optimized using unions.
        };

    And this can be an example branch:

        1 - Normal [0]
            |
        2 - Backbuffer reference [0-(FRAME_COUNT - 1)]
            |
        3 - Normal [0]
            |
        4 - Normal [0] (I can do this, but two normal types can be merged together for better performance)
            |
        5 - Backbuffer reference [0-(FRAME_COUNT - 1)]
            |
        6 - Normal [0]
            |
        7 - Backbuffer reference [0-(FRAME_COUNT - 1)]

    EDIT: Actually this is more like serializing command lists rather than branching them. Also, this can be done with arrays instead of linked lists.

    This could sound like an overthought concept, but I'm guessing that it will have low memory usage and good performance compared to the intermediate RTV solution. I'll also have to code something like this to see how it goes.

    Well, this has gone on long enough. I'll try to post my results for the 2 solutions but I'll need some time. Also, this has somehow turned into something fun for me. I'm really liking D3D12 a lot; it is flexible enough to let you do anything you want, even crash your program on purpose.

    Cheers guys, take care.
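As a rough illustration of the serializing idea, here is a sketch that uses an array instead of a linked list, as the EDIT note suggests. It is not the poster's implementation: `CommandListId` is a plain integer standing in for an ID3D12GraphicsCommandList pointer, `CommandEntry` and `BuildSubmission` are hypothetical names, and FRAME_COUNT is assumed to be 3 (triple buffering). The output vector plays the role of the array that would be handed to the command queue for one frame.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t FRAME_COUNT = 3;  // assumed: triple buffering

// Placeholder for a command list pointer; an integer id keeps the sketch API-free.
using CommandListId = int;

struct CommandEntry {
    bool per_backbuffer;                             // false = "normal", true = "backbuffer reference"
    std::array<CommandListId, FRAME_COUNT> lists;    // normal entries only use lists[0]
};

// Flattens the chain into the list of command lists to submit for one frame.
std::vector<CommandListId> BuildSubmission(const std::vector<CommandEntry>& chain,
                                           std::size_t frame_index) {
    assert(frame_index < FRAME_COUNT);
    std::vector<CommandListId> submission;
    submission.reserve(chain.size());
    for (const CommandEntry& entry : chain) {
        // Backbuffer-reference entries pick the variant recorded for this frame's backbuffer.
        submission.push_back(entry.per_backbuffer ? entry.lists[frame_index] : entry.lists[0]);
    }
    return submission;
}
```

Only the entries that touch the backbuffer pay for FRAME_COUNT recordings; everything else is recorded once and reused by every frame.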
  10. Hello guys,

    I'm coding a simple D3D12 program and have many command lists with hundreds of prerecorded commands (commands are recorded once at initialization and never reset again). The problem is that commands that reference the backbuffer cannot be prerecorded, because I'm using triple buffering and when a command recorded for the current backbuffer is executed on the next frame, the program hangs. For example, I can't do something like this (I can record it but can't execute it without hanging):

        m_command_list->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_render_targets[m_frame_index].Get(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET));
        m_command_list->OMSetRenderTargets(1, &m_rtv_handle[m_frame_index], false, nullptr);
        m_command_list->ClearRenderTargetView(m_rtv_handle[m_frame_index], clearColor, 0, nullptr);
        m_command_list->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_render_targets[m_frame_index].Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT));

    This totally breaks my prerecording model. One option would be to get the handle to the current backbuffer and record a separate command list every frame. Another option would be to prerecord a set of command lists (one for each backbuffer) with the commands in the example above and execute the corresponding one before or after my other prerecorded command lists (the ones with draw submissions), but what if I'd like to set a resource barrier or clear the backbuffer in the middle of my command lists? (It makes no sense to clear the backbuffer in the middle of a frame, but it's just an example.) In D3D11 it was easy to do this with deferred contexts when creating the swap chain with the swap effect set to DXGI_SWAP_EFFECT_DISCARD, because the current writeable backbuffer was only accessible through index 0. In D3D12 I cannot even set the backbuffer count below 2, no matter what swap effect I'm creating the swap chain with. Do you guys have a programming model to overcome this?
  11. S3 MeTaL SDK

    Hi, I have an old Savage4 gfx card and wanted to include support for the S3 MeTaL API. I've searched for the SDK from land to sea and haven't found anything related to it. Is there any document, help or starting site? I wonder how all those old engines got started with MeTaL. I already added support for Glide (the SDK and drivers still exist). Glide was much more popular than MeTaL, but there has to be something out there on the net. Thanks in advance if someone can point me in the right direction.
  12. I'm back here to post my results. Basically what I did: I created my own Z-test/Z-buffer that works on regions instead of pixels, and it runs on older hardware and older DX versions. It runs amazingly in 2D scenarios and I also added 3D support (although I haven't tested it yet). I used the steps that I explained above, with special effort put into grouping objects with similar attributes. RESULTS: It works great!!! Even better than the hardware z-test performed by the GPU. I'm back to my 235/248 fps, now with z-depth implemented. I have no performance loss and only gain FPS when an object is totally occluding other objects (about 2-5 fps per object though). Again, I put special effort into grouping objects for batching, which allows fewer texture and index buffer changes. My graphics adapter is old and very basic, the most basic in its series (there can't exist a lesser model). It's an integrated, low power mobile nVidia GeForce 7000M. With this adapter I get about 2350 fps with a blank screen, and 850 fps with blank graphics while the rest of the engine (input, sound, network, etc.) and the application logic are running. And finally I get my 235/248+ when throwing everything at it. Please tell me right now if these numbers seem wrong; I haven't had the chance to compare against other implementations. I wonder what I'm going to be able to accomplish when running my engine on newer top end hardware. I'm excited about this. It looks like z-testing and z-buffers in hardware are really expensive features, even with newer hardware. I recommend it (not really) if you're using it for 3D content (meshes) in a very deep 3D space and don't want to mess with complex code. But for 2D, and especially quads, per-pixel z-testing is not needed, at least not in the way HW performs it. A very well crafted custom Z implementation seems to outperform it in all scenarios. This thread is more about Z performance than alpha blending techniques. Take my post above as a huge piece of advice. Cheers.
  13. Thanks MJP for the reply. If I disable z-writes then the z-buffer does not work at all. How can I leave only the test enabled? (Just in case:) I'm using d3d9, clearing the z-buffer and setting it to 0, and the zfunc is greater or equal. I'm taking your advice and will rewrite the drawing/sorting algorithm to something like this (I'll describe it in case someone needs something like this too; I have implemented something quite similar and it's giving me great performance for 2D content with the z-buffer off). So, here's my secret: First I'll try with Z off (because I get more performance when drawing 2D). Then I'll try enabling it and changing the draw order. One big VB per FVF (one lock per frame). No additional SetStreamSource. A smaller fixed IB per geometry. One lifetime write. Fewer additional SetIndices. I'll code a drawing timeline. Items will be added to this line in the order they were created. If an item has the same or similar geometry as an item further back in the line AND this item does not occlude (partially or totally) other items in the line, then these objects are batched together. If an item entirely occludes other item(s) and it has no alpha, then the occluded items are deleted from the line (this is perhaps an efficient z-occlusion test on the CPU; at least it performs the test on an entire region rather than per pixel as is done on the GPU). If an item has the same geometry and could be batched with other items, but it occludes them partially, or occludes them totally but has alpha, then this item cannot be batched and has to remain forward in the line. This is the main idea. I hope this helps someone writing a D3D based library/program from scratch to draw 2D. It doesn't sacrifice performance or mess with the z-buffer, scissor test and stenciling. My current implementation gives me at most 235/248 fps at medium-high content load (only 2D), with many texture changes and with older, motherboard-integrated hardware. I'll try this new implementation and post the results. In the meanwhile any z-buffer optimizations would be appreciated.
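The region-based occlusion step of that timeline can be sketched as follows. This is an illustration under assumptions, not the poster's code: `Rect`, `Item`, `Contains` and `CullOccluded` are hypothetical names, items are assumed to be axis-aligned rectangles, and the line is assumed to be in draw order so that later items end up on top. The batching rules are left out; only the "opaque item fully covers earlier items" test is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned screen region of an item (assumed representation).
struct Rect { int left, top, right, bottom; };

struct Item {
    int  id;
    Rect bounds;
    bool has_alpha;  // true if the item has transparent or semitransparent areas
};

// True if 'outer' covers 'inner' completely.
bool Contains(const Rect& outer, const Rect& inner) {
    assert(outer.left <= outer.right && outer.top <= outer.bottom);
    assert(inner.left <= inner.right && inner.top <= inner.bottom);
    return outer.left <= inner.left && outer.top <= inner.top &&
           outer.right >= inner.right && outer.bottom >= inner.bottom;
}

// 'line' is in draw order (later items are drawn on top). An item is dropped
// when a later item without alpha covers its whole region.
std::vector<Item> CullOccluded(const std::vector<Item>& line) {
    std::vector<Item> visible;
    for (std::size_t i = 0; i < line.size(); ++i) {
        bool hidden = false;
        for (std::size_t j = i + 1; j < line.size() && !hidden; ++j) {
            hidden = !line[j].has_alpha && Contains(line[j].bounds, line[i].bounds);
        }
        if (!hidden) {
            visible.push_back(line[i]);
        }
    }
    return visible;
}
```

The test works on whole regions, so its cost depends on the number of items rather than the number of pixels; the pairwise loop is quadratic, which is acceptable for small batches but would need a spatial structure for large scenes.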
  14. Hi there, I've already searched for this in the forums and it's not there, so I'm posting the situation here. I've made an illustration for easy understanding of the problem. It's quite simple and direct, and it won't take you much time to read and understand. I implemented the z-buffer very easily. It sorts all of my objects as I desire. BUT the performance went to the bottom. With a very low end device, before the z implementation I was getting 204 fps and now with z enabled I get 114, which makes a huge difference. I'm not using all of these fps, I just need 45/55, but all of the remaining fps are reserved for the application, so the more fps the better, right? Now, the SDK says that when using the z-buffer, performance can be improved by drawing objects from front to back, which sounds reasonable. Ok, I did that and from 114 fps I went to 128, at least something. The real problem is that I'm drawing textured quads (fixed pipeline functions) and when a texture contains an alpha channel with transparent/semitransparent areas I get the following: The scenario is: On the left side, we have 2 textured quads: a soccer ball with transparent and semitransparent areas, and a second quad which is an opaque square. The ball is always drawn first. The alpha blending is working fine, as we can see the ball is perfectly merged with the light blue background. On the right side, we draw from front to back and successfully put the ball in front of the square, but note that in the corners and surrounding borders of the ball there should be red pixels from the square. It's totally normal and I understand why this happens. The z-test and alpha blending are working as they should. This happens because the ball is drawn first and gets its pixel colors from what is behind it, which in this case is the blue background; then, when the square is drawn at the same coords as the ball, the pixels sharing the same space with the ball's rect are not drawn because the z-test is doing its job. The question is quite obvious: how can I draw the square behind the ball and keep the missing red pixels visible? An obvious answer would be: Easy! In this case draw the square first and the ball at the end. Well, that's not possible because in my application manual sorting is not possible. The square and ball are just an example; multiple other objects are drawn in a batch. If I draw all of the objects in reverse order it works fine, but then what's the point of this post. Another question: if I forget about this front to back thing to get rid of this alpha issue and draw normally from back to front, is there another way to get more performance when the z-buffer is enabled? Any tips? Thanks in advance and I hope you liked my picture ;).
  15. Stencil Buffer to crop

    Well, well. This successful topic didn't last too long. I just found a bug with the scissor test. There's a problem with indexed quads. I was drawing indexed quads all the time, but by coincidence I was only drawing different quads. BUT when drawing indexed quads that have the same geometry from the same vertex and index buffer, the scissor rectangle applies to all quads in the batch and not to each one individually. So Buckeye's idea of resizing quads and u,v coords is at the moment the best implementation.
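The quad-resizing approach mentioned above can be sketched like this. It is a guess at the general idea rather than Buckeye's actual suggestion, whose details are not quoted in this post: `Quad`, `CropRect` and `CropQuad` are hypothetical names, and the quad is assumed to be axis-aligned with texture coordinates that vary linearly across it. The quad is shrunk to the crop rectangle and its u,v coordinates are moved by the same proportion, so the visible part of the texture stays in place.

```cpp
#include <algorithm>
#include <cassert>

struct Quad {
    float x0, y0, x1, y1;  // screen-space corners: (x0, y0) top-left, (x1, y1) bottom-right
    float u0, v0, u1, v1;  // texture coordinates at those corners
};

struct CropRect { float x0, y0, x1, y1; };

// Shrinks 'q' to the part inside 'c' and adjusts its u,v coords to match.
// Returns false if the quad lies completely outside the crop rectangle.
bool CropQuad(Quad& q, const CropRect& c) {
    assert(q.x0 < q.x1 && q.y0 < q.y1);  // sketch assumes a non-degenerate quad

    const float nx0 = std::max(q.x0, c.x0);
    const float ny0 = std::max(q.y0, c.y0);
    const float nx1 = std::min(q.x1, c.x1);
    const float ny1 = std::min(q.y1, c.y1);
    if (nx0 >= nx1 || ny0 >= ny1) {
        return false;  // nothing left to draw
    }

    const float width  = q.x1 - q.x0;
    const float height = q.y1 - q.y0;
    const float du = q.u1 - q.u0;
    const float dv = q.v1 - q.v0;

    // Interpolate the texture coordinates at the new corners.
    const float nu0 = q.u0 + du * (nx0 - q.x0) / width;
    const float nu1 = q.u0 + du * (nx1 - q.x0) / width;
    const float nv0 = q.v0 + dv * (ny0 - q.y0) / height;
    const float nv1 = q.v0 + dv * (ny1 - q.y0) / height;

    q = Quad{nx0, ny0, nx1, ny1, nu0, nv0, nu1, nv1};
    return true;
}
```

Because the cropping happens on the CPU before the vertices are written, each quad in a batch can have its own crop rectangle, which is what the shared scissor rectangle could not provide.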