About sebjf

  1. sebjf

    alternative to NVidia FX Composer?

    The last videos from ShaderFlex on their YouTube channel were posted ~3 years ago: https://www.youtube.com/channel/UCEljvElA7sJqMr_wy4Agdcw/videos That would line up with their mention of the Oculus DK1 and DK2, but not the CV1 or Vive. There is also ShaderToy, if you're happy to be limited to ~OpenGL ES 2.0, though it's always just crashed the browser on my home PC. I'd second Unity - it's big, but it auto-compiles, has good compile-time error reporting through the VS interop, and between the scene view and the inspector material preview (where, by the way, you can also tweak material parameters in real-time), gives nice visual feedback.
  2. Hi ajmiles, Unfortunately that introduces another race condition:

     Thread 1 writes InterlockedMin with B, gets A in original_value, where B < A.
     Thread 1 checks B < A.
     Thread 2 writes InterlockedMin with C (where C < B), gets B in original_value.
     Thread 2 writes Y to Positions.
     Thread 1 writes X to Positions.

     Positions now contains X, even though it should have Y, because between Thread 1 making the check and making the write, Thread 2 has written its value. I went with a simple solution: run the triangle-distance test twice - one pass generates the mask (as you suggest), the other checks it and writes the position. Efficiency is not important, because this is a reference implementation against which I can debug faster algorithms, which hopefully won't need global sync to memory (which is itself slow). It would be neat to know how to do a global sync within a shader, for interest's sake though...
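     The two-pass pattern described above can be sketched sequentially in Python - pass 1 plays the role of the InterlockedMin mask pass, pass 2 the conditional write. Names like `distance_mask` and `closest_points` are illustrative, not taken from the shader:

```python
def two_pass_closest(pairs):
    """pairs: list of (particle_id, distance, position) tuples,
    one per point-primitive pair (i.e. one per GPU thread)."""
    distance_mask = {}
    # Pass 1: equivalent of InterlockedMin into distanceMask.
    for pid, dist, _pos in pairs:
        distance_mask[pid] = min(distance_mask.get(pid, float("inf")), dist)
    # Pass 2: only the thread(s) holding the minimum write the position.
    closest_points = {}
    for pid, dist, pos in pairs:
        if dist == distance_mask[pid]:
            closest_points[pid] = pos
    return closest_points

pairs = [(0, 2.0, "B"), (0, 1.5, "C"), (0, 3.0, "A")]
print(two_pass_closest(pairs))  # particle 0 keeps "C", the nearest
```

     Because the mask is fully built before any position write happens, the check-then-write race between the two passes disappears, at the cost of running the distance test twice.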
  3. Thanks for the replies! I have just found the problem though (it's me being dumb and not reading what is in front of my eyes!). AllMemoryBarrierWithGroupSync() prevents race conditions by ensuring all threads within the group have reached the same place before passing the barrier. However, I am dispatching many groups, at least one per triangle, so this pattern will not work. (The buffer declaration would also be wrong even if it did, as it needs the globallycoherent prefix.)
  4. Hi, I am trying to brute-force a closest-point-to-closed-triangle-mesh algorithm on the GPU by creating a thread for each point-primitive pair and keeping only the nearest result for each point. This code fails, however: multiple writes are made by threads that computed different distances. To keep only the closest value, I attempt to mask using InterlockedMin, plus a conditional that only writes if the current thread holds the same value as the mask after a memory barrier. I have included the function below. As can be seen, I have modified it to write to a different location every time the conditional succeeds, for debugging. It is expected that multiple writes will take place - for example, where the closest point is a vertex shared by multiple triangles - but when I read back closestPoints and calculate the distances, they are different, which should not be possible. The differences are large (~0.3+), so I do not think it is a rounding error. The CPU equivalent works fine for a single particle. After the kernel execution, distanceMask does hold the smallest value, suggesting the problem is with the barrier or the conditional. Can anyone say what is wrong with the function?
RWStructuredBuffer<uint> distanceMask : register(u4);
RWStructuredBuffer<uint> distanceWriteCounts : register(u0);
RWStructuredBuffer<float3> closestPoints : register(u5);

[numthreads(64,1,1)]
void BruteForceClosestPointOnMesh(uint3 id : SV_DispatchThreadID)
{
    int particleid = id.x;
    int triangleid = id.y;

    Triangle t = triangles[triangleid];

    float3 v0 = GetVertex1(t.i0);
    float3 v1 = GetVertex1(t.i1);
    float3 v2 = GetVertex1(t.i2);

    float3 q1 = Q1[particleid];

    ClosestPointPointTriangleResult result = ClosestPointPointTriangle(q1, v0, v1, v2);
    float3 p = v0 * result.uvw.x + v1 * result.uvw.y + v2 * result.uvw.z;

    uint distance = asuint(length(p - q1));

    InterlockedMin(distanceMask[particleid], distance);

    AllMemoryBarrierWithGroupSync();

    if (distance == distanceMask[particleid])
    {
        uint bin = 0;
        InterlockedAdd(distanceWriteCounts[particleid], 1, bin);
        closestPoints[particleid * binsize + bin] = p;
    }
}
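     For checking the ClosestPointPointTriangle step against a CPU reference: below is a Python transcription of the standard barycentric Voronoi-region test (after Ericson's Real-Time Collision Detection, ch. 5). Function and variable names are mine, not from the shader above:

```python
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def closest_point_on_triangle(p, a, b, c):
    """Return barycentric coords (u, v, w) of the point on triangle abc
    closest to p, so the closest point itself is u*a + v*b + w*c."""
    ab, ac, ap = sub(b, a), sub(c, a), sub(p, a)
    d1, d2 = dot(ab, ap), dot(ac, ap)
    if d1 <= 0 and d2 <= 0:
        return (1.0, 0.0, 0.0)                 # region: vertex A
    bp = sub(p, b)
    d3, d4 = dot(ab, bp), dot(ac, bp)
    if d3 >= 0 and d4 <= d3:
        return (0.0, 1.0, 0.0)                 # region: vertex B
    vc = d1*d4 - d3*d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        v = d1 / (d1 - d3)
        return (1.0 - v, v, 0.0)               # region: edge AB
    cp = sub(p, c)
    d5, d6 = dot(ab, cp), dot(ac, cp)
    if d6 >= 0 and d5 <= d6:
        return (0.0, 0.0, 1.0)                 # region: vertex C
    vb = d5*d2 - d1*d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        w = d2 / (d2 - d6)
        return (1.0 - w, 0.0, w)               # region: edge AC
    va = d3*d6 - d5*d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return (0.0, 1.0 - w, w)               # region: edge BC
    denom = 1.0 / (va + vb + vc)
    v, w = vb * denom, vc * denom
    return (1.0 - v - w, v, w)                 # region: face interior

# Point above the interior projects straight down onto the face:
print(closest_point_on_triangle((0.25, 0.25, 1.0),
                                (0, 0, 0), (1, 0, 0), (0, 1, 0)))
```

     Comparing these barycentric weights with the shader's result.uvw per point-triangle pair isolates whether the discrepancy comes from the geometry test or from the masked write.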
  5. Hi raigan, here you go: https://www.researchgate.net/publication/266653351_A_FluidCloth_Coupling_Method_for_High_Velocity_Collision_Simulation

     Thank you coderchris, I will take a look! Right now I just use a series of basic swept-primitive tests with a discrete stage run afterwards to undo forced penetrations. I've never tried a history-based version. Thanks Mike2343, I will have a look!

     Thanks swiftcoder, I love distance fields, but I am more interested in deformable models right now. I implemented 3D rasterisation into both (vector) distance fields and spatial hashes (with triangle collision detection) a while back, and found that BVH re-fitting and traversal on the GPU outperformed both. Spatial hashing performed equivalently to BVHs, but only in the mid-range (30k models and 13k cloth particles); at the extremes, BVHs took the prize. (Also, distance fields require something like ray-marching for CCD, which is more problematic given what I say last.) I won't go into detail unless someone asks, but suffice to say the BVH implementation was *a lot* more complex than the 3D rasterisations, needing multiple invocations on the CPU, multiple copies between buffers on the GPU (and I mean just copies), and loops in kernels with memory barriers - and it was still faster.

     I've continued with my implementation, putting the basic swept-sphere-primitive tests + DCD stage on the GPU. My 1080 runs ~2.2 million tests in about 0.5 ms, which is OK I guess. Much more interesting, though, was comparing the performance with and without the broadphase. A basic AABB test of all tris vs. all particles (the broadphase itself only pushing to an append buffer) ran at about 0.25 ms. I know that GPUs nowadays are in most cases memory bound, but I have some sort of mental block and still manage to be surprised by it every time...

     The next stage is to implement interpolation/sharing of impulses like Bridson's, so the mesh responds to the collisions as well. Then, once that is on the GPU (because it will impact bandwidth quite a bit), I'll run some more complete tests. It'll be interesting once that's done to see what swapping in BSC can do.
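     The broadphase mentioned above - each triangle's AABB, fattened by the particle radius, tested against every particle, with survivors handed to the narrow phase - can be sketched like this. All names are illustrative; the GPU version pushes the surviving pairs to an append buffer instead of a list:

```python
def tri_aabb(v0, v1, v2, radius):
    """Axis-aligned bounding box of a triangle, grown by the particle radius."""
    lo = tuple(min(a, b, c) - radius for a, b, c in zip(v0, v1, v2))
    hi = tuple(max(a, b, c) + radius for a, b, c in zip(v0, v1, v2))
    return lo, hi

def broadphase(particles, triangles, radius):
    """Return (particle_index, triangle_index) pairs that survive the AABB test."""
    pairs = []
    for ti, (v0, v1, v2) in enumerate(triangles):
        lo, hi = tri_aabb(v0, v1, v2, radius)
        for pi, p in enumerate(particles):
            if all(l <= x <= h for x, l, h in zip(p, lo, hi)):
                pairs.append((pi, ti))   # candidate for the narrow phase
    return pairs

tris = [((0, 0, 0), (1, 0, 0), (0, 1, 0))]
parts = [(0.2, 0.2, 0.05), (5.0, 5.0, 5.0)]
print(broadphase(parts, tris, 0.1))  # only the nearby particle survives
```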
  6. Thanks for your insights both! Nothing other than I'd hoped there was something more performant. This is in fact what I have written for now: a ray-triangle intersection for CCD with closest-point-on-triangle for DCD. Interesting system! It sounds similar to Qian et al's circumsphere. I was thinking of trying something like this. I suspect this as well. When considering collision detection techniques, though, I always mentally compare them against the ideal (and most correct) triangle-based system. The problem is this is always hypothetical (and I am bad at estimating performance), so I have decided to build a triangle-based system for benchmarking. If it turns out fast enough to use, that's a big bonus, as triangle meshes have some advantages over sphere/point-based representations, such as better friction emulation.

     It didn't occur to me before that what is really needed for a penalty-force-based response (which I think is the nicest approach, because it automatically handles numerical precision error and cloth offset/arbitrary particle size) is a moving-sphere-triangle test, as opposed to a point-triangle test, so I've been reading about these. Since there doesn't seem to be a consensus on the best, I decided to do some performance tests. I picked a few algorithms and put them in a test rig. They were refactored to remove most of the early returns, to better emulate behaviour on a GPU. The test rig was configured to generate 100,000 rays or so cast into a volume containing a triangle; then, across multiple repeats with multiple particle sizes, their execution times were measured. The algorithms I tested are below. Not all of them were 100% working, as this is preliminary, but they were working enough that I am comfortable using the measurements as a guideline.

     1. Moller-Trumbore with a distance offset. Used as the benchmark. The intersection distance is adjusted so it's always a set distance above the surface of the triangle. Not a proper implementation, because the ray must still hit the triangle.
     2. Flip Code. Implementation I found here: http://www.flipcode.com/archives/Moving_Sphere_VS_Triangle_Collision.shtml
     3. Fauerby. Fauerby's intersection test. This didn't work as written - possibly a porting mistake.
     4. Eberly. Eberly's intersection test (https://www.geometrictools.com/Documentation/IntersectionMovingSphereTriangle.pdf). This didn't work as written, as some of the pseudocode functions are missing.
     5. Geometric. My own implementation. Uses geometric tests against primitives approximating the Minkowski sum of the radius and triangle, as opposed to finding roots like most of the others (though the cylinder intersection still requires it).

     The results (in microseconds, per ray):

     Algo                                      Avg (us)   StdErr (us)
     _____________________________________     ________   ___________
     MollerTriangleCollisionDiagnostics        0.76056    0.0019454
     GeometricTriangleCollisionDiagnostics     2.29       0.0094226
     FlipCodeTriangleCollisionDiagnostics      2.8112     0.010876
     FauerbyTriangleCollisionDiagnostics       1.4425     0.0035103
     EberlyTriangleCollisionDiagnostics        6.3413     0.019937

     Interesting to start with. I am glad the Geometric solution is competitive, because conceptually it's simple and more hackable than the root-finding methods, I feel, anyway (I can sort of imagine how DCD could be integrated with it already). I think it's worth trying to fix Fauerby's, it being only about twice the cost of a basic ray-triangle intersection despite doing considerably more. Eberly's I won't pursue, given that I want a GPGPU implementation eventually and it's not a fair comparison after I've gutted all its optimisations!
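     Algorithm 1 above can be sketched as plain Moller-Trumbore with the hit distance pulled back by an offset. The offset handling here is my reading of the description ("a set distance above the surface"), not the author's code:

```python
def moller_trumbore(orig, dirn, v0, v1, v2, offset=0.0, eps=1e-9):
    """Ray/triangle intersection; returns the hit distance minus `offset`
    (stopping short of the surface), or None on a miss."""
    def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
    def dot(a, b): return sum(x*y for x, y in zip(a, b))
    def cross(a, b):
        return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])

    e1, e2 = sub(v1, v0), sub(v2, v0)
    pvec = cross(dirn, e2)
    det = dot(e1, pvec)
    if abs(det) < eps:                  # ray parallel to triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(orig, v0)
    u = dot(tvec, pvec) * inv_det       # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, e1)
    v = dot(dirn, qvec) * inv_det       # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, qvec) * inv_det
    return t - offset                   # stop a set distance before the surface

# Ray straight down onto the unit triangle in the z=0 plane:
t = moller_trumbore((0.2, 0.2, 1.0), (0.0, 0.0, -1.0),
                    (0, 0, 0), (1, 0, 0), (0, 1, 0), offset=0.1)
print(t)
```

     As the post notes, this is only a benchmark baseline: the ray must still hit the (unfattened) triangle, so grazing contacts within the radius are missed.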
  7. Hello, I would like to implement continuous point-triangle collision detection as part of a cloth simulation. What robust implementations of this are used in practice? For example, it is simple to use a ray-triangle intersection with q (particle start) and q1 (particle end) defining the ray relative to the triangle. I find this is not robust, though, as the simulation can force the particles into an implausible state which, once it occurs even by the slightest margin, is unrecoverable. Over-correcting the particles helps somewhat, but introduces oscillations. This could be combined with a closest-point-to-triangle test that runs if the ray intersection fails, but this seems very expensive and is essentially running two collision detection algorithms in one. Is there not a better way to combine them? I have searched for a long time on this, but most resources are concerned with improved culling, or improved primitive tests. I've found only one paper that specifically addresses combining CCD & DCD for point-triangle tests (Wang, Y., Meng, Y., Du, P., & Zhao, J. (2014). Fast Collision Detection and Response for Vertex/Face Coupling Notations. Journal of Computational Information Systems, 10(16), 7101-7108. http://doi.org/10.12733/jcis11492), which uses an iterative search with a thickness parameter. Is there anything beyond what I have described for this? Sj

     EDIT: I am familiar with Bridson, Du, Tang, et al. and the vertex-face penalty force computation, but I don't see how this is fundamentally different from the ray + closest-point test, other than that the cubic formulation allows both the triangle and vertex to change in time. Though Du et al's Dilated Continuous Contact Detection seems like it should do both, so maybe I need to read it again..
  8. Hi richardurich, I think that's it. I first put a DeviceMemoryBarrier() call between the InterlockedMin() and a re-read of the value. That didn't work, though it may be to do with clip() (I recall reading about the effect this had on UAVs, and will see if I can find the docs). Then I removed the test entirely and wrote a second shader to draw the contents of the depth buffer - and that appears to be very stable. I will see if I can get it to work for a test in a single shader, though I could probably refactor my project to just use a single UAV, which would be more efficient. Thank you very much for your insights. I have been working at that for 2 days! Sj
  9. The colour values are always <=0 or >=1; I made sure of that when I created the test data (I also check it in RenderDoc). Though currently the shader is written as

     float c_norm = clamp(col.x, 0, 1) * 100;
     uint d_uint = (uint)c_norm;
     uint d_uint_original = 0;

     just to be sure. I am using the colour values to make this minimal example, as they are easy to control. In my real project the masked value is more complex, but as can be seen, the bug occurs even with something as simple as vertex colours. Yes, that's right - it has three possible values: 0, 1 (from the fragments) or 0xFFFFFFFF, which is the initialisation value. I have confirmed this is the case using the conditional as well. That's why I suspect it's a timing issue rather than, say, reading the wrong part of memory or not binding anything, even though I can't fully trust the debugger. This is meant to be the absolute simplest case I can come up with that still shows the issue.
  10. Hi samoth, Yes it does, in this narrow case anyway - usually I use asuint() or as you say multiply by a large number then cast. Above I did a direct cast because it was easy to see which triangle wrote each value when checking the memory in RenderDoc. (I've tried all sorts of casts and scales to see if that was causing this issue however and none have an effect.)
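     For what it's worth, the reason asuint() composes with an integer InterlockedMin at all is that non-negative IEEE-754 floats keep their ordering when their bits are reinterpreted as unsigned integers. A quick check, using struct to reinterpret the bits the way asuint() does (negative floats would break this, but distances are never negative):

```python
import struct

def asuint(f):
    """Reinterpret a float32's bit pattern as a uint32 (like HLSL asuint)."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

distances = [0.5, 0.25, 3.0, 1.5]

# The same element wins under the bitwise (uint) ordering as under
# the normal float ordering:
print(min(distances) == min(distances, key=asuint))
print(sorted(distances, key=asuint))
```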
  11. @richardurich I originally tried to upload it to the forum but kept receiving errors. I've added the relevant code, though, since, as you say, it's not too long. I can't see anything I could change in it, though - e.g. calling InterlockedMin like a method, as one would with a RWByteAddress type, just results in a compiler error.
  12. Hi, I am working in Unity, trying to create depth-buffer-like functionality using atomics on a UAV in a pixel shader. I find, though, that it does not behave as expected: it appears as if the InterlockedMin call is not behaving atomically. I say appears, because all I can see is that the conditional based on the original memory value returned by InterlockedMin does not behave correctly. Whatever causes incorrect values to be returned from InterlockedMin also occurs in the frame debugger - Unity's and RenderDoc - so when debugging a pixel it changes from under me! By changing this conditional I can see that InterlockedMin is not returning random data; it returns values that the memory feasibly would contain, just not what should be the minimum.

     Here is a video showing what I mean: https://vid.me/tUP8
     Here is a video showing the same behaviour for a single capture in RenderDoc: https://vid.me/4Fir
     (In that video the pixel shader is trying to use InterlockedMin to draw only the fragments with the lowest vertex colours encountered so far, and discard all others.)

     Things I have tried:
     - RWByteAddressBuffer instead of RWStructuredBuffer
     - Different creation flags for ComputeBuffer (though since it's Unity, the options are limited and opaque)
     - Using a RenderTexture instead of a ComputeBuffer
     - Using the globallycoherent prefix
     - Clearing the buffer in the pixel shader, then syncing with a DeviceMemoryBarrier() call
     - Clearing the buffer in the pixel shader every other frame with a CPU-set flag
     - Using a different atomic (InterlockedMax())
     - Using a different slot and/or binding calls

     Here is the minimum working example that created those videos: https://www.dropbox.com/s/3z2g85vcqw75d1a/Atomics%20Bug%20Minimum%20Working%20Example.zip?dl=0

     I can't think of what else to try. I don't see how the issue could be anything other than the InterlockedMin call, and I don't see what else in my code could affect it...
     Below is the relevant fragment shader:

     float4 frag (v2f i) : SV_Target
     {
         // sample the texture
         float4 col = i.colour;
         float c_norm = clamp(col.x, 0, 1);    // one triangle is <=0 and the other is >=1
         uint d_uint = (uint)c_norm;
         uint d_uint_original = 0;

         uint2 upos = i.screenpos * screenparams;
         uint offset = (upos.y * screenparams.x) + upos.x;

         InterlockedMin(depth[offset], d_uint, d_uint_original);

         if (d_uint > d_uint_original)
         {
             clip(-1);    // we haven't updated the depth buffer (or at least shouldn't have) so don't write the pixel
         }

         return col;
     }

     With the declaration of the buffer being:

     RWStructuredBuffer<uint> depth : register(u1);

     And here is how the buffer is being bound and used:

     // Use this for initialization
     void Start ()
     {
         int length = Camera.main.pixelWidth * Camera.main.pixelHeight;
         depthbufferdata = new uint[length];
         for (int i = 0; i < length; i++)
         {
             depthbufferdata[i] = 0xFFFFFFFF;
         }
         depthbuffer = new ComputeBuffer(length, sizeof(uint));
     }

     // Update is called once per frame
     void OnRenderObject ()
     {
         depthbuffer.SetData(depthbufferdata); // clears the mask. in my actual project this is done with a compute shader.
         Graphics.SetRandomWriteTarget(1, depthbuffer);
         material.SetVector("screenparams", Camera.main.pixelRect.size);
         material.SetPass(0);
         Graphics.DrawMeshNow(mesh, transform.localToWorldMatrix);
     }

     Sj
  13. Hi, I am working in Unity on pieces of shader code to convert between a memory address and a coordinate in a uniform grid. To do this I use the modulo operator, but find odd behaviour I cannot explain. Below is a visualisation of the grid. It simply draws a point at each gridpoint. The locations for each vertex are computed from the offset into the fixed-size uniform grid. I.e. the vector cell is computed from the vertex shader instance ID, and this is in turn converted into NDCs and rendered.

     I start with the naive implementation:

     uint3 GetFieldCell(uint id, float3 numcells)
     {
         uint3 cell;
         uint layersize = numcells.x * numcells.y;
         cell.z = floor(id / layersize);
         uint layeroffset = id % layersize;
         cell.y = floor(layeroffset / numcells.x);
         cell.x = layeroffset % numcells.x;
         return cell;
     }

     And see the following visual artefacts:

     [attachment=35344:modulo_1.PNG]

     I discover that this is due to the modulo operator. If I replace it with my own modulo operation:

     uint3 GetFieldCell(uint id, float3 numcells)
     {
         uint3 cell;
         uint layersize = numcells.x * numcells.y;
         cell.z = floor(id / layersize);
         uint layeroffset = id - (cell.z * layersize);
         cell.y = floor(layeroffset / numcells.x);
         cell.x = layeroffset - (cell.y * numcells.x);
         return cell;
     }

     The artefact disappears:

     [attachment=35345:modulo_3.PNG]

     I debug one of the errant vertices in the previous shader with RenderDoc, and find that it is implemented using frc, rather than a true integer modulo op, leaving small components that work their way into the coordinate calculations:

     [attachment=35346:modulo_2.PNG]

     So I try again:

     uint3 GetFieldCell(uint id, float3 numcells)
     {
         uint3 cell;
         uint layersize = numcells.x * numcells.y;
         cell.z = floor(id / layersize);
         uint layeroffset = floor(id % layersize);
         cell.y = floor(layeroffset / numcells.x);
         cell.x = floor(layeroffset % numcells.x);
         return cell;
     }

     And it wor...! Oh...
     ...That's unexpected:

     [attachment=35347:modulo_4.PNG]

     Can anyone explain this behaviour? Is it small remainders of the multiplication of the frc result with the 'integer', as I suspect? If not, what else? If so, why does surrounding the result with floor() not work? (It's not optimised away; I've checked it in the debugger...) Sj
  14. Thanks MJP!   Do you know what MSDN meant by that line in my original post? It says 'resource' specifically, rather than view - but then the whole thing is pretty ambiguous.   Sj
  15. Hi, I have a compute shader which populates an append buffer, and another shader that reads from it as a consume buffer. Between these invocations, I would like to read every element in the resource in order to populate a second buffer. I can think of a couple of ways to do it, such as using Consume() in my intermediate shader and re-setting the count of the buffer afterwards, or binding the resource as a regular buffer and reading the whole thing. There doesn't seem to be a way to set the count entirely on the GPU, and it's not clear if the second method is supported (e.g. "Use these resources through their methods, these resources do not use resource variables."). Is there any supported way to read an AppendStructuredBuffer without decreasing its count? Thanks! (PS. Cross-post at SO: http://stackoverflow.com/questions/41416272/set-counter-of-append-consume-buffer-on-gpu)