
OpenGL ComputeShader Performance / Crashes



Made a (looong) GLSL ComputeShader for Tiled-Deferred rendering. On my laptop with a 2013 nVidia graphics card, it works fine. But now I'm sending the program to some other guys. And as you know, that's always where the headaches start :) I can't debug anything on their machines, only guess. I need your experience or guessing-powers to give me some directions!



Guy1 had an nVidia card. Not too old, certainly not new either. The video card driver hung / crashed when running this particular ComputeShader. Disabling loops "fixed" it:

// requires OpenGL 4.3 (#version 430)
#define TILE_SIZE 32
layout (local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;
shared uint _indxLightsPoint[ MAXLIST_LIGHTS_POINT ];   // Found PointLights, indexes to UBO lightArray

// 1. Let each pixel inside a tile check ONE pointlight, see if it intersects the tile-frustum. If so, add it to a shared list
uint thrID = gl_LocalInvocationID.x + gl_LocalInvocationID.y * TILE_SIZE;  // Each task gets a number (0,1,2, ... 1023)
if ( thrID < counts1.x ) { // "counts1" comes from a UBO parameter. Would be "2" if there were 2 active lights in the scene
   if ( pointLightIntersects( tileFrustum, lightPoint[ thrID ].posRange ) ) {
      // Add lightIndex to list
      uint index = atomicAdd( _cntLightsPoint, 1 );
      if ( index < MAXLIST_LIGHTS_POINT )
           _indxLightsPoint[ index ] = thrID;
   }
}

// 2. Loop through the lights we found
for (uint i=0; i < _cntLightsPoint; i++) {
    uint index = _indxLightsPoint[i];
    addPointLight( brdf, surf, lightPoint[index] );
} // for

Compiles, starts, hangs the video-driver. If I simplify all this code to a fixed " addPointLight( ... lightPoint[ 0 ] )", it works. And a damn lot faster as well (even though I only had 1 or 2 lights in the scene anyway). If I re-enable "barrier" or some of the atomic operations, the FPS crumbles again. My first thought was that the "FOR LOOP" went crazy, counting to an extremely high number. But even if I put a hard-coded number there, it still crashes. The other suspect might be an out-of-range array read, but I can't see how.


Could it be that "older" cards (2010..2012) have issues with (GLSL) Barriers or Atomic operations? Or maybe the hard-coded Tilesize (32x32) is too big? Although I would expect a compiler crash in that case.


Guy1 now has a new AMD card. But it seems it doesn't support some OpenGL 4.5.0 features (though all shaders use 430). Got stranded after that.




Guy2 had a 2011 nVidia card, don't know which one exactly. Everything works, but the graphics seem more blurry (anisotropic / mipmapping settings?). Moreover, the framerate is horrible. Mine is ~50 .. 60 FPS at a larger resolution, his is 5. I expected a drop, but not that much. As usual there could be a billion things wrong, but my main suspects are:


- ComputeShader setup (tilesize 32x32 too big)

- ComputeShader operations (atomicAdd / atomicMin / atomicMax / FOR LOOP / Barrier )

- I assume 24+ texture units are available ( ie "layout(binding=20) uniform sampler2D gBufferXYZ;" ). I know older cards only have 16 or so. But again I would expect a crash then.

- Not using glMemoryBarrier( GL_ALL_BARRIER_BITS ); (properly), prior or after calling the CS



My gut says to replace the ComputeShader with good old Fragment shaders and such. Then again, it just works well on my own computer. And since it's quite a job to change, it would suck if something very different turns out to be the party-crasher.



Edited by spek

You modify shared memory (_indxLightsPoint[ index ] = thrID),
you do a barrier, but you forget to do a memory barrier on shared memory as well.
You read shared memory (index = _indxLightsPoint), but it is not guaranteed that all threads see the expected thrID.

Maybe that's it. I'd not give up so soon because you have no shared memory in fragment shaders.
Personally I gave up on OpenGL compute shaders because OpenCL was two times faster on Nvidia and slightly faster on AMD 1-2 years ago.

For me it was absolutely necessary to stop the compiler from unrolling loops (forgot the command).
The compiler did not hesitate to unroll loops with > 1000 iterations :)

>> - ComputeShader setup (tilesize 32x32 too big)

It's always worth trying out: different hardware, different results. I'd assume 8*8 or 16*16 is better than 32*32.
On OpenCL the maximum for ATI is 512, but the OpenGL spec requires a minimum of 1024, so I guess it's a slowdown for ATI to sync 1024 threads.
The hardware minimum for ATI is 64, NV 32. So in practice choose 64, 128 or 256 depending mostly on register usage.

Edited by JoeJ
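One consequence of a smaller tile is that a work group may have fewer threads than there are lights, so "each thread checks ONE light" no longer covers them all. A minimal sketch of a strided culling loop for an 8*8 tile follows. The buffer names `Lights` and `TileCounts`, the uniform `lightCount`, the list size and the trivial intersection test are assumptions made for illustration, not taken from the original shader.

```glsl
#version 430
#define TILE_SIZE 8                                  // 8*8 = 64 threads per work group
#define GROUP_THREADS ( TILE_SIZE * TILE_SIZE )
#define MAXLIST_LIGHTS_POINT 64                      // assumed list capacity

layout (local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;

// Hypothetical inputs/outputs: xyz = light position, w = range
layout (std430, binding = 0) readonly  buffer Lights     { vec4 posRange[]; };
layout (std430, binding = 1) writeonly buffer TileCounts { uint tileCount[]; };
uniform uint lightCount;

shared uint _cntLightsPoint;
shared uint _indxLightsPoint[ MAXLIST_LIGHTS_POINT ];

void main() {
    uint thrID = gl_LocalInvocationIndex;            // 0 .. 63

    // Shared variables start undefined, so one thread resets the counter
    if ( thrID == 0u ) _cntLightsPoint = 0u;
    memoryBarrierShared();
    barrier();

    // Thread t tests lights t, t+64, t+128, ... so every light is visited once
    for ( uint l = thrID; l < lightCount; l += uint( GROUP_THREADS ) ) {
        if ( posRange[l].w > 0.0 ) {                 // placeholder for the real tile-frustum test
            uint index = atomicAdd( _cntLightsPoint, 1u );
            if ( index < uint( MAXLIST_LIGHTS_POINT ) )
                _indxLightsPoint[ index ] = l;
        }
    }
    memoryBarrierShared();
    barrier();

    // One thread per tile writes out how many lights were kept
    if ( thrID == 0u ) {
        uint tile = gl_WorkGroupID.y * gl_NumWorkGroups.x + gl_WorkGroupID.x;
        tileCount[ tile ] = min( _cntLightsPoint, uint( MAXLIST_LIGHTS_POINT ) );
    }
}
```

With this layout only `TILE_SIZE` needs to change to compare 8*8, 16*16 and 32*32 on a given card, provided the host adjusts the dispatch size to match.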


Thanks for taking time to wrestle through my code pieces Joe!


>> you forget to do a memory barrier on shared memory as well.

All right. Adding "memoryBarrierShared()" in addition to "barrier()" would do the job (to ensure the index-array is done filling before starting the second half)?


Btw, besides crashes, is it possible that bad/lacking usage of the barrier as suggested can cause such a huge slowdown? Like I said, on my computer all seems fine, another one works as expected as well, but just very slow.



>> because OpenCL was two times faster on Nvidia and slightly faster on AMD 1-2 years ago

Now that concerns me. Especially because I used OpenCL before, removed it completely from the engine, and swapped it for OpenGL compute shaders (easier integration, more consistency)...  Doh!


Is it safe to assume that modern/future cards will overcome these performance issues? Otherwise I can turn my Deferred Rendering approach back to an "old" additive style. Does anyone have experience with whether Tiled Deferred Rendering is that much of a win? And then I'm talking about indoor scenes which have relatively many lights, but certainly not hundreds or thousands.


The crappy part is that I'm adapting code to support older cards now, even though I'm far away from a release, so maybe I shouldn't put too much energy on that and bet on future hardware.



>> Unroll

I suppose that can't happen if the size isn't hardcoded (counts1.x comes from an outside (CPU) variable)?



Well, let's try the shared-barrier, different workgroup size, and avoiding unrolling. And see if these video-cards start smiling... But I'm afraid not hehe.


>> >> you forget to do a memory barrier on shared memory as well.
>> All right. Adding "memoryBarrierShared()" in addition to "barrier()" would do the job (to ensure the index-array is done filling before starting the second half)?
>> Btw, besides crashes, is it possible that bad/lacking usage of the barrier as suggested can cause such a huge slowdown? Like I said, on my computer all seems fine, another one works as expected as well, but just very slow.

Yes, barrier syncs only the codeflow, so you need the memory barriers to ensure all writes are done as well.
This could cause e.g. a random huge number of lights and cause slow down / locking driver (but this seems not possible for your code).
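A minimal sketch of the two-phase pattern with both kinds of barrier in place, for illustration only. It assumes a 16*16 tile so that 256 threads cover up to 256 lights. The `LightBlock` layout, the `color` field, `MAX_LIGHTS`, `outImage`, the counter reset and the trivial `pointLightIntersects` stand-in are all hypothetical and not from the original shader. It also clamps the loop count: `atomicAdd` keeps incrementing the counter even when the list is full, so an unclamped loop could read past the end of the shared array.

```glsl
#version 430
#define TILE_SIZE 16
#define MAXLIST_LIGHTS_POINT 64
#define MAX_LIGHTS 256                       // assumed UBO array size

layout (local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;

// Hypothetical light layout: posRange.xyz = position, posRange.w = range
struct PointLight { vec4 posRange; vec4 color; };

layout (std140, binding = 0) uniform LightBlock {
    PointLight lightPoint[ MAX_LIGHTS ];
    uvec4 counts1;                           // x = number of active point lights
};

layout (rgba16f, binding = 0) uniform writeonly image2D outImage;

shared uint _cntLightsPoint;
shared uint _indxLightsPoint[ MAXLIST_LIGHTS_POINT ];

// Stand-in for the real tile-frustum test
bool pointLightIntersects( vec4 posRange ) { return posRange.w > 0.0; }

void main() {
    uint thrID = gl_LocalInvocationIndex;    // same as x + y * TILE_SIZE

    // 0. Shared variables start undefined, so one thread resets the counter
    if ( thrID == 0u ) _cntLightsPoint = 0u;
    memoryBarrierShared();
    barrier();

    // 1. Each thread checks ONE light and appends it to the shared list
    if ( thrID < min( counts1.x, uint( MAX_LIGHTS ) ) ) {
        if ( pointLightIntersects( lightPoint[ thrID ].posRange ) ) {
            uint index = atomicAdd( _cntLightsPoint, 1u );
            if ( index < uint( MAXLIST_LIGHTS_POINT ) )
                _indxLightsPoint[ index ] = thrID;
        }
    }

    // Make the shared writes visible, then wait for every thread in the group
    memoryBarrierShared();
    barrier();

    // 2. Loop through the lights we found, never beyond the list capacity
    uint count = min( _cntLightsPoint, uint( MAXLIST_LIGHTS_POINT ) );
    vec3 sum = vec3( 0.0 );
    for ( uint i = 0u; i < count; i++ ) {
        sum += lightPoint[ _indxLightsPoint[i] ].color.rgb;   // placeholder for addPointLight
    }
    imageStore( outImage, ivec2( gl_GlobalInvocationID.xy ), vec4( sum, 1.0 ) );
}
```

Both `barrier()` calls sit outside the `if` blocks on purpose: every thread of the work group has to reach them, otherwise the result is undefined.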

On the CPU side, when you need to be sure a shader is done, the only thing that worked for me was using a Fence.
glMemoryBarrier() or similar alone was not enough. I've had the feeling this was an Nvidia driver bug.

On AMD there was the problem that i had to remove all deprecated gl functions (like glVertex).
Otherwise compute shader produced wrong results. Checking GL errors helped to find those functions.

>> Is it safe to assume that modern/future cards will overcome these performance issues? Otherwise I can turn my Deferred Rendering approach back to an "old" additive style. Anyone experience if Tiled Deferred Rendering is that much of a win?

I don't know if NV has improved their drivers. But they do support CL 1.2 now, with better OpenGL sharing.
I'd give it a try again to compare performance and troubles.
And looking at what CL 2.0 can do: the GPU controls itself without those costly CPU <-> GPU transfers just to get a number and start another kernel... that's exactly what we need.
(I don't know how DX12 or Vulkan can / will compete here)

Compute is worth it if you have an algorithm that can be made to profit from shared memory.
If you read the same stuff from global memory more than once, you can cache that data in shared memory.
You can build acceleration structures in shared memory to avoid typical fragment shader brute force crap. etc...
Just bang your head against the wall long enough until you get an idea how to make use of it :)
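As an illustration of caching global data in shared memory, here is a minimal sketch in which the work group loads the light list in batches, one light per thread, and every thread then reads the whole batch from shared memory instead of fetching each light from global memory itself. The buffer `Lights`, the uniform `lightCount`, `outImage`, the pixel position and the falloff formula are placeholders assumed for the example.

```glsl
#version 430
#define TILE_SIZE 16
#define GROUP_THREADS 256                    // TILE_SIZE * TILE_SIZE

layout (local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;

// Hypothetical inputs/outputs: xyz = light position, w = range
layout (std430, binding = 0) readonly buffer Lights { vec4 lightPosRange[]; };
layout (rgba16f, binding = 0) uniform writeonly image2D outImage;
uniform uint lightCount;

shared vec4 cachedLight[ GROUP_THREADS ];    // one batch of lights per work group

void main() {
    uint thrID = gl_LocalInvocationIndex;
    vec3 pixelPos = vec3( vec2( gl_GlobalInvocationID.xy ), 0.0 );   // placeholder position
    float sum = 0.0;

    for ( uint base = 0u; base < lightCount; base += uint( GROUP_THREADS ) ) {
        // Each thread fetches ONE light of the batch from global memory
        uint l = base + thrID;
        cachedLight[ thrID ] = ( l < lightCount ) ? lightPosRange[ l ] : vec4( 0.0 );
        memoryBarrierShared();
        barrier();

        // Every thread now reads the whole batch from fast shared memory
        uint batch = min( uint( GROUP_THREADS ), lightCount - base );
        for ( uint i = 0u; i < batch; i++ ) {
            vec4 L = cachedLight[i];
            float d = distance( pixelPos, L.xyz );
            sum += max( 0.0, 1.0 - d / max( L.w, 0.0001 ) );         // placeholder falloff
        }

        // Wait before the next batch overwrites the cache others may still be reading
        barrier();
    }
    imageStore( outImage, ivec2( gl_GlobalInvocationID.xy ), vec4( sum, 1.0 ) );
}
```

The `barrier()` calls inside the loop are safe here because the loop bounds depend only on the uniform `lightCount`, so all threads of a group run the same number of iterations.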

For the Unroll you're right. (Even for small loops disabling unroll is a win sometimes)

Extremely helpful is a GPU profiler - ("Nsight"?)
