
Matias Goldberg


Community Reputation

9635 Excellent



  1. Matias Goldberg

    Speeding up voxelization from a high poly model

    I'd like to say a few things, since I've recently written a voxelizer in compute and done a fair amount of research:

    1. GPU rasterization to voxelize high poly meshes is a bad idea. GPUs are already bad at rasterizing tiny triangles, and this is aggravated by the fact that the approach requires interlocked operations: the high density of vertices causes a lot of contention because multiple threads try to write into the same voxel block. Some papers mention that their implementations show a drastic time increase with poly count due to contention. Triangle density per voxel also plays a big role: a mesh where each voxel touches one or two triangles is not the same as a mesh that has a single voxel with 600 triangles going through it.

    Another problem, which all but a few papers fail to mention (probably out of ignorance), is that unless the voxelization process is very simple, you need to blend your results, and there is no "interlocked average" instruction. Therefore implementations perform a mutex-like locking of a voxel. This is a problem because such approaches can result in an infinite loop: half a warp acquires the lock while another warp (or warps) acquires the other half, so they fight forever over the lock. Implementations that fail to account for this will end in a TDR, which is not immediately obvious unless you're working with high poly meshes, which is where contention happens and the infinite loop cases appear.

    Implementations that successfully account for this add a 'bail out' counter: if the mutex acquisition takes more than N spins, give up. This means the voxelization process may not be accurate, and worse, it may not even be deterministic. But at least a TDR won't happen. You could append those failure cases to a list and process them serially at the end, though.
    The only way to implement this properly is to use Independent Thread Scheduling, which was introduced with Volta and is only supported by NVIDIA GPUs (at the time of writing). This problem may not apply to you, though, if you don't need any complex per-voxel average/mutex. If a simple interlocked operation (like atomic addition) is enough, then ignore this drawback. You can avoid the "atomic blend" problem if your 3D texture is in float format and you track the accumulated weights in another 3D texture, but this consumes a ton of memory. The "atomic blend" problem appears because of memory restrictions: we want to blend an RGBA8 texture or one of similar precision.

    2. That leaves the opposite approach: have each thread perform a box vs triangle test against all primitives. A brute force approach like that is super slow even for a GPU, much worse than doing GPU rasterization. However it can be greatly improved using hierarchy culling: partition the mesh into smaller submeshes, calculate the AABB of each one, and then skip all of a submesh's triangles at once by performing an AABB vs AABB test.

    The compute approach can be further improved by having each thread in a warp load a different triangle, and using anyInvocationARB to test if any of the 64 triangles intersects the AABB that encloses all voxels processed by the warp. If you're lost about this, I explain this optimization in a Stack Overflow reply. While the theoretical performance improvement is up to 64x, in practice this optimization has yielded gains anywhere between 3x and 32x depending on the scene involved (often between 3x and 4x).

    This is what I ended up implementing for Ogre 2.2; you're welcome to try our Test_Voxelizer.exe sample (build Ogre 2.2 using the Quick Start script). Find a way to load your mesh data as an Ogre mesh, modify the sample to load this mesh of yours, and time how long it takes. That way you can easily test whether this approach is worth pursuing or not. If it's not, then go back to the thinktank for something else.
    Note that you should test different values of indexCountSplit in 'mVoxelizer->addItem( *itor++, false, indexCountSplit );' as that value controls how big each partition is, and this can have a huge impact on voxelization performance. There is no 'right' global value, as the best value depends on how your mesh's vertex data is laid out in memory and how much space each partition ends up covering.

    Good luck.

    Cheers
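The hierarchy culling idea from point 2 can be sketched on the CPU in a few lines. This is only an illustration under assumptions, not the Ogre 2.2 voxelizer: the names `Aabb`, `Partition`, `buildPartitions` and `countCandidateTriangles` are made up for this example, partitions are simply consecutive runs of triangles (loosely playing the role that indexCountSplit plays above), and the precise box vs triangle test that would run on the surviving triangles is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };
struct Triangle { Vec3 v[3]; };

// True when the two boxes overlap on all three axes (touching counts as overlap).
inline bool overlaps(const Aabb& a, const Aabb& b) {
    return a.min.x <= b.max.x && a.max.x >= b.min.x &&
           a.min.y <= b.max.y && a.max.y >= b.min.y &&
           a.min.z <= b.max.z && a.max.z >= b.min.z;
}

// Smallest box that contains both inputs.
inline Aabb merge(const Aabb& a, const Aabb& b) {
    Aabb r;
    r.min = {std::min(a.min.x, b.min.x), std::min(a.min.y, b.min.y), std::min(a.min.z, b.min.z)};
    r.max = {std::max(a.max.x, b.max.x), std::max(a.max.y, b.max.y), std::max(a.max.z, b.max.z)};
    return r;
}

// Bounding box of a single triangle.
inline Aabb boundsOf(const Triangle& t) {
    Aabb b = {t.v[0], t.v[0]};
    b = merge(b, Aabb{t.v[1], t.v[1]});
    b = merge(b, Aabb{t.v[2], t.v[2]});
    return b;
}

// Hypothetical partition: a consecutive run of triangles plus the box enclosing them.
struct Partition { Aabb bounds; std::size_t first; std::size_t count; };

// Split the triangle list into runs of at most trisPerPartition triangles.
inline std::vector<Partition> buildPartitions(const std::vector<Triangle>& tris,
                                              std::size_t trisPerPartition) {
    assert(trisPerPartition > 0);
    std::vector<Partition> parts;
    for (std::size_t first = 0; first < tris.size(); first += trisPerPartition) {
        const std::size_t count = std::min(trisPerPartition, tris.size() - first);
        Aabb bounds = boundsOf(tris[first]);
        for (std::size_t i = 1; i < count; ++i)
            bounds = merge(bounds, boundsOf(tris[first + i]));
        parts.push_back(Partition{bounds, first, count});
    }
    return parts;
}

// Number of triangles that still need the precise test for this block of voxels.
// Every partition whose box misses the block is skipped with one AABB vs AABB test.
inline std::size_t countCandidateTriangles(const std::vector<Partition>& parts,
                                           const Aabb& voxelBlock) {
    std::size_t candidates = 0;
    for (const Partition& p : parts)
        if (overlaps(p.bounds, voxelBlock))
            candidates += p.count;
    return candidates;
}
```

How well this works depends on how tight the partition boxes are: if consecutive triangles in the index buffer are far apart in space, every partition box becomes huge and almost nothing gets culled, which is why the partition size needs tuning per mesh.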
  2. Matias Goldberg

    Fence usage in double buffering

    And that's bad because...? It's called double buffering exactly because of this. The "double" in double buffer means the region of memory you write at frame 0 is different from the one you will be writing at frame 1, at least for the data that must change every frame (like view & world matrices, etc).

    To use an analogy, there are two trucks: you load packages into the first truck; when you're done the truck sets off, and you start loading more stuff into the second truck while the first one is en route to its destination and back. If you're lucky, by the time you're done loading truck #2, truck #1 has already come back. If not, you'll have to wait a little.

    No, there is not. The GPU won't start executing your commands until you submit them, and the CPU won't do anything more because it's waiting on the fence. As per the truck analogy, truck #1 can't start until you're done loading all the packages, and once that's done you sit idle waiting for truck #1 to come back before you start working on truck #2. That's an inefficient use of your time.

    Yes, the framerate difference can be up to 2x. That's a lot. (YMMV, it depends on the kind of workloads you're doing)
  3. I honestly can't remember right now the exact reasons for using input buffering. However the overall concept always boils down to 'dampening': smoothing out 'bumpiness' or 'jerkiness' by introducing latency, so that we neither starve for inputs nor receive them all at once. This concept applies to rendering, input, networking, etc. Think of trying to drink from a malfunctioning water tap: it sends huge bursts, then nothing, then huge bursts again. You can't drink from that! So you attach a bottle to the tap, wait to let it fill a little, then make a tiny hole in the bottle from which you can drink a constant, steady stream.

    Only if you decide to solve that problem in the way I described (which only applies to deterministic simulations). If the server can't catch up, then players may experience 'pauses' (jitter) or 'slowdowns' (e.g. like running a video at 0.75x speed). Note that 25% packet drop doesn't necessarily mean 0.75x playback rate, unless the server is super slow (i.e. it can barely hit the target rate) and you're sending your packets very spaced apart. If you send more packets, you compensate for the packet drop.

    This is a tricky thing. TCP assumes that if packets are being dropped, it's because servers are overwhelmed and cannot process them, hence increasing the packet sending rate makes things worse (thus you should send less). But in reality packets can also get dropped because routes disappear, wifi signals are noisy, or cable modem/DSL lines have noise in them. Also, TCP is delivered in order, and if a few packets are lost, TCP stops the whole thing and waits until it receives the missing packets. It's like saying "everyone silence! I want to hear it slowly, say it again"; whereas Glenn's method never stops because it includes redundancy. Packet C includes the info from packets A and B, so if packet C arrives then everything can proceed. Eventually the client gets notified that A, B & C have been acknowledged, so it should stop sending them in packet E. You should only stop the whole thing if a lot of packets have gone unacknowledged, which means either that the other end is dead or that the connection is extremely poor.
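The redundancy scheme described above (every packet carries all inputs that have not been acknowledged yet) can be sketched as follows. This is a simplified illustration, not Glenn's actual implementation: `InputSender`, `InputReceiver`, `Packet` and the one-byte `buttons` payload are hypothetical names and types chosen for the example, and real code would also serialize, compress and cap the packet size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Input {
    std::uint32_t frame;   // simulation frame this input belongs to
    std::uint8_t buttons;  // hypothetical payload
};

// A packet carries every input the other side has not acknowledged yet.
struct Packet { std::vector<Input> inputs; };

class InputSender {
public:
    // Record this frame's input and build a packet that also repeats
    // all older inputs that are still unacknowledged.
    Packet makePacket(std::uint8_t buttons) {
        pending_.push_back(Input{nextFrame_++, buttons});
        Packet p;
        p.inputs.assign(pending_.begin(), pending_.end());
        return p;
    }

    // The other side confirmed it has everything up to and including `frame`,
    // so those inputs no longer need to be resent.
    void acknowledge(std::uint32_t frame) {
        while (!pending_.empty() && pending_.front().frame <= frame)
            pending_.pop_front();
    }

    std::size_t unackedCount() const { return pending_.size(); }

private:
    std::deque<Input> pending_;
    std::uint32_t nextFrame_ = 0;
};

class InputReceiver {
public:
    // Consume only the inputs we have not seen yet; returns how many were new.
    std::size_t receive(const Packet& p) {
        std::size_t added = 0;
        for (const Input& in : p.inputs) {
            if (in.frame == nextExpected_) {
                applied_.push_back(in);
                ++nextExpected_;
                ++added;
            }
        }
        return added;
    }

    std::uint32_t nextExpectedFrame() const { return nextExpected_; }

private:
    std::vector<Input> applied_;
    std::uint32_t nextExpected_ = 0;
};
```

If packets A and B are lost but C arrives, the receiver still gets frames 0, 1 and 2 from C alone, so the simulation never has to stall waiting for a retransmission.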
  4. He explains it in more detail in his blog (which is down, so use the web archive). Your concerns about latency are correct. There's something more to say about that: you can pretend the player is at a different location. If you download the YouTube video and watch it frame by frame, you'll see that his deterministic video with TCP @ 100ms round trip and 1% packet loss is 100ms out of phase!!! Only the big cube is at the right location! The big cube is being rendered at a location where it's not supposed to be. It's actually lagging behind in the simulation too. The rest of the cubes are lagging behind, but the player's location is being predicted. If it's an action game, you need to do the same for other players, otherwise your players will shoot at a location where the enemy no longer is. Deterministic lockstep for action games is a terrible idea though. It's more suited for slower games like strategy games with many hundreds of units.

    He's using a UDP implementation in that part of the video. His implementation is very tolerant of 25% packet loss because every packet contains a sequence of inputs from past frames that haven't been acknowledged. So even if 1 in every 4 packets doesn't arrive, the other 3 packets have redundancy and contain what the lost packet was carrying (and if the server is fast enough, it can catch up by simulating two frames in one frame, but beware that this introduces jitter if not smoothed out well). As for the latency, again he is using client side prediction to display the big cube (remember that what you see is not reality: in the client side simulation the cube is in a different location, but it is rendered in a predicted position to hide the latency). If you watch it frame by frame, you'll see the small cubes are out of sync by a lot.
  5. If we travel back in time to 2010, there wasn't much to choose from. Unity was paid and starting to get known, and an Unreal Engine license cost tens of thousands of dollars. I don't remember if CryEngine was open to licensing. So the question would have been unequivocally "use Ogre3D or roll your own engine" for games. We did lose a lot in terms of community since lots of users started to migrate to all these new options. But the small community we still have usually responds quickly, at least when it comes to 2.1, and I usually respond quickly as well. You still get support if that's what you're asking. But it's a far cry from our very active community of 8 years ago, and a big community meant your obscure questions were likely to be answered (like "I can't get Ogre to run on a custom Arduino"), questions which most of us cannot answer or have little experience with.

    As for being an Ogre old timer: if you were familiar with Ogre 1.4/1.6/1.7/1.8/1.9 then we have a porting manual for you (the online manual is more up to date, but the porting part hasn't changed). The recommended flow for old timers porting is Ogre 1.x -> 2.0 -> 2.1.

    Ogre 2.0 implemented some architectural changes to core (making SceneNode traversal cache friendly, SIMD and threadable) plus a new Compositor architecture (whose scripts are similar in syntax, thus relatively easy to port). In fact the samples from 2.0 are the same as in 1.9; thus Ogre 2.0 still feels very 1.4-1.9. Note that 2.0 is not actively maintained and, as the link explains, is a good "middle step" to move to 2.1.

    Ogre 2.1 added architectural changes to the renderer and material system. If you've used Unity or UE4, the material system will feel very familiar because "it just works" by defining PBR properties such as roughness, diffuse, fresnel, etc.
    That is very different from 1.x's materials (now called "low level materials" in 2.1; they are still there but mostly used for compositor effects, and are not recommended for scene objects), which required you to manually specify a vertex & pixel shader. So by trying them in sequence, 1.x -> 2.0 -> 2.1, you can see the changes progressively. Going from 1.x -> 2.1 directly is possible but feels more shocking. Many concepts are still there: the ResourceGroupManager hasn't changed, and the new "Item" that replaces "Entity" (now v1::Entity) works similarly (it still needs to be attached to a node, most functions have the same name, etc). Perhaps what's most confusing is that we call "HlmsDatablock" or simply "datablock" what you would normally call a "Material", because the name was taken by the old materials. Does that answer your question?
  6. Hi! I'm an Ogre3D dev. There are many things that can be said. As you said, Ogre3D is a rendering engine rather than a game engine. Although it's fair to say that compared to other rendering frameworks (e.g. bgfx, The Forge / ConfettiFX, D3D12 Mini Engine) we do a lot more, putting us closer to a game engine, but not quite there. We don't do physics, sound or networking. It's great if you want to glue different components together (or write your own), but not so much if you just want to start making a game out of the box ASAP. That's where Unity, UE4, Godot, Wicked, and Skyline shine as game engines. Ogre stays strong in non-game applications, such as simulation and architecture software.

    As to development: Ogre has two main branches, +2.1 and 1.x. I maintain +2.1 while Pavel maintains the 1.x one. Ogre 2.1 isn't WIP. In fact there are released games using it, such as Racecraft and Sunset Rangers. Skyline Game Engine is built on top of Ogre 2.1. The main reason you haven't seen an official release is that our CMake scripts that generate the SDKs are ancient, and the SDK they generate does not match the folder structure of an out-of-source build. This problem isn't new, it affects Ogre 1.x too; but it is too bothersome and there's a lot of ancient CMake legacy to sort out. Previously another reason was that we were waiting for GLES3 support (to get Android support for 2.1), but it became less of a priority given Android's extremely poor driver quality. But that doesn't mean Ogre is WIP.

    The main highlights from Ogre 2.1:

    Serious boost in performance. Ogre 1.9 used to be the slowest alternative. Pavel has been doing a lot of good work to improve that in 1.11; but in Ogre 2.1 we made architecture changes to address the problem at the root. It's common to hear of 4x improvements when migrating to 2.1. Nowadays we do very well against the competition on this front.
    We seem to be popular for VR simulations thanks to that (but still keep in mind VR requires rendering at 90fps for two eyes, and that is hard whatever engine you pick!).

    PBR material & pipeline.

    Hlms materials are friendlier, as we handle all the shader work for you (unless you specifically don't want us to), unlike 1.9 where you had to either write your own shaders or set up the RTSS component.

    There are many more differences but I don't want to clutter this forum thread and turn it into a changelog.

    With OpenGL you can target Linux (NVIDIA & Mesa-based drivers), with GL & D3D11 you can target Windows (minimum GPU: D3D10 hardware), with Metal you can target iOS, and with Metal you can target new Macs while you can use GL to target older Macs (our GL for Mac is beta, not to mention Apple itself is deprecating it). We want to target Android, and a community member did an excellent job, but the poor driver quality continues to be an issue. We get deadlocks inside the driver while compiling shaders, incorrect results when using shadow mapping, and crashes while generating mipmaps. We may address this via workarounds for Ogre 2.2, or maybe through Vulkan. But at the current time, if you plan on targeting Android, you'll have to use Ogre 1.11 or something else. Of course it's not as easy as Unity, where targeting another platform is just one click away. The rest of your application code has to be able to run on those platforms as well and deal with the differences.

    Ultimately it depends on what you want to deal with. Personally, I and most of the people I work with like to write custom stuff, specific to our needs. That gives us flexibility, better framerates and a distinct look. Or it's because we like to own our tech ("owning" as in: if something is broken we can fix it ourselves because we've got source code access, or we can write our own alternative, or we can swap in another solution). UE4 gives you source level access (though it's a huge codebase) but it's not technically free.
    The standard Unity license doesn't give you source level access. It may be fine for you, but then if a feature you need is broken, you have to patiently wait for the next release for a fix (if it gets fixed, and pray the upgrade doesn't break your game); I admit it is the one with the most user friendly interface. Godot is lighter and open source, so that engine would be my choice if I were to go for a game engine. Ogre3D gives you more control and power, and generally better performance, which is something to keep in mind if your game wants to have a lot of dynamic objects on screen (e.g. RTS games fit that description). But the downside is that you have to do a lot of tech work yourself. It's not "free" in that sense.

    Right now our development efforts are focused on Ogre 2.2 (which IS WIP), and as our Progress Report from December 2017 says, it centers on an overhaul of the texture code to heavily improve memory consumption and allow for background streaming (among other improvements). But we still add incremental features to 2.1; for example we've recently added approximate fake area lights (including textured area lights; "fake" meaning that the math is not PBR).

    Cheers
  7. Matias Goldberg

    HLSL's snorm and unorm

    UAVs require this. These modifiers are not meant for local variables, thus it's very likely a unorm float local variable will just behave like a regular float. As to your out of range conversion, saturation happens. Snorm -1 becomes 0 when stored to a unorm buffer. 5.7 becomes 1.0 when stored to u/snorm, and -7 becomes -1 when stored to snorm. I don't remember about NaNs but I think they get converted to zero. Also watch out for -0 (negative zero) as it becomes +0 when stored as snorm.
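The saturation behaviour described above can be reproduced on the CPU with a small sketch. This emulates an 8-bit format in C++ and is only an approximation of what the hardware does: the function names are made up for this example, the rounding mode is an assumption, and mapping NaN to zero follows the post's recollection rather than something verified here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// float -> 8-bit UNORM: NaN becomes 0 (assumption), the value is clamped
// to [0, 1], scaled to [0, 255] and rounded to the nearest integer.
inline std::uint8_t floatToUnorm8(float v) {
    if (std::isnan(v)) v = 0.0f;
    v = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(v * 255.0f + 0.5f);
}

// float -> 8-bit SNORM: NaN becomes 0 (assumption), the value is clamped
// to [-1, 1], scaled to [-127, 127] and rounded to the nearest integer.
inline std::int8_t floatToSnorm8(float v) {
    if (std::isnan(v)) v = 0.0f;
    v = std::min(std::max(v, -1.0f), 1.0f);
    return static_cast<std::int8_t>(std::lround(v * 127.0f));
}

// Reading the stored integers back as floats.
inline float unorm8ToFloat(std::uint8_t v) { return v / 255.0f; }
inline float snorm8ToFloat(std::int8_t v) { return std::max(v / 127.0f, -1.0f); }
```

Negative zero loses its sign here for the same reason it does in the buffer: the stored integer 0 has no sign bit, so it always reads back as +0.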
  8. Matias Goldberg

    User authentication without storing pw's

    That is because there is a large number of high profile companies using extremely poor security practices. Yahoo! had their passwords stolen, which were stored using MD5. LinkedIn was using unsalted SHA-1. They failed at basic security implementations, which should be embarrassing given their size. Any strong password implementation should use salted bcrypt passwords, and every time you log in the password exchange must add extra randomized salts. Thus the passwords are never exchanged in plain text, the hashed password exchange is never the same twice (thus preventing spoofing), and if the password database gets stolen, it would take significant time (years) to crack them.

    The problem with emailing tokens is that emails are sent in plain text and pass through a lot of servers. It's SO easy to steal them without you ever knowing (and without having to hack your email credentials in any way). I don't want this topic to become political, but that's why the investigation about Hillary Clinton is so important: she stored confidential information in her private email account. Regardless of why she did it or whether she should have done that, doing that is extremely insecure. Basically, sending emails is the equivalent of shouting extremely loudly in public. Everyone knows.

    You can use public/private key encryption for emails, however that creates the problem of telling people to send you their public keys, which you'll have to keep in a database that can get stolen. Yes, stealing a public key is worthless. But I could argue so is stealing a salted bcrypt hashed password. Also, if hackers manage to steal your private key, they can now impersonate you. Yeah, you can try to revoke your private key. But you can equally ask the users to reset their passwords.
  9. The way I approached this was by conceptually splitting the textures into "data" and "metadata". Metadata is resolution, pixel format and type information (e.g. is this a cubemap texture?). Metadata is usually the first thing that gets loaded. So whenever I need it, I can call texture->waitForMetadata(); which will return once the background thread has that information. Or I can register a listener to get notified when the metadata is ready.

    Of course management can get more complex: do you want metadata to get prioritized (RIP seek times)? Or are you willing to wait for the 30 textures that come before to finish both their data and metadata before the texture you want gets to load its resolution from file? Or maybe you want to design your code so that UI textures get loaded first?

    A complementary solution is keeping a metadata cache that gets stored to disk, so you have this info available immediately on the next run. When the metadata cache becomes out of date, your manager updates the cache when it notices the loaded texture didn't match what the cache said. You could also build the cache offline. If you rely on the metadata never being wrong, then you need to write some sort of notification & handling system (tell everyone the data has changed and handle it; or abort the process, notify the user and run an offline cache tool that reparses the file; or look at timestamps to check what needs to be updated, etc)
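A blocking wait like the texture->waitForMetadata() call mentioned above can be sketched with a mutex and a condition variable. This is a minimal illustration, not the engine's actual code: `StreamedTexture`, `TextureMetadata` and `setMetadata` are hypothetical names, and a real implementation would also carry the pixel format, handle load failures and support the listener path.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical subset of the metadata: resolution and texture type.
struct TextureMetadata {
    std::uint32_t width;
    std::uint32_t height;
    bool isCubemap;
};

class StreamedTexture {
public:
    // Called by the background loader once the file header has been parsed.
    void setMetadata(const TextureMetadata& m) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            metadata_ = m;
            metadataReady_ = true;
        }
        ready_.notify_all();  // wake every thread blocked in waitForMetadata()
    }

    // Blocks the caller until the loader has published the metadata,
    // then returns a copy of it. Returns immediately if it is already there.
    TextureMetadata waitForMetadata() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return metadataReady_; });
        return metadata_;
    }

    // Non-blocking query, useful for code that prefers to poll.
    bool isMetadataReady() {
        std::lock_guard<std::mutex> lock(mutex_);
        return metadataReady_;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    TextureMetadata metadata_{};
    bool metadataReady_ = false;
};
```

In use, the streaming thread would call setMetadata() as soon as it has read the header, well before the pixel data is decoded, while the main thread calls waitForMetadata() only at the point where it truly cannot proceed without the resolution.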
  10. Matias Goldberg

    Semi Fixed TimeStep Question.

    Suppose you minimize the application so everything stops; 30 seconds later the application is restored. The counter will now read that you're lagging 30 seconds behind and need to catch up. Without the std::min, and depending on how your code works, either you end up calling update( 30 seconds ), which will screw up your physics completely; or you end up calling for( int i=0; i<30 seconds * 60fps; ++i )update( 1 / 60 ); which will cause your process to become unresponsive for some time while it simulates 30 seconds worth of physics at maximum CPU power in fast forward. Either way, it won't be pretty.

    The std::min there prevents these problems by limiting the elapsed time to some arbitrary value (MAX_DELTA_TIME), usually smaller than a second. In other words, pretend the last frame didn't take longer than MAX_DELTA_TIME, even if it did. It's a safety measure.

    Edit: Just implement the loops described by gafferongames in a simple sample, and play with whatever you don't understand. Take it out for example, and see what happens. That will be far more educational than what we can tell you.
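The clamp described above can be seen in a small accumulator-style loop. This is a sketch under assumptions rather than the code from the original question: `Simulation`, `advance` and the 0.25 second value chosen for MAX_DELTA_TIME are made up for the example, and update() only counts steps instead of running physics.

```cpp
#include <algorithm>
#include <cassert>

constexpr double FIXED_DT = 1.0 / 60.0;   // simulation step: 60 updates per second
constexpr double MAX_DELTA_TIME = 0.25;   // arbitrary cap, smaller than a second

struct Simulation {
    double accumulator = 0.0;  // time received but not yet simulated
    int totalSteps = 0;

    // Stand-in for the real physics update; it only counts calls.
    void update(double /*dt*/) { ++totalSteps; }
};

// Feed the wall-clock time elapsed since the previous frame into the simulation.
// Returns how many fixed steps were run for this frame.
inline int advance(Simulation& sim, double elapsedSeconds) {
    // Pretend the last frame never took longer than MAX_DELTA_TIME,
    // so a 30 second stall does not turn into 1800 catch-up steps.
    const double frameTime = std::min(elapsedSeconds, MAX_DELTA_TIME);
    sim.accumulator += frameTime;

    int stepsThisFrame = 0;
    while (sim.accumulator >= FIXED_DT) {
        sim.update(FIXED_DT);
        sim.accumulator -= FIXED_DT;
        ++stepsThisFrame;
    }
    return stepsThisFrame;
}
```

With the cap in place, restoring the window after 30 seconds costs about 15 updates instead of 1800. Removing the std::min line and calling advance( sim, 30.0 ) is an easy way to see the fast-forward behaviour described above.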
  11. Matias Goldberg

    what is the appeal of fps games?

    Mass Effect was a shooter, but it wasn't a first person shooter. It's a Third Person Shooter, like Tomb Raider, Just Cause, GTA, Hitman, etc. Not quite the same beast.

    Yup. It's good to point it out because the user base usually takes great pride in the "realism" of such games (is it the users? the marketing? I don't know), when the reality is that "realism" isn't fun. Reality means that with a bullet travelling at supersonic speed, you first get hit and then hear the sound coming. A fun game means the sound is simultaneous with the impact, so the player knows where the bullet is coming from in order to shoot back.
  12. Matias Goldberg

    Resolving MSAA - Depth Buffer to Non MSAA

    In my experience sampling from an MSAA surface can be very slow if your sampling pattern is too random (i.e. I attempted this for SSAO and SSR, and it didn't go well). I'm not sure why, whether it's because of cache effects or the GPU simply not being fast at it. But that's likely not a problem for your scenario. If you're taking this approach, using a regular colour buffer (instead of a depth buffer) would be better. Depth buffers have additional baggage you don't need (Z compression, early Z) and will only waste memory and cycles. Btw if you're going to be resolving the MSAA surface, remember that an average of Z values isn't always the best idea.

    That won't work because the water still needs the depth buffer to avoid being rendered on top of opaque objects that should be obstructing the water. However, as MJP pointed out, you can have the depth buffer bound and sample from it at the same time as long as it's marked read-only (IIRC this only works on D3D10.1 HW onwards). This is probably the best option if all you'll be doing is measuring the depth of the ocean at the given pixel.

    Cheers
  13. Matias Goldberg

    Trying to finding bottlenecks in my renderer

    You're still not using the result. printf() rData[0] through rData[3] so it is used. And source your input from something unknown, like argv from main or by reading from a file; otherwise the optimizer may run the code at compile time and hardcode everything (since everything could be resolved at compile time rather than calculated at runtime).

    Sure. I can save you some time by telling you warp and wavefront are synonyms. A warp is what NVIDIA's marketing dept calls them, a wavefront is what AMD's marketing dept calls them. You can also check my Where do I start Graphics Programming post for resources, particularly the "Very technical about GPUs." section. You may find the one that says "Latency hiding in GCN" relevant.

    As for bank conflicts... you got it quite right. Memory is subdivided into banks. If all threads in a wavefront access the same bank, all is ok. If each thread accesses a different bank, all is ok. But if some of the threads access the same bank, then things get slower.

    I was just trying to point out there's always a more efficient way to do things. But you're making a videogame. At some point you have to say "STOP!" to yourself, or else you'll end up in an endless spiral of constantly reworking things and never finishing your game. There's always a better way. You need to know when something is good enough. I suggest you go for the vertex shader option I gave you (inPos * worldMatrix[vertexId / 6u]). If that gives you enough performance for what you need, move on. Trying Compute Shader approaches also restricts your target (e.g. older GPUs won't be able to run it), which is a terrible idea for a 2D game.
  14. Matias Goldberg

    Trying to finding bottlenecks in my renderer

    I got confirmation from an AMD driver engineer himself. Yes, it's true. However, don't count on it. The driver can only merge your instancing into the same wavefront if several conditions are met. I don't know the exact conditions, but they're all related to HW limitations. I.e. if the driver cannot 100% guarantee that the GPU can always merge your instances without rendering artifacts, then it won't (even if it would be completely safe given the data you're going to be feeding it, because the driver doesn't know that a priori, or it would take a considerable amount of CPU cycles to determine so). When it comes to AMD, access to global memory may have channel and bank conflict issues. NVIDIA implements it as a huge register file, so there's always a reason...
  15. Matias Goldberg

    Trying to finding bottlenecks in my renderer

    You don't have to issue one draw call per sprite/batch!!! Create a const buffer with the matrices and index them via SV_VertexID (assuming 6 vertices per sprite):

        cbuffer Matrices : register(b0)
        {
            float4x4 modelMatrix[1024]; //65536 bytes max CB size per buffer / 64 bytes per matrix = 1024
            // Alternatively
            //float4x3 modelMatrix[1365]; //65536 bytes max CB size per buffer / 48 bytes per matrix = 1365.3333
        };

        uint idx = svVertexId / 6u;
        outVertex = mul( modelMatrix[idx], inVertex );

    That means you need a DrawPrimitive call every 1024 sprites (or every 1365 sprites if you use affine matrices). You could make it just a single draw call by using a texture buffer instead (which doesn't have the 64kb limit). This will yield much better performance.

    Even then, it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts. A more optimal path would be to update the vertices using a compute shader that processes all 6 vertices in the same thread, so that each thread in a wavefront accesses a different bank (i.e. one thread per sprite).

    Instancing will not lead to good performance, as each sprite will very likely be given its own wavefront unless you're lucky (on an AMD GPU, you'll be using 9.4% of processing capacity while the rest is wasted!). See Vertex Shader Tricks by Bill Bilodeau.
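The batching arithmetic above can be checked on the CPU side with a few helper functions. This is only a sketch of the bookkeeping: the constant and function names are invented for this example, and the actual buffer updates and draw calls are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxConstBufferBytes = 65536;  // 64kb limit per const buffer
constexpr std::size_t kBytesPerMatrix4x4 = 64;       // 16 floats
constexpr std::size_t kBytesPerMatrix4x3 = 48;       // 12 floats (affine)
constexpr std::uint32_t kVerticesPerSprite = 6;      // two triangles, no index buffer

// How many matrices (and therefore sprites) fit in one const buffer.
constexpr std::size_t matricesPerBuffer(std::size_t bytesPerMatrix) {
    return kMaxConstBufferBytes / bytesPerMatrix;
}

// Number of draw calls needed for spriteCount sprites, rounding up.
constexpr std::size_t drawCallsFor(std::size_t spriteCount, std::size_t spritesPerDraw) {
    return (spriteCount + spritesPerDraw - 1) / spritesPerDraw;
}

// Mirrors the shader-side lookup: which matrix a given vertex should use.
constexpr std::uint32_t matrixIndexForVertex(std::uint32_t vertexId) {
    return vertexId / kVerticesPerSprite;
}
```

For example, 5000 sprites need 5 draw calls with float4x4 matrices and 4 with float4x3, compared to 5000 draw calls when issuing one per sprite.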