Ryan_001 Member Since 23 Apr 2003
Posted by Ryan_001 on Yesterday, 01:29 PM
Posted by Ryan_001 on 28 June 2015 - 12:22 PM
Posted by Ryan_001 on 20 June 2015 - 05:35 PM
Intel chipsets can stream data to RAM at about 13GB/s, but PCIe can handle about 16GB/s. That's 200MB+ of dynamic data per 60Hz frame. Streaming to GPU RAM is fast.
Fast isn't quite accurate. They're very high throughput, not low latency.
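The per-frame figure quoted above is just throughput divided by frame rate. A small sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and the sustained rates from the quote; the function and constant names are made up for illustration:

```cpp
#include <cassert>

// Megabytes that can be moved during one frame at a given sustained throughput.
// Assumes decimal units: 1 GB = 1000 MB.
constexpr double megabytesPerFrame(double gigabytesPerSecond, double framesPerSecond) {
    return gigabytesPerSecond * 1000.0 / framesPerSecond;
}

// Figures taken from the quoted post: ~13 GB/s to system RAM, ~16 GB/s over PCIe.
constexpr double systemRamMBPerFrame = megabytesPerFrame(13.0, 60.0);  // roughly 217 MB
constexpr double pcieMBPerFrame      = megabytesPerFrame(16.0, 60.0);  // roughly 267 MB
```

Note this only bounds how much data fits in a frame; it says nothing about how long any single transfer takes to arrive, which is the latency point.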
The APIs do help out -- often if you ask to map a GPU resource for writing, the driver will give you a direct pointer into the GPU RAM, with the pages marked as write-combined (so that the CPU hardware will automatically buffer up large PCI-transfer-sized chunks of data when streaming).
The new APIs (D3D12/Vulkan) make this explicit, adding enums for page cache policies (write-combined vs write-back) and making GPU-resident pointers an official feature. However, this is all the same on shared memory architectures. Different parts of your shared memory are configured to use different caching policies -- generally when streaming data "to the GPU", you'll still be having the CPU write into write-combined pages. The only difference is that now instead of those writes going over your PCIe bus into GDDR5, they're going over your regular system bus into your DDR3... so that the CPU+GPU both share that 13GB/s, instead of having a total 13+16GB/s. Future systems (e.g. AMD/nVidia ones) would have to add a lot of extra CPU<->RAM interconnects to make up for the fact that the fast PCI bus has been removed!
Except we're now talking about HBM/on-chip memory, not DDR3, DDR4, or GDDR5. Granted you still have a shared bus. But I think that HBM allows a little more room to play. On the Fury it's a 1024 bit bus (if I'm reading it correctly). It's already far faster than anything else out there and will only continue to get faster as they work the kinks out. Clearly the CPU-side will need a larger cache, or maybe even an expanded shared level 3/level 4 cache. The cache policies can also be made to work in our favor here. There are times when the CPU would want to work on large blocks of data, and times when it needs smaller/lower latency access, likewise for the GPU. Instead of these pages being 'GPU' memory and these other pages being 'CPU' memory, it could simply be that these pages are high throughput/high latency, and these others are low throughput/low latency (and other various combinations). You put your low frequency updated data in the first set of pages and your high frequency updated data in the latter set. GPU or CPU, it doesn't really matter; the access patterns are similar enough.
Current APIs support two methods for moving data between system and GPU RAM -- either the CPU does it (which can be asynchronous with GPU tasks), or the GPU does it, which blocks all graphics tasks until the transfer is complete. Generally you want to use the first option, because it allows you to hide the transfer cost inside your GPU command latency window, so the critical path doesn't grow (assuming you're GPU-time bottlenecked).
The new APIs however, add support for GPU-side DMA command queues, where you can request the GPU to perform transfers that are asynchronous with GPU tasks -- having the GPU push or pull data won't block regular drawing commands, which is great -- we don't have to waste the CPU's time on moving data around any more... but, in order to make full use of these features, you need your regular drawing commands to have a decent amount of latency, so that your DMA commands can complete inside that latency window...
Or if shared memory is fast enough, perhaps we can finally do away with all the nonsense and just pass a few pointers around.
Exactly! It only takes 1ms of CPU time to generate enough commands to keep a GPU busy for 33ms... which means the CPU has got itself almost a whole frame's worth of time to fill with other work before it even has to think about generating another command buffer. Some older APIs were designed for the CPU to constantly be flushing commands to the GPU as they're generated, but the new APIs are very much designed around the CPU very quickly creating very large command buffers, which will be flushed through to the GPU in large blocks, not draw-by-draw.
The fact that the CPU really shouldn't be doing any significant amount of (graphics) work, while the GPU does a whole frame's worth of work, is why we have the typical one-frame-latency on GPU commands -- it's so that the GPU can be kept as busy as possible.
Yes, but that's 'old-style' thinking. That's the way we do things now because of the high latency, not because it's what we want to do. If we had low latency we could find a ton of cool things for the CPU to do so that it's not 1ms:33ms. It could be closer to 30:33. From dynamic data generation, to better AI, to complex LOD/culling. There's a whole slew of cool things we could be doing if there wasn't this massive latency.
Unfortunately, the Windows kernel does have to patch your command buffers, depending on the memory hardware being used under the hood, for stability/virtualization/security purposes. Consoles don't have this problem (they can do user-mode submission), but only because a lot more trust is placed on the developers to not undermine the operating system.
Newer hardware with better memory and virtualization systems might be able to fix this issue.
Yes, and unfortunately without both video card manufacturers and the OS developers on board this sort of tight integration won't be seen anytime soon. One can dream though.
E-Sports is a big market now, and will only grow. And in that market, it's not the flashiest graphics, the largest textures, or the highest resolutions that win. They want massively high and stable framerates. The high framerates are necessary simply because current hardware/APIs queue up so many frames. If you dropped the latency, you could also drop the framerate and few if any people would notice. For example, even Mario Bros on the NES had lower total latency (i.e. the time from you hitting jump until the phosphors on the monitor changed) than most top end AAA games running at 60+fps. For all the pretty polygons we throw around, we've lost a lot in the process. I think (hope) it'd be awesome if HBM allows us to finally start moving back to lower latency gaming. It's obviously not the only piece of the puzzle, but it is a rather critical one IMHO.
Posted by Ryan_001 on 18 June 2015 - 02:33 PM
Posted by Ryan_001 on 03 May 2015 - 08:12 PM
I ask because I know from the FreeType mailing list that people have been having trouble with very script-like fonts and many Asian fonts. There are so many individual sub-glyphs within a single glyph that they are not being rendered correctly.
I'm sure they've tested theirs on more fonts than I have mine. Perhaps mine would work where theirs does not? I didn't write this to replace another library or be 'THE' font rendering library; rather it was something fun to do at the time and something I needed. That said, the only real issue I can see causing problems is intersecting contours, which would produce an erroneous triangulation.
If you have any fonts in mind (along with their corresponding code-points) I'd be happy to test it against them.
Posted by Ryan_001 on 03 May 2015 - 07:59 PM
I wrote this: https://sourceforge.net/projects/ttftriangulator/ which might be of help. It's a library that converts font glyphs into actual mesh data for rendering.
That's nice! Can it support non-English glyphs?
Yes, and it should (though I've never tested it) support both horizontal and vertical kerning.
Posted by Ryan_001 on 21 April 2015 - 06:12 PM
If it's an fx shader (has things like technique/pass data) you can compile it using the fx profiles (fx_2_0, fx_4_0, fx_5_0, etc...). Note that you will then get an effect file that requires the effects framework to use. If you want just the shader and not the full effect then, as TiagoCosta pointed out, you'll have to compile it multiple times with vs_3_0, ps_3_0, etc...
Posted by Ryan_001 on 17 April 2015 - 06:27 PM
Template when you can, macro when you must.
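A minimal sketch of why that rule of thumb holds, using made-up names (`SQUARE_MACRO`, `square`, `STRINGIFY`, `nextValue`): the macro pastes its argument text twice, so any side effect runs twice, while the template evaluates its argument once and is type-checked. Stringification is one of the cases where a macro is still the only option.

```cpp
#include <cassert>

// Macro version: the argument text is pasted twice, so side effects run twice.
#define SQUARE_MACRO(x) ((x) * (x))

// Template version: the argument is evaluated exactly once and is type-checked.
template <typename T>
constexpr T square(T x) { return x * x; }

// "Macro when you must": only the preprocessor can turn an expression
// into its source text.
#define STRINGIFY(expr) #expr

// Hypothetical helper with a visible side effect, used to expose the difference.
inline int nextValue(int& counter) { return ++counter; }
```

Calling `square(nextValue(c))` bumps `c` once, whereas `SQUARE_MACRO(nextValue(c))` bumps it twice and multiplies two different values.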
Posted by Ryan_001 on 31 March 2015 - 12:56 AM
Learning about WHY lock-less structures are not all they seem to be is very important IMHO. As Hodgman said, often you'll find the additional algorithmic complexity overshadows any gains from lock-less queues. Being able to implement one is less important IMHO.
Posted by Ryan_001 on 02 March 2015 - 08:44 AM
An easy way (but a bit slow) that I did it was to use the edge circumcircle property. Some lecture notes state (pg. 27, Def 28) that an edge is Delaunay if it has at least one empty circumcircle. This makes for a pretty easy brute force algorithm: for each candidate edge you simply count the number of vertices in its circumcircle. Once you have your edges you can easily form triangles, or a Voronoi diagram, or what have you.
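One way to sketch that brute-force test follows. This is my reading of the definition, not necessarily how the original code did it. Every circle through both endpoints has its centre on the edge's perpendicular bisector, so the centre can be written as `m + t*n`. Each other vertex then rules out a half-line of `t` values, and the edge is Delaunay if some `t` survives. Points on the circle itself count as outside here, so co-circular inputs can yield extra edges. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

struct Point { double x, y; };

// True if some circle through points[i] and points[j] has no other point
// strictly inside it. O(n) per edge.
bool isDelaunayEdge(const std::vector<Point>& points, std::size_t i, std::size_t j) {
    assert(i != j && i < points.size() && j < points.size());
    const Point p = points[i], q = points[j];
    const Point m{(p.x + q.x) / 2, (p.y + q.y) / 2};  // edge midpoint
    const Point n{-(q.y - p.y), q.x - p.x};           // bisector direction
    const double halfLenSq = (m.x - p.x) * (m.x - p.x) + (m.y - p.y) * (m.y - p.y);

    double lo = -std::numeric_limits<double>::infinity();
    double hi = std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < points.size(); ++k) {
        if (k == i || k == j) continue;
        const Point r = points[k];
        // r is on or outside the circle centred at m + t*n  iff  a + b*t >= 0.
        const double a = (m.x - r.x) * (m.x - r.x) + (m.y - r.y) * (m.y - r.y) - halfLenSq;
        const double b = 2 * (n.x * (p.x - r.x) + n.y * (p.y - r.y));
        if (b > 0)      lo = std::max(lo, -a / b);
        else if (b < 0) hi = std::min(hi, -a / b);
        else if (a < 0) return false;  // r lies on the edge, strictly between p and q
    }
    return lo <= hi;  // some centre position keeps every other vertex out
}

// Brute force over all vertex pairs: O(n^3) overall.
std::vector<std::pair<std::size_t, std::size_t>> allDelaunayEdges(const std::vector<Point>& points) {
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    for (std::size_t i = 0; i < points.size(); ++i)
        for (std::size_t j = i + 1; j < points.size(); ++j)
            if (isDelaunayEdge(points, i, j)) edges.emplace_back(i, j);
    return edges;
}
```

For the corners of a square plus its centre, this keeps the four sides and the four centre-to-corner edges and rejects both diagonals. It uses plain floating point, so nearly degenerate input would need a more robust predicate.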
Posted by Ryan_001 on 24 February 2015 - 06:01 AM
When I was young my father (a chemist) invented and patented some process that saved the company he worked for a ton of money. They gave him an award and two of these:
They had 3KiB of RAM/storage (it was both) and could run BASIC programs. When I was 5, while on a long road trip, knowing my curiosity for anything 'techy', he tossed one over the backseat to me to stop me from pestering my sister. I was instantly fascinated, but after 2 days of reading the manual and playing around I could not get a program running. We were staying at a resort and one of the people there happened to be a programmer. To this day I don't remember the details of what he did, something about medical educational software, but he didn't really take me seriously (I was 5 after all) and kept trying to 'dumb things down' despite my insistence otherwise. Nonetheless I was able to pry a few bits of information out of him, and he walked me through my first 'hello world' program and taught me the basics of variables. The rest, as they say, is history.
Don't underestimate the younger ones; some (as I imagine many on this forum understand) are just born with a natural predisposition to this sort of logical thinking. As a young kid I never felt I was 'learning' programming, but rather that I already knew it and just had to figure out how to get the computer to listen to me.
I would also contend Gian-Reto, that sitting has no correlation to learning; and that for many little ones it has exactly the opposite effect.
Posted by Ryan_001 on 06 February 2015 - 10:58 AM
I think it has to (though like you state, I've never seen it officially stated in the documentation anywhere). If they were streamed out of order, things like transparency, overlapping tris (in the absence of a depth buffer), and stencil operations would be incorrect.
That said, I know (or at least it works on my GTX 580) that SV_PrimitiveID does match the input stream (meaning the 1st triangle will have id = 0, the 2nd one id = 1, etc.), so if worst comes to worst you could stream out SV_PrimitiveID with the rest of the stream out vertex.
If you find an official answer, please post it here because I'd be interested to know.
Posted by Ryan_001 on 10 January 2015 - 11:27 PM
Why are you creating and destroying threads constantly? That's a bad design. Pool them. Or use std::async to start your cheap jobs, since it will use a thread pool.
While I agree that his method is not optimal, a bug in the standard library is a pretty big deal. If he happened to find a way to reproduce this bug, even if by accident, that's a good thing. My suggestion would be to try simplifying it. See if you can remove the networking code. Get it as small as possible while still being able to reproduce the problem.
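For reference, the std::async suggestion in the quote could look roughly like the sketch below. `cheapJob` and `runJobs` are hypothetical names. One caveat: the standard does not promise that std::async reuses pooled threads. Some implementations do, others start a fresh thread per call, so whether this avoids thread churn depends on your standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for one of the "cheap jobs".
int cheapJob(int n) { return n * n; }

// Launch every job through std::async, then collect the results in order.
// std::launch::async forces each job to run asynchronously instead of
// being deferred until get() is called.
std::vector<int> runJobs(int jobCount) {
    std::vector<std::future<int>> pending;
    pending.reserve(static_cast<std::size_t>(jobCount));
    for (int i = 0; i < jobCount; ++i)
        pending.push_back(std::async(std::launch::async, cheapJob, i));

    std::vector<int> results;
    results.reserve(pending.size());
    for (auto& job : pending)
        results.push_back(job.get());  // blocks until that job has finished
    return results;
}
```

The caller never creates or joins a thread directly, which also makes this a handy starting point when cutting a repro down to the smallest case.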
Posted by Ryan_001 on 04 January 2015 - 07:45 PM
Well, one thing I notice is you've got the same header guard on both files. This might be causing some of this as well. It might only be loading one of the files, whichever one is parsed first. The other, while included, isn't bringing in any data, since the header guard would already be defined and the preprocessor would ignore whatever is in the ifndef block.
Try making the header guards more unique.
Since the files seem to be in different directories, try something like GUI_SPRITE_H and RENDER_SPRITE_H.
Or just use #pragma once.
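To make the collision concrete, here is a single-file sketch that pastes the contents of two imaginary headers one after the other, the way the preprocessor sees them after both #includes. The struct names, namespaces and the `*_SEEN` marker macros are invented for the demonstration.

```cpp
#include <cassert>

// ---- Both files use the same guard (the problem) ----
#ifndef SPRITE_H                 // first header: guard not defined yet
#define SPRITE_H
namespace broken { struct GuiSprite { int layer; }; }
#endif

#ifndef SPRITE_H                 // second header: guard already defined,
#define SPRITE_H                 // so everything up to #endif is skipped
namespace broken { struct RenderSprite { int textureId; }; }
#define BROKEN_RENDER_SPRITE_SEEN 1
#endif

// ---- Each file has its own guard (the fix) ----
#ifndef GUI_SPRITE_H
#define GUI_SPRITE_H
namespace fixed { struct GuiSprite { int layer; }; }
#endif

#ifndef RENDER_SPRITE_H
#define RENDER_SPRITE_H
namespace fixed { struct RenderSprite { int textureId; }; }
#define FIXED_RENDER_SPRITE_SEEN 1
#endif

// Record which second-header bodies the preprocessor actually kept.
#ifdef BROKEN_RENDER_SPRITE_SEEN
constexpr bool brokenRenderSpriteSeen = true;
#else
constexpr bool brokenRenderSpriteSeen = false;
#endif

#ifdef FIXED_RENDER_SPRITE_SEEN
constexpr bool fixedRenderSpriteSeen = true;
#else
constexpr bool fixedRenderSpriteSeen = false;
#endif
```

With the shared guard, `broken::RenderSprite` never exists, which shows up as "undeclared identifier" style errors even though the file was included. With unique guards both types are available.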