About C0D1F1ED

  1. C0D1F1ED

    multithreading software renderer

    Quote: Original post by RobTheBloke
    As I mentioned above, Knights Ferry....

    Knights Ferry is a co-processor on an expansion card. It's basically a 32-core Larrabee card without the graphics output, and it won't be available for the consumer market. So it doesn't bring LRBni any closer to CPUs.

    Quote: Original post by RobTheBloke
    Quote: Original post by C0D1F1ED
    It's pretty clear that IGPs have no future, and even the lower range of discrete graphics cards could be pushed aside in the foreseeable future.
    Nonsense. The world is moving towards better energy efficiency. An all-seeing, all-dancing CPU that can handle 3D graphics is not, and never will be, as energy efficient as purpose-built 3D hardware. The foreseeable future is filled with smartphones and netbooks. It is not filled with 8-core AVX-enabled i7s for the vast majority of people on this planet. IGPs are here to stay for at least another decade or two....

    The move towards better power efficiency doesn't mean using the CPU as a graphics processor isn't viable. Power efficiency is only one part of the equation. First of all, the CPU is already used for a multitude of tasks, none of which it is the most power-efficient piece of hardware for. But that doesn't matter: it's much cheaper to have one central processor capable of performing multiple tasks. For example, audio processing used to require an expansion card, but nowadays pretty much everyone uses audio codecs that run as a driver on the CPU. And although graphics is a much heavier task than audio, its workload is only increasing slowly, while CPUs are rapidly increasing their computing capacity. What this means is that they'll be adequate for graphics one day soon. There will obviously never be enough computing power for cutting-edge games, even using discrete graphics cards, but things like 3D desktops and casual games don't need much more than what today's IGPs offer.
AVX and FMA alone will quadruple the CPU's processing power, and that's without counting the increase in core count. Once you get an adequate level of graphics performance out of a CPU with reasonable power consumption, there's no point in having another piece of logic perform the same task at higher efficiency. Nobody cares, just like nobody cares that audio processing is done in a software driver. Also note that peak power consumption isn't reached all that often. People who buy a system with an IGP aren't that interested in graphics in the first place, and the percentage of time during which they use the IGP intensively is quite small. This means that in the near future it becomes hard to justify investing in a redundant piece of silicon that isn't used all that often.

It's also readily apparent that components which look alike become shared or get merged. That's happening today, and it clearly indicates that power efficiency isn't the only driving force. We already saw low-end GPUs evolve into IGPs, ditching the dedicated graphics memory. Now the dies are fusing together, and they share caches! Once the CPU's load/store units get gather/scatter support, there really isn't much left that sets them apart, and sooner or later they will merge as well. Note that on the GPU side, vertex and pixel pipelines were unified, and they got support for things that were previously CPU-only territory. So the convergence happens from both ends, and it's only logical that the hardware starts to look more alike, up to the point where keeping them separate just isn't buying you anything any more.

But even though power efficiency isn't the only factor, it's improving with every CPU generation. The clock frequency has pretty much stagnated, and instead performance is increased by adding more cores and making each core more powerful. AVX and FMA are also leaps forward in power efficiency.
AMD's Bulldozer architecture even aims to double the number of cores with a very small impact on transistor count, by sharing the front-end between pairs of cores. This increases power efficiency as well, and it's a technique inspired by GPU architectures. For GPUs themselves things aren't that rosy. They need to increase the clock frequency to prevent the transistor count from getting too high, and this lowers the GPU's power efficiency. So even on this parameter, they are converging closer together.

Note also that when you compare software rendering against hardware rendering, the CPU is also running the actual application. If you singled out the graphics processing, its power efficiency would be higher. In other words, for the same power envelope you could have a CPU with more cores, which performs closer to the IGP. Last but not least, a powerful CPU would benefit not only graphics but any other compute-intensive application as well. GPGPU on high-end graphics cards can be interesting, but it's a joke on IGPs. There's not a single thing an IGP is useful for other than rasterization graphics. It won't take long for CPUs to rival that as well.
  2. C0D1F1ED

    multithreading software renderer

    Quote: Original post by Ravyne
    We certainly haven't heard the last of Larabee -- its vector extensions are already on Intel CPU roadmaps...

    I don't think LRBni support will be added to the CPU cores. Its encoding is too different from AVX (which was designed to be extendable up to 1024-bit) for that to make sense. What's much more likely is that gather/scatter support will be added to AVX. Also, the IGP part could get support for LRBni at some point, but that doesn't make a lot of sense to me either. AVX with FMA and gather/scatter would be just as powerful, so you're better off with a few more CPU cores than with a heterogeneous architecture with inherent limitations. I don't see room for two different vector ISAs, and since AVX will be in every single CPU starting from Sandy Bridge and Bulldozer, my expectation is that LRBni will eventually go the way of the dodo. Anyway, if you know of any roadmaps that indicate something different, please let me know.

    Just to illustrate what a CPU with AVX, FMA and gather/scatter could do: today a GeForce GTX 460 achieves an average of 88 FPS during the 'Deep Freeze' test of 3DMark06. The same test runs at 3 FPS on a quad-core Core i7, using SwiftShader. This hardware costs roughly the same (note that a GPU is worthless without a CPU though). Now, a mainstream Sandy Bridge chip without an IGP can have six cores, and AVX initially doubles the throughput. If it also had FMA support (which really wouldn't be much of a stretch), we're looking at 6 times higher throughput compared to my current quad-core. So in theory the performance gap could be reduced to a factor of five, for graphics, using readily available technology! That's theory though. In practice AVX as specified today won't scale that well, because of the sequential load/store bottleneck. But this can be fixed with gather/scatter support. In fact, even without AVX the CPU would benefit greatly from gather/scatter support. So the gap would end up much smaller than a factor of five.
All this assumes the die size won't change. Of course by the time gather/scatter support is added, GPUs will have increased their throughput, but the number of CPU cores will increase at the same rate, so we can factor that out. GPUs will even become more generically programmable and better at non-uniform tasks, which will cost compute density, again reducing the gap for graphics... It's pretty clear that IGPs have no future, and even the lower range of discrete graphics cards could be pushed aside in the foreseeable future.
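The factor-of-five estimate above is simple back-of-the-envelope arithmetic; here it is written out using only numbers quoted in the post (the function name is illustrative, and nothing is measured by this code):

```cpp
// The post's estimate: a ~29x measured gap (88 FPS vs 3 FPS)
// divided by a 6x projected throughput increase.
double projectedGap() {
    double measuredGap = 88.0 / 3.0;   // GTX 460 vs quad-core SwiftShader
    double moreCores   = 6.0 / 4.0;    // six cores instead of four
    double avxWidth    = 2.0;          // 256-bit AVX vs 128-bit SSE
    double fma         = 2.0;          // fused multiply-add doubles peak FLOPS
    return measuredGap / (moreCores * avxWidth * fma);
}
```

This evaluates to roughly 4.9, which is where the "factor of five" in the post comes from.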
  3. C0D1F1ED

    multithreading software renderer

    Quote: Original post by Ravyne
    On something like a 4 core, 8 thread i7 processor from Intel, you might get something approaching Geforce 5x00 or ATI Radeon 9800 levels of performance and programability...

    On a Core i7, SwiftShader outperforms a Radeon 9800. It also offers Shader Model 3.0 capabilities, while the 9800 is limited to Shader Model 2.0.
  4. C0D1F1ED

    multithreading software renderer

    Quote: Original post by Bejan0303
    You know, think I could revolutionize the industry (jk).

    You can. With every generation, CPUs and GPUs converge closer together. Multi-core and wider vectors drastically improve the CPU's throughput, while at the same time GPUs are trading compute density for more generic programmability. This means that one day you'll be able to write a software renderer that can compete with the GPU. This may happen sooner than you might think, especially for IGPs. A year ago they were moved from the motherboard to the CPU package, but they're still separate chips. Next year, both Intel and AMD will launch CPUs that have an IGP on the same die. The next logical step is to completely unify them.

    The Intel HD Graphics IGP used in the upcoming Sandy Bridge processor generation will likely offer 130 GFLOPS of computing power. That's not a lot. In fact, a quad-core Sandy Bridge will offer over 200 GFLOPS thanks to AVX. And the compute density will double again with the Haswell architecture. The only thing preventing the CPU from performing well at graphics is the lack of gather/scatter support. These are the parallel equivalent of load/store instructions, and without them things like texture sampling, raster operations, attribute fetch, etc. are not very efficient to implement. It's not that difficult for Intel or AMD to add support for gather/scatter though. Intel has already done it for Larrabee, and even today's CPUs already have half of the required logic (to support unaligned memory accesses). So sooner or later CPUs will support gather/scatter, and this will make the IGP useless. CPU manufacturers will still provide Direct3D and OpenGL drivers, but note that this hardware unification also offers a tremendous opportunity to develop your own graphics architecture in software! It allows you to outperform the legacy graphics pipeline by having more control over all of the calculations and the data flow.
It also lets you do things the GPU is not capable of or not efficient at. Once these advantages start to take effect, GPU designers will have no choice but to make their architectures fully generic as well (capable of efficiently running very complex software written in C++). So they'll also resort to exposing the hardware directly, allowing software rendering technology to expand to mid-range and high-end systems. All this means that now is the right time to learn multi-threading and develop vectorized software.
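The "over 200 GFLOPS" figure above can be reconstructed from peak-throughput arithmetic. A minimal sketch, assuming a 3.4 GHz clock (an assumption; the post doesn't state one) and Sandy Bridge's ability to issue one 8-wide AVX multiply plus one 8-wide AVX add per cycle:

```cpp
// Theoretical peak single-precision throughput, in GFLOPS.
// flopsPerCycle = SIMD lanes * arithmetic instructions per cycle;
// for Sandy Bridge with AVX that is 8 lanes * 2 (one mul + one add) = 16.
double peakGflops(int cores, double clockGhz, int flopsPerCycle) {
    return cores * clockGhz * flopsPerCycle;
}
```

With `peakGflops(4, 3.4, 16)` this gives about 217 GFLOPS, consistent with the claim; FMA would double the per-cycle figure again.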
  5. Quote: Original post by keinmann
     auto str = "Hello world!";
     Did you intend const char* or std::string?

     There is no ambiguity. The right-hand side is a string literal, which decays to 'const char*'. If you intended str to be an STL string, you should have explicitly written 'std::string("Hello world!")', but that obviously defeats the whole purpose of using 'auto'. That said, 'auto' wasn't really meant to be used for simple types. It's especially useful when composing complex templatized types. So avoid using 'auto' unless you've determined it's the most elegant solution with minimal drawbacks.
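The deduction described above can be checked at compile time; a minimal sketch (variable names are illustrative):

```cpp
#include <string>
#include <type_traits>

// The literal "Hello world!" has type const char[13]; under auto
// deduction the array decays to a pointer, so str is a const char*.
auto str = "Hello world!";
static_assert(std::is_same<decltype(str), const char*>::value,
              "auto deduces const char*, not std::string");

// Getting an STL string requires asking for one explicitly.
auto s = std::string("Hello world!");
static_assert(std::is_same<decltype(s), std::string>::value,
              "now it is a std::string");
```

The static_asserts fire at compile time, so a successful build is itself the demonstration.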
  6. Just make the method you want to call virtual. That way its pointer gets added to the class's vtable, and you won't get an undefined reference error, because the static address never needs to be linked.
  7. C0D1F1ED

    What exactly does a software patent protect?

    You could use nvidia-texture-tools. According to the FAQ, this library has a license to use S3TC. If you really want to use your own implementation, I would first contact S3 before asking any lawyer for help. Chances are they'll actually let you use it for free under certain circumstances.
  8. C0D1F1ED

    Software renderer SSSE3

    Quote: Original post by rarelam
    What do you mean by proper mip mapping?

    Using all the right formulas and computing the mipmap level from the actual texture gradients. Aliasing means bad cache use, so it's important not just for quality but for performance too.

    Quote:
    Currently I use power of 2 symmetric textures, and the mipmaps are basically a series of subsampled textures halved in each dimension down to 1x1. I select mipmaps per 4x4 block of pixels, based on the exponent of the top left pixel's texture gradient. It is not completely accurate but it does give a pretty good speed up on big textures. Any suggestions for improving this?

    I don't think you can compute the mipmap level based on the exponent of the gradients alone. If I recall correctly, the full formula is: lod = log2(sqrt(max((du/dx)^2 + (du/dy)^2, (dv/dx)^2 + (dv/dy)^2))). The log2(sqrt(.)) can be computed from the exponent, but you can't skip directly from the gradient exponents to the lod.

    Quote:
    I do not have a good tool for measuring cpu bottlenecks, any free suggestions?

    AMD's CodeAnalyst is really nice. It also works for Intel processors, although you won't have any event-based information. But by profiling your experiments you'll be able to quickly locate the hotspots and optimize them.

    Quote:
    I am going to try swizzling the frame buffer so that my first pass is rendered to 64byte aligned targets. When I shade the deferred light pass I write to a different buffer for the final display, so the swizzling back is free. Free as in no different to what I have now.

    Makes sense.

    Quote:
    I know it is not easy to say without seeing any code or profiling, what have been the bottlenecks in your experience?

    You should try to avoid moving data around. One of my early vertex processing implementations used a cache to avoid reprocessing the same vertex (due to sharing between triangles). However, I still copied each triangle's vertices from this cache.
With each vertex being shared 3-4 times on average, this took about as much time as the vertex processing itself. Later I avoided all this copying by referencing the processed vertices by pointer. The cache structure got more complicated (as I don't want to overwrite results that are in use), but overall things got faster. I used the same approach for Sutherland-Hodgman polygon clipping as well. As general advice, try to keep the bigger picture in mind. When I optimized vertex copying I started by only copying the components that are actually needed. It's only later that I realized I didn't have to copy anything at all. Always remember that premature optimizations make the code harder to maintain and are a waste of time.
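The mipmap formula recalled above can be written out directly. A minimal sketch, assuming the poster's variant that groups gradients per texture coordinate (the function name is illustrative, and no clamping to the available mip range is done):

```cpp
#include <algorithm>
#include <cmath>

// lod = log2(sqrt(max((du/dx)^2 + (du/dy)^2, (dv/dx)^2 + (dv/dy)^2)))
// Using log2(sqrt(x)) == 0.5 * log2(x) folds the square root away.
double mipLod(double dudx, double dudy, double dvdx, double dvdy) {
    double u2 = dudx * dudx + dudy * dudy;
    double v2 = dvdx * dvdx + dvdy * dvdy;
    return 0.5 * std::log2(std::max(u2, v2));
}
```

A gradient of one texel per pixel gives lod 0 (the base level), and doubling the gradients adds exactly one level, which is why an exponent-only shortcut happens to match on pure powers of two but drifts in between.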
  9. C0D1F1ED

    Software renderer SSSE3

    Quote: Original post by rarelam
    Texture access is not very efficient. I use the 16bit madd instruction to calculate the texel address (u*1 + v*width), as I already have the uv's in an sse register in this format. Then shuffle and movd to an x86 register to access the texel from memory. I then rebuild back to a single sse register containing 4 32bit color values. Is there much of a difference in cache performance if I moved to swizzled textures?

    In my experience texture swizzling is not beneficial. The overhead of adjusting the addresses is higher than what you gain. Caches are quite large these days, so even if the first several fetches are all cache misses, the next ones will all be hits. On average, the number of misses isn't that different. This may be different for Atom processors though, as they have limited caches, and in-order execution actually demands the data to be in L1 cache for best performance. Note though that proper mipmapping is absolutely critical to achieve good cache performance.

    Quote:
    Deferred lighting from the depth buffer: I reverse project the surface position from the z buffer and use this in the lighting calculations. I use a similar process for calculating the DOF. With multiple lights and DOF, I am repeating the calculation several times for each pixel position. Would I be better off to store the full positions in a buffer at the raster stage?

    That really depends on the available memory bandwidth. If your shader does nothing but arithmetic operations, then reading and writing the full position will be cheaper than adding even more arithmetic operations for reconstructing the position. If instead you're already memory bound, then computing the position arithmetically will be faster.
    Quote:
    Currently I rasterize blocks of 4x4 pixels. I am considering swizzling the frame buffer, so that each block is 64byte aligned, rather than 4x16byte aligned at totally different locations for color, normal and depth. How much of a difference is this likely to make? I know the best answer to all these questions is to implement each one and test the difference in performance. However I would appreciate anyone's input, based on their experience, on what sort of improvement I could expect.

    In my experience it doesn't help. Remember that you'll have to unswizzle the frame buffer again at the end, and this extra pass easily takes longer than the bit of time you might save from having the blocks fit on a single cache line. Your mileage may vary, but even in the best case we're talking about a few percent performance gain. Frankly, it's not worth the trouble, and I would focus attention elsewhere.

    Quote:
    Currently shaders are all defined as templated functions within a main rasterize function, which basically draws blocks of 4x4 pixels. I want a more generic way of writing shaders. I have looked at the swiftshader approach, but the calling overhead for a function per pixel is too high (unless I have misunderstood). Does anyone have a relatively clean approach for writing shaders to be used on a block of 4x4 pixels?

    SwiftShader doesn't call a function per pixel. The implementations have varied, but it has always processed multiple pixels per call. In your case I would consider constructing a function to process 4x4 pixels at once. Per 16 pixels, the calling overhead really isn't that high. To improve on it further, you can have the function iterate over the blocks by itself too. I highly recommend taking a look at SoftWire or AsmJIT. Good luck!
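The per-block approach suggested above can be sketched with a plain function pointer (a JIT such as SoftWire or AsmJIT would generate the shader body at runtime instead; all names here are illustrative):

```cpp
#include <cstdint>

// A shader processes one 4x4 block per call, so the call overhead is
// paid once per 16 pixels instead of once per pixel.
typedef void (*BlockShader)(uint32_t* dst, int pitch, int x0, int y0);

// Example shader: fills one 4x4 block with solid opaque red (ARGB).
static void solidRed(uint32_t* dst, int pitch, int x0, int y0) {
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            dst[(y0 + y) * pitch + (x0 + x)] = 0xFFFF0000u;
}

// The rasterizer iterates over the blocks itself, keeping the indirect
// call out of the per-pixel loops. Assumes width and height are
// multiples of 4.
static void shadeBlocks(BlockShader shader, uint32_t* dst, int pitch,
                        int width, int height) {
    for (int y = 0; y < height; y += 4)
        for (int x = 0; x < width; x += 4)
            shader(dst, pitch, x, y);
}
```

Having `shadeBlocks` own the outer loop is the "iterate over the blocks by itself" refinement from the reply: the only per-block cost left is one indirect call.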
  10. It looks like you tried to use the code from my Advanced Rasterization article, but you broke the sub-pixel accuracy. Try reading the article again till you fully understand it, and don't change the algorithm unless you know what you're doing.
  11. C0D1F1ED

    Photos of Lego

    The real question is: are you using their intellectual property to make money out of it or not? If you create a game like LEGO Star Wars then obviously you're using their brand for your own benefit, so you'll need an agreement with them (quite likely meaning you'll have to pay). If instead you're just using the square bricks and the real work is in the art you create out of them, then there's absolutely no need to worry. You could just as easily have used crayons or colorful pebbles...

    Of course, the grey area is when you're combining their intellectual property with yours. Say for instance you create a nice house, using the bricks but also some of the complex shapes, maybe with some nice stickers on them. In this case the creation is partially yours and partially theirs. Then you still have two choices. The first is to do nothing: just publish your game and see whether LEGO objects, or actually regards it as good advertisement. A dispute, if it ever comes to that, can still be settled later. Note that companies don't instantly sue anyone unless they can actually win something by doing so. If your game is small scale, it's just not worth the negative PR for them. They even actively encourage people to photograph their creations, although that's mostly for personal use rather than commercial use. The second option is to contact them early. This is best when you're planning on using LEGO to save yourself the extra work. But it may result in signing an agreement that ends up costing you a lot of money, before knowing exactly how many copies of the game you'll sell. So you have to weigh the risk of getting sued later against the risk of paying too much up front.

    Anyway, I'm not a lawyer either, but I really believe that putting yourself in the position of the LEGO company and using some common sense will quickly answer your question.
  12. C0D1F1ED

    C++0x no more

    Quote: Original post by Promit
    P.S. If anybody recommends "C++0xA" I will beat you senseless.

    C++0Ah then.
  13. I once had a program that would crash on a specific Mac laptop running Windows, but not on any other Windows system, and only after the mouse pointer was moved using the trackpad. It was not my system, but the owner was able to send me a crash dump. After careful examination at the assembly level, I found that an MMX register had magically changed its value over a sequence of instructions that contained no MMX instructions whatsoever! This situation was seemingly IMPOSSIBLE. But then I realized that the system must be using Apple's own trackpad driver. A driver will literally interrupt your program whenever it receives input, run its own code, and then return to your code where it left off. So if it doesn't properly restore registers, you can get this really odd behavior. So I wrote another test program that specifically tested whether MMX registers change value unexpectedly, and indeed, it triggered an assert as soon as the mouse pointer was used...

      Anyway, that was an extremely rare situation. Usually when applications behave differently between systems, there's a simpler explanation. One of the common things that can go wrong is a buffer overflow. The operating system can place dynamically allocated buffers anywhere in memory. So 99 times out of 100 the overflow overwrites some unimportant data, and the 100th time the buffer is allocated right next to some crucial data, which then gets overwritten and modifies the results of your calculation!
  14. C0D1F1ED

    Problem with infinite floating point values.

    Quote: Original post by oliii
    infinite values are usually because of a divide by zero somewhere.

    I've already shown that's not the case here. Very large values get squared and they overflow to infinity.

    Quote:
    and you should also check for (len > 0.0000001f) to avoid a divide by zero.

    With all due respect, that's a terrible way to try to fix division by zero. Vector lengths smaller than 1e-6 can be perfectly valid, and single-precision floating-point values stay normalized down to about 1.18e-38. Below that they become 'denormalized', which practically means you start losing precision, but it's still not zero! The absolute smallest single-precision value that is not zero is 1.401e-45. In comparison, 1e-6 is enormous. So situations where division by zero can occur (not this one) really ask for a more thought-out solution than comparing against some arbitrary small-but-not-tiny value. More often than not you can actually eliminate the real cause of the denominator being zero in the first place. And even in cases where that's not possible, skipping the division is very unlikely to be the right way to deal with it.
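The limits referenced above are available from the standard library; a minimal sketch (the denormal constant is written out because FLT_TRUE_MIN only arrived later, in C11):

```cpp
#include <cfloat>
#include <cmath>

// Single-precision boundaries: FLT_MAX is the largest finite value,
// FLT_MIN the smallest *normalized* positive value (~1.18e-38), and
// the smallest denormal (~1.4e-45) is smaller still.
const float smallestNormal = FLT_MIN;
const float smallestDenorm = 1.401298e-45f;

// Dividing by a tiny-but-valid length is still well-defined.
float invLength(float len) { return 1.0f / len; }
```

`invLength(FLT_MIN)` is still finite (about 8.5e37), so a length far below the 1e-6 threshold criticized above does not by itself cause any trouble.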
  15. C0D1F1ED

    Problem with infinite floating point values.

    The bug is on line 74, not line 68. camPos has very large values, and so on line 68 camMove gets very large values too. Single-precision floating-point numbers can only store values up to 3.4e38. Although camMove's initial values are still below that, squaring any of them will overflow and result in infinity. That's what happens on line 74: getting the length of the vector involves squaring each of its components. I suggest you find out why camPos is so large in the first place. The scale of your world is probably too large.
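The overflow described above is easy to reproduce: any component larger than sqrt(FLT_MAX), roughly 1.8e19, makes the squared length infinite. A minimal sketch (the function name is illustrative, not from the thread's code):

```cpp
#include <cmath>

// Squaring each component is the first step of a vector length
// computation; the result overflows to +infinity once any component
// exceeds sqrt(FLT_MAX), about 1.8e19 in single precision.
float lengthSquared(float x, float y, float z) {
    return x * x + y * y + z * z;
}
```

`lengthSquared(3e19f, 0, 0)` is infinite while `lengthSquared(1e19f, 0, 0)` is not, which is exactly the failure mode on the post's line 74.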