Bejan0303

multithreading software renderer


Hello all!
I was thinking about ways to multithread game engines and rendering. I looked at various designs online and didn't really like any of them; none of them really multithread the rendering itself because of the render context, etc. (correct me if I'm wrong). I was wondering if a well-designed and thought-out multithreaded software renderer would be comparable, performance-wise, to its hardware-accelerated counterpart. I do understand that the GPU is specialized for these calculations, but I figured 8 good workers can probably produce more than 1 awesome worker. I tried to do some research on the topic but couldn't find anything. I was considering testing the idea myself, but wanted some input from people smarter and more experienced than myself first.

Bejan

A modern GPU is more like 1000 awesome workers than one awesome worker. Plus, with dedicated hardware to do various bits of the rendering, a general-purpose CPU cannot get anywhere close to this performance.

What do you mean by "the render context etc."? Do you mean that since you generally don't want to render things at different frame rates, multiple rendering threads will have to sync up per frame, thereby negating their threaded impact? I agree that this is an issue with multi-threaded rendering. The thing is, the modern GPU is incredibly powerful. The CPU is often the bottleneck nowadays when blasting triangles through the card. What a multi-threaded approach is useful for is setting up all your vertices / textures / etc. (through physics-based mesh placement, shader-driven texture filtering, etc.) to render your scene with. If your physics code runs on its own thread (processor), and your large-scale-terrain engine runs on another (if the two things are separable), then you've basically doubled the potential framerate of your code by multi-threading it.

A multi-threaded rendering approach would be good for software rendering. Maybe ray tracing? An interesting thing about that approach is that, if you break a scene into 8 different segments, and each segment ray-traces a portion of the scene, you're essentially implementing a parallel processor (like a GPU) through multi-threading!
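
For what it's worth, here's a minimal C++ sketch of that idea using std::thread: the framebuffer is split into horizontal bands and each worker traces its own band. tracePixel is only a stand-in for the real per-pixel work.

```cpp
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for the per-pixel work (ray trace, shade, etc.).
uint32_t tracePixel(int x, int y) { return uint32_t((x ^ y) & 0xFF) * 0x010101u; }

// Each worker shades its own band of scanlines; no synchronization is needed
// during rendering because the bands don't overlap.
void renderRows(std::vector<uint32_t>& framebuffer, int width, int yBegin, int yEnd) {
    for (int y = yBegin; y < yEnd; ++y)
        for (int x = 0; x < width; ++x)
            framebuffer[y * width + x] = tracePixel(x, y);
}

int main() {
    const int width = 640, height = 480, numThreads = 8;
    std::vector<uint32_t> framebuffer(width * height);

    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; ++i) {
        int yBegin = height * i / numThreads;
        int yEnd   = height * (i + 1) / numThreads;
        workers.emplace_back(renderRows, std::ref(framebuffer), width, yBegin, yEnd);
    }
    for (auto& w : workers) w.join();   // one sync point per frame
}
```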

Ah, I see, I didn't realize it was so much better at its job. It just bothered me that rendering appears to take up the most CPU, yet you can't very well multithread it. Is there no good way to take advantage of the 4+ cores on the CPU? Also (not to sound ignorant), why do people still bother writing software renderers then? Just a fun/learning experience?

Edit: I apologize, funkymonkey, I must have posted this initially around the same time as you and not seen your post. By rendering context I was referring to the inability to draw more than one triangle at once, e.g. one thread attempts to draw its models, and the other thread tries to draw theirs. At best I would think you would get triangles with vertices from both models. Even if the GPU and driver could differentiate the threads and keep the triangles correct, each thread would have to run at an equivalent speed (I think this is what you were saying), which might not be so bad if the driver could keep everything straight, I suppose. But a software renderer could be designed around the idea of multithreading from the start.

Even with the physics, AI, etc., it seems like rendering takes most of the CPU time (~70% rendering, ~30% everything else) and can't be multithreaded. Does that mean the only reasonable solutions are to just deal with the lack of balance, or to do more physics, AI, whatever else?

Thanks for the quick responses.

Bejan

[Edited by - Bejan0303 on December 5, 2010 12:54:03 PM]

The experts can correct me if I'm wrong, but I don't think rendering takes that much of the CPU time in a modern game/renderer, and if it does, it's simply due to batching issues and state changes, which should be correctable with smarter scene organization and/or a thread-safe rendering context (DX11) so you can have multiple threads submitting commands at the same time.

[Edited by - doesnotcompute on December 5, 2010 1:21:24 PM]

I'm not familiar with DirectX past 9, really, and hardly that; I prefer OpenGL (not for any reason other than that I learned it first and it feels better to use), but I'm not a fanatic about it either way. Does OpenGL have the same ability in that regard?

Bejan

Well, that's annoying. I hope they are working on that. Anyway, thanks a lot, guys, I was just curious about the idea. You know, thought I could revolutionize the industry (jk). Thanks for the information.

Currently with OpenGL the context is bound to a single thread; you can move the context between threads but it can only be 'current' in one thread at a time.

With DX11 you can create 'deferred contexts' which allow you to build command lists on multiple threads; however, the final submission to the graphics card can only be done from a single thread (though you could interleave this with your next frame's update step).
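
A minimal sketch of that deferred-context pattern, assuming 'device' and 'immediateContext' were created elsewhere (e.g. via D3D11CreateDeviceAndSwapChain); error handling omitted:

```cpp
#include <d3d11.h>

// Runs on a worker thread: record draw calls into a command list.
// Nothing hits the GPU here; the calls are only serialized.
ID3D11CommandList* BuildCommandListOnWorkerThread(ID3D11Device* device)
{
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // deferred->IASetVertexBuffers(...);   // record state and draws here
    // deferred->Draw(...);

    ID3D11CommandList* commandList = nullptr;
    deferred->FinishCommandList(FALSE, &commandList);
    deferred->Release();
    return commandList;
}

// Runs on the single thread that owns the immediate context:
// replay the recorded lists from all workers.
void SubmitOnMainThread(ID3D11DeviceContext* immediateContext,
                        ID3D11CommandList* commandList)
{
    immediateContext->ExecuteCommandList(commandList, FALSE);
    commandList->Release();
}
```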

As others have said, even low-end, modern-ish GPUs (say, the last 4 generations or so) are more capable than even the fastest multi-core, multi-threaded CPU running the best software renderers available.

On something like a 4-core, 8-thread i7 processor from Intel, you might get something approaching Geforce 5x00 or ATI Radeon 9800 levels of performance and programmability -- of course, to do that you've also eaten up most of the available compute and bandwidth resources. Examples of such renderers would include LLVMPipe, TransGaming's SwiftShader, or Rad Game Tools' Pix-o-Matic.

Graphics is a very parallel problem, "embarrassingly parallel" as it's often called, and as such GPUs have always evolved to be essentially multi-core/multi-threaded. New AMD cards operate on up to 1600 individual values at once (plus other work going on in the card, like texture sampling or tessellation). NVidia handles somewhat fewer, but other factors like core speed and dispatch efficiency make them competitive with (and in some comparisons surpass) AMD's performance, at the cost of a larger, more complex chip that must be sold at a higher price.

Then there is the issue of texture sampling, which really chews up bandwidth and can thrash the cache if the renderer is not careful. Even in Intel's failed Larrabee GPU (which we may still see, some day), which was composed of up to 32 simple, Pentium-derived CPUs, each paired with a massive 16-wide vector unit (large enough to do a matrix-matrix multiply or transform 4 vectors in a single instruction), Intel chose to implement hardware texture sampling -- the only part of the design that *wasn't* software running on those CPUs. The access patterns simply don't align with code or other data, so it's difficult or impossible to create cache hardware which satisfies all these needs.

Quote:
Original post by Ravyne
As others have said, even low-end, modern-ish GPUs (say, the last 4 generations or so) are more capable than even the fastest multi-core, multi-threaded CPU running the best software renderers available. [...]


The rendering itself is already parallel on the GPU side, which is where the rendering actually happens.

There's plenty of stuff to thread in preparing it, though.

Larrabee was ditched due to rasterizing slowly, but the real point behind it was ray tracing. It still had processors too complex to really push up the processor count, though. The real problem is that Intel is just way too bloated to become a multithreading powerhouse. They could succeed, but they'd need to go with a design with hundreds of processors, and they'd need to have Project Offset become developed enough to compete with current engines before they could even start to get things going.

Quote:
Original post by thatguyfromthething
The rendering itself is already parallel on the GPU side, which is where the rendering actually happens.

There's plenty of stuff to thread in preparing it, though.


Sure, but that wasn't the question. The question was "could an extremely threaded and efficient software renderer compete with a hardware renderer?"

Quote:
Larrabee was ditched due to rasterizing slowly, but the real point behind it was ray tracing. It still had processors too complex to really push up the processor count, though. The real problem is that Intel is just way too bloated to become a multithreading powerhouse. They could succeed, but they'd need to go with a design with hundreds of processors, and they'd need to have Project Offset become developed enough to compete with current engines before they could even start to get things going.


Oh geez.

Larrabee, I'm sure, wasn't brought into the consumer space for many reasons, but "rasterizing slowly" is subjective. Let's not forget that Larrabee missed its release date by a year before it was cancelled as a consumer line of GPUs, and AMD had come out with the fairly revolutionary 5x00 line in the meantime (benchmarks were coming out around the time of the Larrabee cancellation, as I recall). Michael Abrash, who wrote the graphics driver for Larrabee (essentially a really fancy, multi-threaded software renderer) as well as the commercial software renderer Pix-o-Matic and much of the original software rasterizer for Quake, has said, or at least implied, that performance targets were met, and that the bigger issue was that performance per dollar from a primary competitor had seen a sudden uptick due to an unforeseen technical jump and a strategy shift toward smaller, more efficient GPUs (which also caught nVidia with their pants down), as well as their own delay. Larrabee was probably cost- and performance-competitive, or at least within striking range, had it been launched against the 4x00 series a year prior.

Processor complexity wasn't that much of an issue. The Pentium-derived x86 cores accounted for something around 10% of the die, as I recall, considering the LRBni vector units (the "shaders," essentially) as a separate entity. High-end Larrabee chips were reported to have 32 such cores, and the vector units were 16 wide, so that's 512 single-precision calculations per cycle, which would have been just about right in the intended time frame. They don't need hundreds of processors, they need throughput; it doesn't matter how they get it.

Project Offset also has nothing to do with it -- firstly because it was built around a traditional rasterizer, having nothing to do with ray-tracing, and secondly because Intel had already quietly disbanded the team behind it and let the staff go.

We certainly haven't heard the last of Larrabee -- its vector extensions are already on Intel CPU roadmaps, and it *was* launched into the HPC market as-was. I wager that we'll certainly see its ring bus in future Intel CPUs as the core count increases, and we may even still see a consumer GPU some time in the future -- about twice as many transistors fit in a given area of silicon every time the process node shrinks, so core count can double along with performance every 18-24 months, assuming the renderer scales. Architecturally it should, unless the transport subsystems (bus, memory interfaces) get saturated.

Did you mean running the game code (like AI, physics, etc.) in the main thread and running the render code in another thread? If so, like in the Unreal 3 engine, you can build a ring buffer and push all your render commands into it from the game thread. Once you've finished pushing commands, run the rest of the game code in the game thread. The render thread will fetch the render commands and call DirectX or OpenGL to render. That is all!
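
Something along these lines; a very small single-producer/single-consumer ring buffer sketch in C++ (the command type and the capacity are just placeholders for illustration):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>

// The game thread pushes render commands; the render thread pops them and
// issues the actual DirectX/OpenGL calls. Single producer, single consumer.
struct RenderCommandQueue {
    static const std::size_t Capacity = 1024;
    std::array<std::function<void()>, Capacity> commands;
    std::atomic<std::size_t> head{0};   // advanced by the render thread
    std::atomic<std::size_t> tail{0};   // advanced by the game thread

    bool push(std::function<void()> cmd) {              // game thread
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == Capacity)
            return false;                               // buffer full
        commands[t % Capacity] = std::move(cmd);
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(std::function<void()>& cmd) {              // render thread
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                               // buffer empty
        cmd = std::move(commands[h % Capacity]);
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```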

I think it's possible to write a software rasterizer that is at least faster than lower-end hardware "workers" in quite a few situations.
The problems are random data access and texture filtering:
1. Random access is not supported by most SIMD architectures; SPUs support it via chained DMA, which is quite an overhead to set up, and the only ISA I know of that really supports it is Larrabee.
2. Random access needs to be done in a smart way: first, several of those accesses need to be gathered into one read; second, there needs to be a way to continue CPU execution while the transfer is made.
3. Texture filtering.

But if you don't need that, e.g. no texture reading, CPUs can be quite fast, e.g. rendering a z-pass, shadows, and even particles (textures in that case are frequently very magnified, and you can sometimes get away with just one channel, e.g. for smoke, which makes filtering quite an easy job).

My rasterizer draws about 22 MTri/s on an Intel core at 2.67 GHz and about 30 MTri/s on an SPU at 3.2 GHz. If you want to occupy more of the CPU, I would render different tasks of the rendering pipeline on each core, e.g. one or two cores/SPUs render the z-pass, some render dynamic shadows (e.g. 6-sided cubemaps for point lights), and some render particles into an offscreen buffer that can be blended with the scene in one go.
Most of my licensees are using it for occlusion culling, though. I have a specialized version which doesn't draw 100% correctly, but the occlusion results are still correct and it's even faster. This way you don't need to waste time on any portal system, or anti-portals, or offline PVS calculations, or artist-tweaked "layers/zones/rooms/areas" that get switched on/off at some hand-set distances; it just works. It doesn't replace any GPU, but it can keep up with it and saves quite a few stages in the pipeline later on (like streaming in textures that won't be visible anyway, setting up states, calculating particles, skinning...).


Regarding Larrabee, to some degree I agree with thatguyfromthething: Larrabee might be quite OK compared to a Pentium, but it's still way too bloated.
For example:
- Swizzling, masking, shuffling: an NVidia GPU doesn't have that, and you don't need it, as it can all be done by working on SoA data. But on the other side, a G80 has spent a lot of transistors on thousands of registers to keep working while hiding latency. You can't hide 400 cycles of RAM access latency with just 32 registers like on Larrabee, not even if you utilize the L1 cache; e.g. on Fermi you'd not exceed 25% ALU occupation even if you used the internal memory bank (which is way faster than most L1 caches).
- Latency of vector instructions. Usually, halving the latency means 3x to 4x the transistors. Or, the other way around, if you double the latency you can have 4 times the units on the same die space (that's not fully true, as you have overhead, e.g. buses between registers and ALUs, which increases linearly with the ALU count no matter how big the latency of the ALU is). There is some paper on the net analyzing the G92 from NVidia; I think the most expensive ALU instruction was taking nearly 2000 cycles. But that's OK, as long as you can issue one instruction per cycle and you have enough registers to keep the pipe busy. Even GPUs before the G80 had hundreds of pixels in flight; now compare that to Larrabee, with very few registers. That means you can hide at most as many cycles as you have registers; if you can unroll all your loops, you can get maybe 20-30 cycles hidden. So Larrabee's ALUs have to execute instructions with about 20 cycles of latency to not stall, compared to that 2000-cycle peak of a G92 -- that's pow(20, 2.53724) in cycles. If you rely on my "half latency, quad transistors" rule, you'd need pow(3 billion, 6.43) transistors on a Larrabee to achieve the same throughput as a GF100/Fermi; Intel would roughly need to produce a 1169.20-billion-transistor monster to catch up, just because of the suboptimal architecture... not taking into account any overhead due to unneeded features like swizzling, masking, shuffling... (Again, this is by my "half latency, quad transistors" rule; I've been told it by a hardware architect, and I trust him.)

One thing that really bites Intel is that it's really hard to program GPUs at the lowest level; it's even more complicated than programming SPUs. NVidia solves that partially by giving you only high-level access via CUDA: you can't manage threads yourself, you can't group warps, you can't control the flow, and you don't have to care about latency hiding; you don't manage which variables will be kept in registers and which will be spilled into main memory to 'emulate' a stack.
Intel wanted to offer a simple programming model to people used to x86, as most programmers already cry when they have to manage DMA transfers on SPUs by themselves; that is paid for with tons of transistors.

A simple comparison (to make it comparable I counted all 32-bit registers and took the whole CPU package for the transistor count, not excluding the caches, IO, etc., as those are needed to keep the ALUs running and contribute to the final cost):
- CELL with 8 SPUs: 230M transistors (~25M per core; 512 registers/core)
- Intel Nehalem-EX 8-core: 2300M transistors (~250M per core; 72 registers/core)
- Intel Larrabee 32-core: 1750M transistors (~54M per core; 2048 registers/core)
- NVidia G80 8-core: 681M transistors (~85M per core; 8192 registers/core)
- NVidia GF100 32-core: 3000M transistors (~94M per core; 32768 registers/core)
Those numbers should demonstrate that Larrabee is quite good for a CPU, but GPUs spend their transistors not just on ALUs; they mostly spend them on the parts that help get the highest possible utilization out of every part of the chip.

Regarding rasterization on Larrabee, I think it has some smart parts (e.g. binning at the object level is a smart choice), but it went from one extreme (line-walker rasterization) to the other extreme (hierarchical evaluation of triangle-pixel coverage). I've of course implemented both, the latter one just for evaluation; and while it's faster than line walking, it has an extreme overhead when triangles get smaller. NVidia seems to have a way smarter logic (reference: http://www.youtube.com/watch?v=IqhRrySbRDs), which I've tried to implement using SIMD4; it was quite a challenge, but it proved, even in software, to be superior to the hierarchical way on tiny triangles AND on big triangles. I of course have no Larrabee to prove that, but all the other hardware I have agrees on this.
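
For readers who haven't seen it, the plain (non-hierarchical) edge-function approach looks roughly like the sketch below; the hierarchical variant evaluates the same edge functions for whole tiles first and only descends into partially covered ones. This is purely illustrative, not production code, and assumes a consistent triangle winding.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Signed area of the parallelogram (a, b, p): positive when p is on the
// inside of edge a->b for counter-clockwise triangles.
static inline int edge(int ax, int ay, int bx, int by, int px, int py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

void rasterizeTriangle(std::vector<uint32_t>& fb, int width, int height,
                       int x0, int y0, int x1, int y1, int x2, int y2,
                       uint32_t color) {
    // Clamp the bounding box to the render target.
    int minX = std::max(0, std::min({x0, x1, x2}));
    int maxX = std::min(width - 1, std::max({x0, x1, x2}));
    int minY = std::max(0, std::min({y0, y1, y2}));
    int maxY = std::min(height - 1, std::max({y0, y1, y2}));

    // A pixel is covered when it lies on the inside of all three edges.
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
            if (edge(x0, y0, x1, y1, x, y) >= 0 &&
                edge(x1, y1, x2, y2, x, y) >= 0 &&
                edge(x2, y2, x0, y0, x, y) >= 0)
                fb[y * width + x] = color;
}
```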

cheers

Quote:
Original post by Ravyne
[...] Graphics is a very parallel problem, "embarrassingly parallel" as it's often called, [...]


May I specialize this phrase to "game graphics (and some other specialized graphics domains)". There are also graphics tasks like path tracing (or, even more so, Metropolis light transport, where the path tracer, in some sense, walks between pixels) that are inherently incoherent, and there a many-core CPU might indeed do a better job than a GPU.

Jacco Bikker of ompf.org/forum and Dade of Lux Render could tell of their experiences trying to squeeze out the best of GPU and CPU for realistic image synthesis.

Quote:
Original post by phresnel
May I specialize this phrase to "game graphics (and some other specialized graphics domains)". There are also graphics tasks like path tracing (or, even more so, Metropolis light transport, where the path tracer, in some sense, walks between pixels) that are inherently incoherent, and there a many-core CPU might indeed do a better job than a GPU.
That sounds surprising to me; path tracing is one of the most "parallel-friendly" algorithms I've ever seen, and in my tests it scales linearly with the number of cores (for simplicity I made a hardcoded scene with 8 boxes, so I could port it from FPU->SSE and FPU->G80 without any data layout struggle).
I think MLT exploits the coherence of lighting. The only problem I see is that you have to share one framebuffer, and on older GPUs (like the G80) you could have race conditions where several "workers" try to modify the same pixel, and there were no atomics at that time. But nowadays, with GF100/Fermi, you can do an atomic add on the framebuffer.
In my MLT tests I didn't make it 'safe'; all threads could modify any pixel they wanted. With 1024x1024 pixels, I expected the chances to be quite low that several permutations would result in the same pixel on the screen AND try to modify it at the same time. I admit I didn't port MLT to the GPU, but only because I didn't expect anything special.
Or am I missing something?
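
(For reference, the CPU-side analogue of that atomic add to a float framebuffer can be built from a compare-exchange loop; this is just a sketch using std::atomic, not code from my renderer.)

```cpp
#include <atomic>

// Atomically add 'value' to a shared float (e.g. one framebuffer channel).
// compare_exchange_weak refreshes 'current' on failure, so we simply retry.
void atomicAddFloat(std::atomic<float>& target, float value) {
    float current = target.load();
    while (!target.compare_exchange_weak(current, current + value)) {
        // another thread won the race; 'current' now holds its result, retry
    }
}
```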

Quote:
Original post by Bejan0303
You know, thought I could revolutionize the industry (jk).

You can.

With every generation, CPUs and GPUs are converging closer together. Multi-core designs and wider vectors drastically improve the CPU's throughput, while at the same time GPUs are trading compute density for more generic programmability. This means that one day you'll be able to write a software renderer that can compete with the GPU.

This may happen sooner than you might think, especially for IGPs. A year ago they were moved from the motherboard to the CPU package, but they're still separate chips. Next year, both Intel and AMD will launch CPUs that have an IGP on the same die. The next logical step is to completely unify them.

The Intel HD Graphics IGP used in the upcoming Sandy Bridge processor generation will likely offer 130 GFLOPS of computing power. That's not a lot. In fact, a quad-core Sandy Bridge will offer over 200 GFLOPS thanks to AVX. And the compute density will double again with the Haswell architecture.

The only thing preventing the CPU from performing well at graphics is the lack of gather/scatter support. These are the parallel equivalent of load/store instructions, and without them things like texture sampling, raster operations, attribute fetch, etc. are not very efficient to implement. It's not that difficult for Intel or AMD to add support for gather/scatter, though. Intel has already done it for Larrabee, and even today's CPUs already have half of the required logic (to support unaligned memory accesses). So sooner or later CPUs will support gather/scatter, and this will make the IGP useless.
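
To make concrete what a gather is: it is the loop below collapsed into a single vector instruction. Today the CPU effectively executes one dependent load per SIMD lane, which is exactly what makes texture sampling expensive in a software renderer (illustrative sketch only):

```cpp
#include <cstddef>

// Software "gather": one load per lane, each from an unrelated address.
// A hardware gather instruction would perform all of these in one go.
void gather(const float* base, const int* indices, float* out, std::size_t lanes) {
    for (std::size_t i = 0; i < lanes; ++i)
        out[i] = base[indices[i]];   // e.g. 4 or 8 independent texel fetches
}
```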

CPU manufacturers will still provide Direct3D and OpenGL drivers, but note that this hardware unification also offers a tremendous opportunity to develop your own graphics architecture in software! This will allow you to outperform the legacy graphics pipeline by having more control over all of the calculations and the data flow. And it allows you to do things the GPU is not capable of, or not efficient at.

So once these advantages start to take effect, GPU designers will have no choice but to make their architecture fully generic as well (capable of efficiently running very complex software written in C++). So they'll also resort to exposing the hardware directly, allowing software rendering technology to expand to mid-end and high-end systems.

All this means that now is the right time to learn multi-threading and develop vectorized software.

Quote:
Original post by Krypt0n
Quote:
Original post by phresnel
[...]
That sounds surprising to me; path tracing is one of the most "parallel-friendly" algorithms I've ever seen, and in my tests it scales linearly with the number of cores [...]
Or am I missing something?


The number of (independent) cores lets it scale (near-)linearly. But path tracers are not data-parallel; the ray tree quickly diverges into all possible directions beyond the primary bounce. Therefore a GPU, and the modern SSE extensions to CPUs, which are optimized for data-parallel tasks with no branching, quickly become disadvantageous for the higher-level parts of the ray tree.

You can get a bit further by masking out rays in a ray packet, but even a Whitted-style ray tracer soon loses its parallel advantage for scenes with a lot of perfect mirrors.

A lot of effort must be put in to stay data-parallel, e.g. using deterministic quasi-Monte Carlo methods, but consider scenes with lots of small patches and different BRDFs, like those found in real life.

Put multiple scattering volume rendering into the mix, where each scatter may yield a full ray trace in itself, and the incoherency is perfect.

Hence: Many cores (Multicore CPU) -> Good. Wide vectors (GPU, SSE) -> not good.
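
To illustrate the ray-packet masking mentioned above, here is a plain-C++ sketch of a 4-wide packet with an active mask (the names and layout are made up for illustration; SSE intrinsics omitted for readability). Once the lanes diverge, the masked-out slots still take up space in every iteration, which is the efficiency loss in question:

```cpp
#include <array>

// SoA ray packet: four rays traced "together", with a per-lane mask.
struct RayPacket4 {
    std::array<float, 4> ox, oy, oz;   // origins
    std::array<float, 4> dx, dy, dz;   // directions
    std::array<bool, 4>  active;       // lane mask
};

static bool anyActive(const RayPacket4& p) {
    return p.active[0] || p.active[1] || p.active[2] || p.active[3];
}

void tracePacket(RayPacket4& p) {
    // The packet keeps iterating as long as ANY lane is alive; dead lanes
    // are merely skipped, so their SIMD slots do no useful work.
    while (anyActive(p)) {
        for (int lane = 0; lane < 4; ++lane) {
            if (!p.active[lane])
                continue;               // masked out: wasted slot
            // ...intersect, shade, maybe spawn a secondary ray...
            p.active[lane] = false;     // terminated (for this sketch)
        }
    }
}
```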

[Edited by - phresnel on December 7, 2010 4:47:56 AM]

Quote:
Original post by Ravyne
On something like a 4-core, 8-thread i7 processor from Intel, you might get something approaching Geforce 5x00 or ATI Radeon 9800 levels of performance and programmability...

On a Core i7, SwiftShader outperforms a Radeon 9800. It also offers Shader Model 3.0 capabilities, while the 9800 is limited to Shader Model 2.0.

Quote:
Original post by phresnel
The number of (independent) cores lets it scale (near-)linearly. But path tracers are not data-parallel; the ray tree quickly diverges into all possible directions beyond the primary bounce. Therefore a GPU, and the modern SSE extensions to CPUs, which are optimized for data-parallel tasks with no branching, quickly become disadvantageous for the higher-level parts of the ray tree.

Tree traversal is still parallel, but I think I understand what you want to say. Data is scattered quite randomly, which makes reads incoherent and therefore slow. But still, a GPU has very fast memory, 150 GB/s compared to ~20 GB/s for modern CPUs, so really random access can compete quite well (except if you have some cache monster like an Itanium or POWER7).
You are right that this cannot easily be ported to SSE, because those instructions are not "vector complete"; but SPUs, LRBni, and all GPUs are, and you can easily convert scalar code to SIMD with OpenCL/CUDA.

Quote:

You can get a bit further by masking out rays in a ray packet, but even a Whitted-style ray tracer soon loses its parallel advantage for scenes with a lot of perfect mirrors.
You probably mean the scattered data access again, but I don't see an advantage for CPUs here; they will also have scattered access. And while CPUs will at some point stall (quite soon compared to GPUs), GPUs will continue to execute a lot of rays in parallel until the scattered data reads finish for the other threads.

Quote:

A lot of effort must be put in to stay data-parallel, e.g. using deterministic quasi-Monte Carlo methods, but consider scenes with lots of small patches and different BRDFs, like those found in real life.
Yes, you're right. It's a pity that you cannot manage threads on GPUs yourself for better cache-thread cooperation.

Quote:

Put multiple scattering volume rendering into the mix, where each scatter may yield a full ray trace in itself, and the incoherency is perfect.

Hence: Many cores (Multicore CPU) -> Good. Wide vectors (GPU, SSE) -> not good.
I really have to disagree on that one. You're right about SSE, due to the lack of scattered reads, but path tracing (and MLT as a special version of it) is a very data-parallel task, even though the data is scattered. As a result, the same code path is walked by a lot of threads; it's just that the data you read is not coherent, but that's the case for both GPU and CPU. I don't see an advantage for CPUs here, except if you design special thread management that takes CPU caches into account (I think we didn't imply that here, did we?).



I am afraid I don't see how ray tree traversal can be a highly data- and control-flow-parallel task. Some paths stop after the second recursion, as the BRDF and a game of Russian roulette decided so; some paths stop after the 50th recursion, by the same reasoning.

Without Russian roulette, you either introduce bias or become a victim of the halting problem, as ray tracers are Turing machines. With Russian roulette, you have no data parallelism over either the ray tree (which is more of a virtual term, as most ray tracers do not actually store the ray tree) or the scene.

Of course there are tasks within the ray tracer that can be vectorized, but I've personally never heard of a *full* production path tracer (or derivative) that runs *solely* on the GPU and supports all the nice features (RenderMan, e.g.).


Or did I miss the moment where you could write GPU programs with ifs, whiles, fors, gotos, returns, breaks, without penalty?
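
As a toy illustration of how Russian roulette makes path lengths diverge, the little program below simulates a handful of paths; all constants are made up and it is not a real renderer, it only shows that each path terminates at a different, random depth:

```cpp
#include <algorithm>
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u01(0.0f, 1.0f);

    for (int path = 0; path < 8; ++path) {
        float throughput = 1.0f;
        int depth = 0;
        for (;;) {
            ++depth;
            throughput *= 0.7f + 0.3f * u01(rng);        // fake BRDF attenuation
            if (depth > 3) {                             // start playing roulette
                float survive = std::min(0.95f, throughput);
                if (u01(rng) >= survive) break;          // roulette kills the path
                throughput /= survive;                   // keep the estimator unbiased
            }
        }
        std::printf("path %d terminated after %d bounces\n", path, depth);
    }
    return 0;
}
```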

The discussion is awesome, but from the looks of it I think the OP was merely using immediate-mode OpenGL rendering (sending each triangle individually) and wondered why so much CPU was spent on "rendering".

Bejan0303: Look into vertex buffers, where the GPU can draw several thousand triangles with a single command. This will allow your program to continue while the GPU is rendering in parallel.
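
A minimal sketch of that path (this assumes an extension loader such as GLEW provides the buffer-object entry points, and that a GL context is already current):

```cpp
#include <GL/glew.h>
#include <vector>

// Upload a batch of triangles once, then draw them all with one call.
void drawWithVBO(const std::vector<float>& positions)   // xyz triples
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 positions.size() * sizeof(float),
                 positions.data(), GL_STATIC_DRAW);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, nullptr);            // sourced from the VBO

    // One command, thousands of triangles; the CPU is free immediately after.
    glDrawArrays(GL_TRIANGLES, 0, GLsizei(positions.size() / 3));

    glDisableClientState(GL_VERTEX_ARRAY);
    glDeleteBuffers(1, &vbo);
}
```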

Hello, sorry I haven't posted in a bit (not that I have a vast knowledge base to contribute anyway); the semester is coming to a close, so homework is bogging me down... Anyway, yeah, my basic question was: could a well-designed multithreaded software renderer compete with today's hardware renderers? Especially in terms of games; from what I understand, movie rendering is done in software anyway, and I'm more interested in game programming.

Also, I definitely use buffers and don't use immediate mode for much, really; not that I'm an expert, but I think buffers make more sense anyway. I think immediate mode is meant as training wheels and is great for learning, though. I do realize that after the draw calls you can continue to calculate the next frame before the Present() or SwapBuffers(), but it seemed to me (not through hard, tested evidence) that my code spends more time rendering than on most anything else, which is fine. I currently have a quad-core processor and was considering the design of a game engine. I didn't appreciate not being able to dump a bunch of triangles into the renderer from each thread, whereas with a custom software renderer you could clip things into screen-space regions, dividing them up by cores or something. I have never tried designing/coding a renderer, so I have no idea how complicated it is.

Also, I don't think a software renderer would really need to use "shaders" like GLSL, HLSL, etc., since you could just code them right in. Even if you didn't want them to be a direct part of the renderer, you could work in some "shader" object class that you activate at the appropriate time. It could be pure C/C++ or whatever other language someone thinks is better...
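
Something like the sketch below is what I have in mind: the rasterizer is templated on a shader functor, so the "pixel shader" is just inlined C++ (all names here are made up for illustration):

```cpp
#include <cstdint>

// Interpolated per-pixel inputs (a real renderer would carry more attributes).
struct Fragment {
    float u, v;
};

// A "pixel shader" is just a callable object.
struct CheckerShader {
    uint32_t operator()(const Fragment& f) const {
        bool on = (int(f.u * 8) + int(f.v * 8)) & 1;
        return on ? 0xFFFFFFFFu : 0xFF000000u;
    }
};

// The span loop is templated on the shader type, so the per-pixel call is
// inlined by the compiler rather than dispatched like a GLSL program.
template <typename Shader>
void shadeSpan(uint32_t* row, int count, Fragment frag, float du,
               const Shader& shader) {
    for (int x = 0; x < count; ++x) {
        row[x] = shader(frag);
        frag.u += du;              // step the interpolant across the span
    }
}
```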

Also, I really enjoyed all the info about the GPU hardware and such. I didn't know a lot of the specifics and didn't realize the difference in output between a general-purpose CPU and a GPU was so great.

Bejan

