Intel's Larrabee Card

Is anyone else as excited about these guys coming out as I am? I'm really looking forward to working with them... Does anyone know how they are supposed to compare with regular video cards for video processing, though? I know part of their appeal is that they are fully x86/x64 compatible and are good at more than just graphics, but I'm wondering: is it bound to be much slower than a regular card of the same generation?
I am excited, but not for its rasterization power, which I fear will be far behind the upcoming generation of nVidia and AMD/ATI cards. I'm mostly interested in what its raytracing capabilities will be, since code that runs on an x86 will run on LRB as well.
That said, I still have to figure out how they plan to scale LRB in the future: GPUs get speed improvements by changing architecture, but an x86 CPU doesn't change much, so the only way I see to improve performance every year or so is by increasing the number of processing units or the frequency, both of which lead to heat and power consumption issues...
Personally I see it for what it is... another useful addition to the toolset. Given my daily programming, I would use it for:

* Tools
* Post Processing
* Physics
* Particles

and likely many, many other tasks that involve repetitive, self-contained operations.
Quote:Original post by cignox1
I am excited, but not for its rasterization power, which I fear will be far behind the upcoming generation of nVidia and AMD/ATI cards. I'm mostly interested in what its raytracing capabilities will be, since code that runs on an x86 will run on LRB as well.

Don't be so sure. Keep in mind each LRB core DOES have that (functionally complete!) 16-way SIMD in addition to the normal x86 functions. You wouldn't take, like, Quake's renderer and run it on LRB just because Quake ran on an x86. You'd want to redesign it so that it really uses the SIMD that LRB has, e.g. http://www.ddj.com/hpc-high-performance-computing/217200602.

Personally, I actually think it will be competitive with NV and AMD when it comes out. Some aspects of it will be slower, and the raw FLOPS of NV/AMD's next chips are going to be higher, but having the renderer behave completely differently (i.e. the fact that it's a TBDR instead of an immediate-mode renderer) will probably give some very good performance numbers, especially if the rendering code of the game in question is tweaked (or reworked) to run better on LRB. Not only that, but it may turn out that in practical cases LRB's FLOPS will be "better" than NV/AMD's, due to stuff like a WAYYY higher cache size per LRB unit.
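To make the tiling idea concrete, here is a minimal binning sketch (illustrative only, with made-up Triangle/TileGrid types; this is not Larrabee's actual pipeline): each triangle's screen-space bounding box is assigned to the tiles it overlaps, and each tile's bin can then be rasterized independently, with that tile's framebuffer staying hot in one core's cache.

```cpp
#include <vector>
#include <algorithm>
#include <cstdint>

// Hypothetical types, for illustration only.
struct Triangle { float x0, y0, x1, y1, x2, y2; /* plus attributes */ };

struct TileGrid {
    int tileSize, tilesX, tilesY;
    std::vector<std::vector<uint32_t>> bins;   // triangle indices per tile

    TileGrid(int w, int h, int ts)
        : tileSize(ts), tilesX((w + ts - 1) / ts), tilesY((h + ts - 1) / ts),
          bins(tilesX * tilesY) {}

    // Bin a triangle into every tile its screen-space bounding box touches.
    void bin(uint32_t idx, const Triangle& t) {
        int minX = std::max(0, (int)std::min({t.x0, t.x1, t.x2}) / tileSize);
        int maxX = std::min(tilesX - 1, (int)std::max({t.x0, t.x1, t.x2}) / tileSize);
        int minY = std::max(0, (int)std::min({t.y0, t.y1, t.y2}) / tileSize);
        int maxY = std::min(tilesY - 1, (int)std::max({t.y0, t.y1, t.y2}) / tileSize);
        for (int ty = minY; ty <= maxY; ++ty)
            for (int tx = minX; tx <= maxX; ++tx)
                bins[ty * tilesX + tx].push_back(idx);
    }
};
// After binning, each tile's triangle list can be shaded by a different core,
// so the tile's color/depth data never has to leave that core's cache.
```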

Quote:That said, I still have to figure out how they plan to scale LRB in the future: GPUs get speed improvements by changing architecture, but an x86 CPU doesn't change much, so the only way I see to improve performance every year or so is by increasing the number of processing units or the frequency, both of which lead to heat and power consumption issues...

GPUs have been getting speed improvements from adding more processing units all the time, and from increasing chip frequencies all the time. Heck, most of the time the architecture of a specific series of chips is exactly the same, just with different numbers of units, clocks, or memory bandwidth. LRB will be no different in that regard, and just like GPUs, they'll be able to add more processing units by upgrading the manufacturing process every year or two.
Tim Sweeney sure is ;)

Graphics rendering wise - it's definitely interesting. However, I think it will take a generation (or two... or more) before its power can really be exploited. Yes, it will be a powerful chip. But the fact that its power comes from having "many" cores means that a lot of thought and planning will be needed when writing renderers for it. Multithreading is easier said than done, especially for a rasterizer - and especially if you want to do it right. You have a lot of hardware threads to make effective use of (is it one, or two threads per core? I can't remember.)

That said, Intel is making a big push for multithreading education - which is a very good thing. They have released a number of papers trying to encourage developers, and there are some pretty smart guys working on the Larrabee project. Consoles, especially the PlayStation 3 with its SPUs, have also kicked people into gear, forcing them to get into the multithreading mindset. And let's not forget our "plain old" dual and quad core machines. ;)

But programming issues aside, I can see a lot of developers approaching it with caution. Think of how long it took for our trusty GPU to become mainstream. id Software took a big gamble with Quake 3 by requiring hardware OpenGL, which was a great thing, but can you see developers doing that any time soon with Larrabee? Look at PhysX...

Speed wise, I remember seeing some previews that showed it lagged behind the current GPUs of the time, but the reviewers were still impressed with it.

But yes, the possibilities outside of just rasterizing (or raytracing, I guess) graphics are quite exciting. It could slot in as the GPU's right-hand man, taking care of the things Andy Firth mentioned like post processing and particles (Killzone 2, I believe, does post processing on one of the SPUs on the PS3 to free up the GPU). Those doing software occlusion culling will be able to offload it to the LRB, which is definitely cool (guess it wouldn't be software occlusion culling anymore... :D). It could be used to do skinning, so no more limits on bones/weights/etc... I could go on for pages about the "little" things it could take care of.
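As a rough illustration of the occlusion-culling part (a minimal sketch with assumed types, not how any shipping engine or LRB actually does it): rasterize the big occluders into a small CPU-side depth buffer, then test each object's screen-space bounding rectangle and nearest depth against it before submitting the object for drawing.

```cpp
#include <vector>
#include <algorithm>

// Tiny software depth buffer used only for visibility queries (hypothetical).
struct DepthBuffer {
    int w, h;
    std::vector<float> z;                       // depth per pixel, 1.0 = far plane
    DepthBuffer(int w_, int h_) : w(w_), h(h_), z(w_ * h_, 1.0f) {}
};

// Returns true if any pixel inside the rectangle is farther than minZ,
// i.e. the object's bounding box could be visible and must be drawn.
bool boxMaybeVisible(const DepthBuffer& db, int x0, int y0, int x1, int y1, float minZ)
{
    x0 = std::max(x0, 0); y0 = std::max(y0, 0);
    x1 = std::min(x1, db.w - 1); y1 = std::min(y1, db.h - 1);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (db.z[y * db.w + x] > minZ)      // occluder doesn't cover this pixel
                return true;
    return false;                               // fully occluded: cull it
}
```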

If it gets widespread adoption then our lives will be a lot more interesting :)
Quote:Original post by Cypher19
Don't be so sure. Keep in mind each LRB core DOES have that (functionally complete!) 16-way SIMD in addition to the normal x86 functions. You wouldn't take, like, Quake's renderer and run it on LRB just because Quake ran on an x86. You'd want to redesign it so that it really uses the SIMD that LRB has, e.g. http://www.ddj.com/hpc-high-performance-computing/217200602.

Of course I didn't mean that you can take the source, recompile it, and everything goes fine. I was just referring to the fact that LRB will be very similar to the well-known CPUs that most code is written for, so it should be reasonably easy to adapt the Quake source to run on it.

Quote:
Personally, I actually think it will be competitive with NV and AMD when it comes out. Some aspects of it will be slower, and the raw FLOPS of NV/AMD's next chips are going to be higher, but having the renderer behave completely differently (i.e. the fact that it's a TBDR instead of an immediate-mode renderer) will probably give some very good performance numbers, especially if the rendering code of the game in question is tweaked (or reworked) to run better on LRB. Not only that, but it may turn out that in practical cases LRB's FLOPS will be "better" than NV/AMD's, due to stuff like a WAYYY higher cache size per LRB unit.

I think (hope?) that too :-)

Quote:
GPUs have been getting speed improvements from adding more processing units all the time, and from increasing chip frequencies all the time. Heck, most of the time the architecture of a specific series of chips is exactly the same, just with different numbers of units, clocks, or memory bandwidth. LRB will be no different in that regard, and just like GPUs, they'll be able to add more processing units by upgrading the manufacturing process every year or two.


Yes, but this is just one way to improve GPU performance. Adding cores to or increasing the frequency of LRB will be much harder, as we have seen several times with CPUs. In short, I don't think they will be able to double LRB's performance every 18 months or so, because the only ways I see to do that are adding cores and increasing frequency, and both have shown their limits.
Of course, they can also change the architecture, but that usually doesn't increase efficiency all that much, and it takes more than 18 months.

But I will be more than happy to be proven wrong :-)

Just like with all the other tech Intel adds (besides MMX), they just use the game industry for marketing.

Honestly, SSE has so far lacked most of the simple but important instructions that would make it useful for games, but in terms of supercomputing capabilities (like working with two doubles, especially for complex numbers) it is just perfect.
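For what it's worth, that double-precision/complex-number point is easy to picture with SSE3's packed-double operations; a minimal sketch of one double-precision complex multiply per call (great for HPC-style kernels, not much help for typical 3D game math):

```cpp
#include <pmmintrin.h>   // SSE3 (_mm_addsub_pd)

// Multiply two double-precision complex numbers packed as (re, im) in __m128d.
static __m128d complex_mul(__m128d a, __m128d b)
{
    __m128d b_re = _mm_shuffle_pd(b, b, 0x0);       // (b.re, b.re)
    __m128d b_im = _mm_shuffle_pd(b, b, 0x3);       // (b.im, b.im)
    __m128d a_sw = _mm_shuffle_pd(a, a, 0x1);       // (a.im, a.re)
    // lane0: a.re*b.re - a.im*b.im   (real part)
    // lane1: a.im*b.re + a.re*b.im   (imaginary part)
    return _mm_addsub_pd(_mm_mul_pd(a, b_re), _mm_mul_pd(a_sw, b_im));
}
```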

Larrabee is the same.

If they wanted to make something that is good for gaming, they would design something similar to ATI and NVidia. Heck, they have it already; most GPUs on earth are from Intel. Even those on-board GPUs are massively parallel and programmable with C and assembler, so why does Intel need another "GPU" chip?

Yes, it's for supercomputing. You can sell the same chip for HPC at 10x the price just by relabeling it to Xeon/Tesla/Opteron.

So, let's take a look at the basic architecture of Larrabee:
- Why do you need a fully synchronized cache across all 32(?) cores? Really, if you break up physics, gfx, etc. to work on 32 cores, you will have little data dependency, so what is the need for this slow cache architecture? You won't run simple unoptimized programs on Larrabee; any quad core available today could outperform it otherwise (especially if it's ad-hoc multithreaded). So you'll have physics code that would run way faster with a cache that is not synchronized but has a smaller latency.
What is it needed for? It's needed to run HPC software that relies on memory consistency (even across several servers).
- Why no scratchpad memory instead of big caches? Why not use something ultra-fast, dedicated to every core? Look at the SPUs or CUDA/GeForce: they have local storage (on GeForce even way smaller) that is dedicated to each core.
- Why take some ultra-old instruction set instead of a highly optimized gfx opcode/instruction set? Intel has had such hardware for ages; check out http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf ("Graphics Processors are Intel i860 microprocessors, Renders are 128x128 custom SIMD arrays"). Why didn't they open source those ages ago? Why don't they open up their current, fully D3D10/D3D9/OpenGL-compatible on-board chipsets?
It's all so that old sources can be compiled to run on their 16x SIMD. They will give you a compiler that runs all your old code, and you can optimize some of it, so you can even run all those old hand-optimized assembler routines. For game developers this doesn't matter; we will have to write our stuff from scratch anyway. We don't use Intel Integrated Performance Primitives (tm), which are optimized by more people at Intel than work on the rendering side.
- But there are texture samplers, FTW? No, they are separate so as not to pollute the core SIMD instruction set with instructions dedicated to gfx. HPC does not need DXT1 decompression; they don't even care about 32-bit float interpolation. From a game development perspective it would be smart to have those sampling instructions in the SIMD set, because otherwise either the "ALU" or the "TEX" unit is never at full load. With software rasterization that is heavy on texture sampling, it would be way smarter to use all those 32 cores to sample textures; and if you're drawing just shadows, it would be smarter to use that silicon not for idling (because there are no textures to sample) but to rasterize/transform (see the bilinear sketch after this list).
But hey, you can just make some "HPC" chips that replace the texture units with real cores.
And guess what, I bet the cache of those texture units won't be synchronized across the whole memory like the normal cache; that is of no use for gfx.
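To picture why sampling is argued above to be "just more ALU work" the cores could take over, here is a scalar bilinear fetch sketch (illustrative only: a hypothetical single-channel float texture, no DXT decompression, swizzling, or mip selection):

```cpp
#include <algorithm>

// Hypothetical single-channel float texture, for illustration only.
struct Texture { int w, h; const float* texels; };

float sampleBilinear(const Texture& t, float u, float v)
{
    // Map normalized coordinates to texel space and clamp.
    float x = u * (t.w - 1), y = v * (t.h - 1);
    int x0 = (int)x, y0 = (int)y;
    int x1 = std::min(x0 + 1, t.w - 1), y1 = std::min(y0 + 1, t.h - 1);
    float fx = x - x0, fy = y - y0;

    // Blend the four neighbouring texels by the fractional position.
    float c00 = t.texels[y0 * t.w + x0], c10 = t.texels[y0 * t.w + x1];
    float c01 = t.texels[y1 * t.w + x0], c11 = t.texels[y1 * t.w + x1];
    float top = c00 + (c10 - c00) * fx;
    float bot = c01 + (c11 - c01) * fx;
    return top + (bot - top) * fy;
}
```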


"but it's the best...!!!"
no it's not. alternatives:
-The PSP's MIPS VFPU: just look at the instruction set, from vector dot products, through quaternion instructions, up to matrix*matrix in a single instruction. That's what is useful for making games. ( http://mrmrice.fx-world.org/vfpu.html )
-The Cell SPU: not a beauty for game programming, but if you really want a raw, simple, powerful CPU that just does SIMD and nothing else, this is where you should look. Once it's released with 32 cores, as some papers from IBM claim, and a higher clock, it will also have a lot of computation power, with even finer granularity, which is more important for raytracing than pure math.
-GPUs: with CUDA etc. you can already access all the power that Larrabee will have. You can already "enjoy" all the restrictions of writing code that has to follow the same code paths if you want efficiency, and with way less headache, because the compiler hides all of that.

Really, I'm not excited by Intel's Larrabee. They are just trying to sell their bull as a racehorse. It might be good for some other stuff, but not for racing/games/graphics.
Most software I've seen that people try to optimize is memory limited. Sure, they add SIMD, some old Carmack float2int tricks, some lookup tables, gain 50% speed and are happy, but they are not even aware of it. They count instructions, count cycles, but don't care about some movs that, in case of a cache miss, can take up to 500 cycles on a simple Intel CPU - and now guess what happens on an in-order CPU. Yes, you will spend most of your time optimizing for memory (if you're smart enough), and you'd get away a lot better with dedicated memory management like on the SPU (or VU). Adding prefetch instructions all over the place will strain the memory controller far more per cache line transferred than one dedicated 16kb DMA transfer would (if you take the scope of many cores into account). You don't believe that? Look at the headache Intel is accepting with their deferred rasterizer, just to reduce the memory issues they'd have with a simple forward renderer.
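To make the memory argument concrete, this is the kind of prefetching being criticized: a hedged sketch of a streaming loop with software prefetch a fixed distance ahead (the distance is a made-up tuning value, and each prefetch still occupies the memory controller one cache line at a time, unlike a single large DMA transfer):

```cpp
#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

// Sum an array while prefetching a few cache lines ahead of the current index.
// PREFETCH_DIST is a hypothetical tuning value; the right distance depends on
// memory latency and on how much work each iteration does.
float sumWithPrefetch(const float* data, std::size_t n)
{
    const std::size_t PREFETCH_DIST = 64;            // in elements, not bytes
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n)
            _mm_prefetch((const char*)(data + i + PREFETCH_DIST), _MM_HINT_T0);
        sum += data[i];                              // the "real" work
    }
    return sum;
}
```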


No, that's not a flame ;) (or is it? *haha*). But if you want an estimate of how much you'll gain with Larrabee, just implement your rasterizer or raytracer or... with SSE. You'll soon see how disappointing it is. Implement the same for a VPU and you'll know what to expect from Larrabee.
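If you want to try that experiment, a minimal sketch of the sort of SSE kernel involved: testing a packet of four rays against one sphere (assumed SoA layout, normalized directions, ray origins already expressed relative to the sphere center; traversal and the t > 0 check are omitted):

```cpp
#include <xmmintrin.h>

// Test four rays (SoA direction components dx,dy,dz; origin-minus-center
// components ocx,ocy,ocz) against a sphere of radius r.
// Returns a 4-bit lane mask: bit i set if ray i's discriminant is >= 0.
int raySphereHit4(__m128 dx, __m128 dy, __m128 dz,
                  __m128 ocx, __m128 ocy, __m128 ocz, float r)
{
    __m128 b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(ocx, dx), _mm_mul_ps(ocy, dy)),
                          _mm_mul_ps(ocz, dz));                   // dot(oc, d)
    __m128 c = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(ocx, ocx),
                                                _mm_mul_ps(ocy, ocy)),
                                     _mm_mul_ps(ocz, ocz)),
                          _mm_set1_ps(r * r));                    // dot(oc,oc) - r^2
    __m128 disc = _mm_sub_ps(_mm_mul_ps(b, b), c);                // b^2 - c
    __m128 hit  = _mm_cmpge_ps(disc, _mm_setzero_ps());           // disc >= 0
    return _mm_movemask_ps(hit);
}
```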

But try also to implement some fluid simulation; I bet Larrabee will be a real joy there ;)
Quote:mightypigeon
Multithreading is easier said than done, especially for a rasterizer


Also, don't forget that Intel is interested in ray tracing techniques, and ray tracing is, by basic principle and in many implementations, embarrassingly parallel down to the level of the pixel; that is, you could trivially subdivide the rendition of an image of size X*Y into X*Y threads, assuming that your routines are mostly state-free.

I am currently writing a CPU heightmap/implicit heightmap ray caster (for kicks and for training), where I use a concept similar to wave surfing, which introduces some state per vertical line. Even with that state, I could still split the rendering of an X*Y image trivially among X threads. Intel's TBB comes to mind.
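A minimal sketch of what that per-line split could look like with TBB (renderColumn is a hypothetical stand-in for the actual wave-surfing ray cast):

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Hypothetical per-column renderer: each vertical line x carries its own
// wave-surfing state, so columns can be rendered independently.
// Here it just writes a placeholder gradient.
static void renderColumn(int x, std::vector<float>& fb, int width, int height)
{
    for (int y = 0; y < height; ++y)
        fb[y * width + x] = float(x + y);   // stand-in for the real ray cast
}

void renderImage(std::vector<float>& fb, int width, int height)
{
    // One task per range of columns; TBB decides how to split the work.
    tbb::parallel_for(tbb::blocked_range<int>(0, width),
        [&](const tbb::blocked_range<int>& r) {
            for (int x = r.begin(); x != r.end(); ++x)
                renderColumn(x, fb, width, height);  // no shared mutable state across columns
        });
}
```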

Another idea is offline rendering of motion pictures. Not directly related to games, of course, but still an interesting topic: the rendition of an animation can (often, not always) easily be multithreaded by simply letting the n-th core render every n-th frame.


Btw, there are tons of details about LRB over at ompf.

Post-Disclaimer: I am not deep into the LRB topic, so I'll stop here :D
Quote:Original post by phresnel
Quote:mightypigeon
Multithreading is easier said than done, especially for a rasterizer


Also, don't forget that Intel is interested in ray tracing techniques, and ray tracing is, by basic principle and in many implementations, embarrassingly parallel down to the level of the pixel; that is, you could trivially subdivide the rendition of an image of size X*Y into X*Y threads, assuming that your routines are mostly state-free.
By threads, do you mean those 32 or 128 cores that you could use with Larrabee? Or are you calling every lane of the SIMD-16 a 'thread' (kind of like NVIDIA does with the G80)?
Because, while I would agree with you on the pixel:thread idea, how do you want to get benefits from massive SIMD? My concerns are more about random memory access than about the math (as the math could be easily masked out). Secondary rays especially are not that coherent anymore. In that case some very simple (even non-SIMD) core would be more ideal IMO.


Quote:
I am currently writing a CPU heightmap/implicit heightmap ray caster (for kicks and for training), where I use a concept similar to wave surfing, which introduces some state per vertical line. Even with that state, I could still split the rendering of an X*Y image trivially among X threads. Intel's TBB comes to mind.
there is some similar work done for SPUs that might interest you :)
http://www.power.org/resources/devcorner/cellcorner/CellTraining_Track1/CourseCode_L2T1H1-13_CellApplication_Affinity_mod.ppt


Quote:
Another idea is offline rendering of motion picture. Not directly related to games, of course, but still an interesting topic: The rendition of an animation can (often, not always) easily be multithreaded by simply letting the n-th core render every n-th frame.
Offline renderings usually have to handle tons of data per frame; I think that might be a limitation (if you assume you have 128 threads each rendering their own frame, you need 128x the per-frame memory).


Quote:Original post by phresnel
Also, don't forget that Intel is interested in ray tracing techniques


Definitely. They have been pushing that quite a bit lately too. But will game developers go down that path? Or will they stick with rasterizing?

Or go with something completely different, like voxels? Heh, I guess we'll have to wait and see.




