
Member Since 15 Dec 2001

#5174826 So... C++14 is done :O

Posted by phantom on 19 August 2014 - 02:37 PM

A small number of developers push back when they see new things in code, but most will eventually give in. The language has changed, and they need to either adapt or die (figuratively).  I have not seen any serious resistance to adopting the features when they are available on all the target platforms.

Annoyingly at the last place I worked I did see this, with arguments (without seeing code I might add) that C++11 features are 'too easy to misuse' or 'ugly' or 'hard for new people to understand' (although my reaction to that last one was 'they need to keep up with the language then'). I was pushing for adoption (mostly in C++ tools, with some features as the compilers allowed, MS being the limiting factor) but it felt like hard going.

New company is actively picking up C++11 features as they make sense/are supported; the main barrier for non-platform specific code using C++11 features was Android, but with moving across to Clang the limit is pretty much, again, the MS compilers which dictate feature level.

The difference seems to be 'older, battle scarred, comfortable' developers vs 'keen and punchy' developers (not young, by any means, but just more energy for things) and that itself might come down to the company in question and the mindset of those working there.

#5174652 So... C++14 is done :O

Posted by phantom on 19 August 2014 - 03:24 AM

C++14 would frame that more like;
std::for_each(std::begin(Values), std::end(Values), [](auto value)
and once you get used to reading that then there is always this fun setup...

values.erase(std::remove_if(std::begin(values), std::end(values), [](auto value) { return !value->alive(); }), std::end(values));
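For anyone who wants to try the erase/remove_if setup, here is a minimal sketch of it wrapped in a helper. The `Unit` type and the `removeDead` name are made up for illustration; only `alive()` comes from the line above. Note that the erase call is a member function of the container, and that the generic lambda (`auto` parameter) needs C++14.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical element type; only alive() is taken from the post.
struct Unit {
    bool isAlive;
    bool alive() const { return isAlive; }
};

// Erase-remove idiom: remove_if shuffles the survivors to the front and
// returns the new logical end, then the container's erase() drops the tail.
template <typename Container>
void removeDead(Container& values) {
    // C++14 generic lambda: works for raw pointers and smart pointers alike.
    const auto isDead = [](const auto& value) { return !value->alive(); };

    values.erase(std::remove_if(std::begin(values), std::end(values), isDead),
                 std::end(values));

    // Postcondition: nothing dead is left in the container.
    assert(std::none_of(std::begin(values), std::end(values), isDead));
}
```

Taking the element by `const auto&` rather than by value avoids copying smart pointers; for raw pointers it makes no practical difference.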

#5174647 So... C++14 is done :O

Posted by phantom on 19 August 2014 - 03:09 AM

Why would you write the loop the 2nd way when you have std::for_each and lambda functions?

#5173053 Can't solve problems without googling?

Posted by phantom on 12 August 2014 - 04:34 AM


When at work, when confronted by a problem my first thought is ALWAYS "I wonder if anyone has had this issue and solved it already..." and away to google I'll go to see if that's the case and to learn from something someone else has already solved.

The 'learn from' part is pretty key; if you find a solution, copy and paste it and don't take in why it worked then you've learnt nothing. If you take a moment or 7 to think about the solution, how it works and how to refine it to your situation then yay! experience!

And that's where the ability to solve 'new' problems comes from; experience.

Most seemingly new solutions will always have a seed in an existing solution to an existing problem which someone then thinks about some more, applies knowledge and experience gained over the years to refine the solution and create something new.

Nothing is created in a vacuum and all new ideas are fed from old ideas, and as you gain more knowledge you'll find yourself doing this naturally.

Problem solving itself is nothing more than looking at something and then taking it apart so that one big problem becomes two smaller ones, which in turn become two smaller ones and so on. This gets easier with experience and practice as you get used to it and solve problems.

The key point is 'research' isn't a dirty word.

#5171319 Where to store container of all objects/actors/collision models/etc

Posted by phantom on 03 August 2014 - 04:55 PM

Why is having Tree and Missile be completely separate classes (that both have the common functionality of being in a location and needing to be drawn to the screen) a better solution?

Because you start coming up with crazy inheritance structures just to add some extra functionality.

Let's say you start with something which needs to be drawn; Actor { Vector2 position; Sprite* graphic; }
But now you need one to be able to have collisions in some cases; CollidableActor : public Actor { BoundingBox box; }
But now you need one to also be moveable in some cases; MoveableCollidableActor : public CollidableActor { void Move(); }
But now you need one to also be able to fire; ShootableCollidableActor : public MoveableCollidableActor { void Fire(); }
Oh, but now you need a tank; Tank : public ShootableCollidableActor { void Fire(); void Move(); /* because tanks move differently */}
Oh, but now you need a tank which also has rockets; RocketTank : public Tank { void FireRocket(); }
And now I want a Jeep with no gun; Jeep : public MoveableCollidableActor { void Move(); /* because Jeeps move differently */}
And now I want a Jeep with a gun; GunJeep : public ShootableCollidableActor { void Fire(); }
And now I want a turret which doesn't move and ooops.. all my shootables are moveables...

At which point you have a rigid hierarchy and things in the wrong place.
If you try to resolve it by moving 'shoot' down and 'move' up then you end up with anything which can move must shoot, but our Jeep doesn't so that's wrong.

This is why the advice is that you should prefer composition over inheritance by default.

In this case;
Actor { Vector2 position; Sprite * graphic }
Tank { Weapon * primaryWeapon; Actor *; CollisionObject * collision; MoveLogic * movelogic; }
RocketTank : Tank { RocketWeapon * secondaryWeapon; }
Jeep { Actor *; CollisionObject * collision; MoveLogic * movelogic; }
GunJeep : Jeep { Weapon * primaryWeapon }
Turret { Actor *; Weapon *; CollisionObject *}

There is still inheritance there but they are short chains (Jeep -> GunJeep, Tank -> RocketTank, MoveLogic -> {TrackedMove, WheelMove}) and are not bound so if you wanted to make a new object like a helicopter it is easy to plug it in and give it more weapons.

Chopper { Weapon* weaponArray[4]; Actor *; CollisionObject * collision; MoveLogic * movelogic; }

So we could build a Chopper with 2 tank guns and 2 rocket guns without having to worry about their firing logic.

(There is also an added bonus that these entities don't have to do their own processing; a container somewhere could hold all active CollisionObjects and process them at once, this means nice cache bonuses for code and data).
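To make the composition idea a bit more concrete, here is a rough C++14 sketch. It is only one way to lay it out: the `GameObject` type, the factory functions and the movement amounts are all invented for illustration, and the sprite and collision parts are left out to keep it short. The point is that "can move" and "can shoot" become optional parts rather than base classes, so a turret is just an object with a weapon and no move logic.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Vector2 { float x = 0.0f; float y = 0.0f; };

// Short inheritance chain for movement only (MoveLogic -> TrackedMove/WheelMove).
struct MoveLogic {
    virtual ~MoveLogic() = default;
    virtual void Move(Vector2& position) const = 0;
};
struct TrackedMove : MoveLogic {
    void Move(Vector2& position) const override { position.x += 1.0f; }
};
struct WheelMove : MoveLogic {
    void Move(Vector2& position) const override { position.x += 2.0f; }
};

// Weapons are parts too; each one just counts its shots in this sketch.
struct Weapon {
    virtual ~Weapon() = default;
    virtual void Fire() { ++shotsFired; }
    int shotsFired = 0;
};
struct RocketWeapon : Weapon {};

// Hypothetical container type: an object is assembled from optional parts.
struct GameObject {
    Vector2 position;
    std::unique_ptr<MoveLogic> moveLogic;          // null for a turret
    std::vector<std::unique_ptr<Weapon>> weapons;  // empty for a plain jeep

    bool CanMove() const { return moveLogic != nullptr; }
    void Move() { if (moveLogic) moveLogic->Move(position); }
    void FireAll() {
        for (auto& weapon : weapons) {
            assert(weapon != nullptr);
            weapon->Fire();
        }
    }
};

GameObject MakeJeep() {
    GameObject jeep;
    jeep.moveLogic = std::make_unique<WheelMove>();
    return jeep;
}

GameObject MakeTurret() {
    GameObject turret;  // no move logic: it can shoot but never moves
    turret.weapons.push_back(std::make_unique<Weapon>());
    return turret;
}

GameObject MakeRocketTank() {
    GameObject tank;
    tank.moveLogic = std::make_unique<TrackedMove>();
    tank.weapons.push_back(std::make_unique<Weapon>());
    tank.weapons.push_back(std::make_unique<RocketWeapon>());
    return tank;
}
```

Adding a helicopter with four weapons is then a new factory function rather than a new branch in a class hierarchy.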

#5167042 Directx Shader calcualtions.

Posted by phantom on 15 July 2014 - 01:33 PM

Simplest way?
Check thread id; if thread id = 0 then loop over the results, add them up and write them out.

Probably the better performing way; Do a reduction in stages.
If you have, for example, 64 outputs from the first pass then during the summing phase;
- Thread 0 adds result[0] + result[1]
- Thread 1 adds result[2] + result[3]
- Thread 2 adds result[4] + result[5]
and so on up to thread 31 (the rest do nothing)

then repeat, each time halving the thread count doing the summing; your final calculation will be the result.
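The staged reduction can be sketched on the CPU to check the indexing. This is plain C++ standing in for the shader, so the `reduceInStages` name and the handling of an odd element are my own assumptions; in a real compute shader each pass of the inner loop would be one thread, and you would need a barrier between stages so every thread sees the previous stage's writes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// CPU-side simulation of the staged reduction. In each stage "thread" t adds
// elements 2*t and 2*t+1 and writes the sum to slot t of the next buffer, so
// the number of active threads halves every stage.
// Assumes a non-empty input; an odd element out is carried forward unchanged.
float reduceInStages(std::vector<float> results) {
    assert(!results.empty());
    while (results.size() > 1) {
        const std::size_t activeThreads = results.size() / 2;
        std::vector<float> next((results.size() + 1) / 2);
        for (std::size_t thread = 0; thread < activeThreads; ++thread) {
            next[thread] = results[2 * thread] + results[2 * thread + 1];
        }
        if (results.size() % 2 != 0) {
            next.back() = results.back();
        }
        // On a GPU this is where the barrier between stages would go.
        results = std::move(next);
    }
    return results[0];
}
```

With 64 inputs this runs 6 stages using 32, 16, 8, 4, 2 and 1 active threads.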

#5166690 hyphotetical raw gpu programming

Posted by phantom on 14 July 2014 - 03:54 AM

when speaking about scheduler, i understand that those 64 big threads are managed by that scheduler? here i do not understand or at least im not sure - i may suspect that this scheduler comes between workloads and those 64 big threads
I may suspect that each workload is seperate assembly program
and threads are dynamically assigned to those workloads, maybe that could have some sense

The CU scheduler dispatches work to SIMD units; those SIMD units work in groups of 64 threads, 16 at a time, as described in my reply above.
The workloads can be separate or the same programs, depending on the work requirements; You could have 40 instances of the same program running, or 40 different programs working on the same CU, split across the 4 SIMDs.
The threads are not dynamically assigned; an instance of the program is assigned to the SIMD at start up, registers are allocated and that work will always stay on that SIMD unit and will always execute in banks of 64 threads (You can ask for fewer threads, but that just means cycles go to waste: the thread count is rounded up to the next multiple of 64 and the difference between that and what you asked for goes unused. So if you only dispatch 32 work units then 32 threads are going unused. If you dispatch 96 then you'll require two groups of 64 threads to be dispatched and again 32 will go unused).
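The rounding is easy to check with a couple of lines of arithmetic. A small sketch, with function names of my own choosing and the 64-thread wavefront size taken from the explanation above:

```cpp
#include <cassert>

constexpr unsigned kWavefrontSize = 64;  // threads per wavefront on GCN

// Number of 64-thread groups needed to cover the requested work items.
constexpr unsigned wavefrontsNeeded(unsigned workItems) {
    return (workItems + kWavefrontSize - 1) / kWavefrontSize;
}

// Threads that get allocated but have no work item to process.
constexpr unsigned unusedThreads(unsigned workItems) {
    return wavefrontsNeeded(workItems) * kWavefrontSize - workItems;
}
```

So dispatching 32 items wastes 32 threads in one wavefront, and dispatching 96 needs two wavefronts and again wastes 32.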

All allocation of workloads and registers is static for the lifetime of the program.

this is different than i thought because this involves this input assembly to be defined on some width of data, i mean not some normal scalar assembly but some width-assembly
yet my original question was how those input assembly routines are provided for execution and also how results are taken back, (there must be some way some function pointers interpreted by hardware as routines to execute or something like that)

There is no 'width assembly' (beyond the requirement to enable 64bit float mode, but that would be a mode switch in the instruction stream itself) as all SIMD units are scalar; vector operations in GLSL/HLSL/OpenCL are decomposed to scalar operations and these are what the SIMD units see. The number of workgroups required is handled outside the CU at the GPU command processor stage where either the graphics command processor or async compute engine consumes instruction packets to set up the CU to perform work.

The work is provided by the front end command processors which consume their own instruction stream.
The process for setting up an execution would look something like this;
- host DMA's program code into GPU memory
- command inserted into command processor's instruction stream telling it where to find the program code and the parameters for it
- command processor executes instructions to set up workgroup and dispatch work to CU
- CU scheduler is given data (internally routed) which includes address of program code in GPU memory
- CU scheduler assigns this address as the instruction pointer to the SIMD that will deal with it
- CU scheduler then schedules instructions from any SIMD workloads it has internally

This is very much like how a normal CPU works in many regards; in that an instruction pointer is loaded and execution proceeds from there; the only difference is the program has to be uploaded by a host and then two schedulers are involved in dispatching the work (first as a group and then at a per-instance level).

To get the results back to the host you'd have to copy them back from GPU memory, either via a DMA transfer or by having the memory in the CPU's address range and accessing directly.
Either way you'd get whatever the gpu wrote out.

The GPU can also send back details to the host via a return channel/memory stream which allows you to do things like look for markers and know when instructions are complete so you know when it is safe to operate on the memory.

I am also curious what it is with results, if i provide three workloads can i run them asynchronously then get a signal
that first is done then use the result as an input for some next
workload etc - I mean if i can build some pre scheduler loop
that constantly provides workloads and consumes the results - that was the 'scheduling code' i had somewhat on my mind
- is there something like here to run on gpu or this is just to write on cpu side?

In theory, if directed to the right front end command processor, then yes.
The async command processors in the GCN architecture can communicate between each other which would allow you to set up task graphs between them to do as you say; this would be done using flags and signals in memory, with each ACE waiting on and signalling the correct one.
However last I checked this wasn't currently exposed on the PC.

#5166687 hyphotetical raw gpu programming

Posted by phantom on 14 July 2014 - 03:29 AM

what is SIMD unit, what is its size is it float4/int4 vector on each simd? I may suspect so but i cannot be sure ;
further there is saying about 10 wavefronts, why 10? how many CUs is in this card?
also i dont understand what means simd unit, and what does mean thread here, when speaking about 4 groups of 16 simd units it is meant that there are 64 'threads' each one is working on float4 (or int4  i dont know) 'data packs'

A SIMD unit is the part of the CU which is doing VGPR based ALU work. So if you issue an instruction to add two vectors together this is the unit which does the work.
The SIMD units are scalar in nature however; you have 16 threads which execute the same instruction at the same time but the data is different. Each one is working on a 32 bit float or int, or 64bit double, during this work. This means that vectorised work requires more clock cycles to complete as they are done as separate operations. So a vec2 + vec2 would take 2 instructions per thread to complete (x+x & y+y).

A 'thread' is an instance of data grouped together; they aren't quite the same as CPU threads because CPU threads operate independently whereas on a GPU you'll have a number of threads executing the same instruction (64 on AMD, 32 on NVidia are typical numbers). So instruction wise they move in lock step but data wise they are separate.

The 4 groups of SIMD units mean just that; you have 4 groups of 16 threads which are operating on different wavefronts independently. Work is never scheduled across SIMD units and once assigned to a SIMD unit it won't be moved off.

Each SIMD, per wavefront, works on 64 threads at a time; as it has room for 16 threads to be executing at once this means that for any given instruction it takes at least 4 clock cycles for it to complete and for more work to be issued. So, returning again to our vec2 + vec2 example this would take 8 clock cycles to complete (assuming 32bit float);

0 : Threads 0 - 15 execute x+x
1 : Threads 16 - 31 execute x+x
2 : Threads 32 - 47 execute x+x
3 : Threads 48 - 63 execute x+x
4 : Threads 0 - 15 execute y+y
5 : Threads 16 - 31 execute y+y
6 : Threads 32 - 47 execute y+y
7 : Threads 48 - 63 execute y+y

Note; this might not happen like this as between cycle 3 and 4 an instruction from a different wavefront might be issued so in wall clock time it could take longer than 8 cycles to complete the operation.
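The cycle table above can be generated mechanically, which makes it easy to see what a vec3 or vec4 would cost. A sketch only: the `IssueSlot` struct and function names are invented here, and it models the best case where no other wavefront is interleaved.

```cpp
#include <cassert>
#include <vector>

constexpr unsigned kThreadsPerWavefront = 64;
constexpr unsigned kLanesPerSimd = 16;  // threads a SIMD executes per clock

// One row of the table: which threads run which component on which cycle.
struct IssueSlot {
    unsigned cycle;
    unsigned firstThread;
    unsigned lastThread;
    unsigned component;  // 0 = x, 1 = y, ...
};

// Minimum cycles for one component-wise operation over a whole wavefront.
constexpr unsigned minCyclesForVectorOp(unsigned components) {
    return components * (kThreadsPerWavefront / kLanesPerSimd);
}

// Builds the best-case schedule, one component at a time, 16 threads a cycle.
std::vector<IssueSlot> scheduleVectorOp(unsigned components) {
    std::vector<IssueSlot> slots;
    unsigned cycle = 0;
    for (unsigned component = 0; component < components; ++component) {
        for (unsigned first = 0; first < kThreadsPerWavefront; first += kLanesPerSimd) {
            slots.push_back({cycle, first, first + kLanesPerSimd - 1, component});
            ++cycle;
        }
    }
    return slots;
}
```

For a vec2 this gives the 8 rows listed above; a vec4 add would need at least 16 cycles.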

While this is how the GPU works internally we conceptually think of it as all 64 threads operating on the same instruction at the same time; because nothing can pre-empt the work during an instruction being operated on over the 4 clock cycles you can treat it as if all 64 operations happened at once as the observable result is the same.

The number of wavefronts per SIMD is simply a case of that's how the hardware was designed; probably AMD did some simulation work and, between that and the number of transistors required to support more, 10 was the sweet spot. It also makes sense given the nature of the scheduler as it can dispatch up to 5 instructions per clock from 5 of 10 wavefronts; this means that you theoretically have twice the required number of threads 'waiting' to issue work than can be serviced, but it also means that if threads stall out you have others waiting to take over. If, for example, wavefront 0 is waiting on data from memory to come in and can't issue work then there are still another 9 to choose from to try and issue all 5 instructions from.

The number of CUs depends on the cost of the GPU; a top end 290X will have 44, others will have fewer; this is just a function of the cost of the hardware, nothing more.

#5166468 hyphotetical raw gpu programming

Posted by phantom on 12 July 2014 - 04:07 PM

got no time and skills to read many docs but will try just to 'deduce' something from this info - it is worse but consumes less time, than in future i will try to read more

You might as well stop until you've got the time then; my initial explanation contains pretty much all the details but you are asking questions which are already answered, you just lack the base knowledge to make sense of them.

Your comparison with CPUs is still incorrect because a CPU only schedules instructions from a single stream per core/hardware thread; it requires the OS to task switch. A GPU is automatically scheduling work from up to 5 threads, from a group of 10, per clock BEFORE the instructions are decoded and run on the correct unit - and that is just the CU level.

You REALLY need to go and read plenty of docs if you didn't understand my explanation because this isn't an easy subject matter at all if you want to understand the low level stuff.

#5166420 hyphotetical raw gpu programming

Posted by phantom on 12 July 2014 - 09:39 AM

I dont quite see what this scheduling device is doing, i see you say its something like microcode in cpu.. that is dispatching one assembly stream into channels blocks etc.. If so does that mean that the gpu is able to execute only one input assembly stream
and only parallelises it internally? So even if IP (instruction pointers) are separate those processors are not free to use
as those are covered by something as microcode manager?

(again, focusing on AMD's as it has the most documentation out there).

You are thinking about things at the wrong level; the GPU is doing more than one thing at once across multiple SIMDs inside multiple compute units (CU) - it's generally best, when talking on this level not to refer to the GPU at all but the internal units.

A stream of instructions is directed at a SIMD in a CU, and each SIMD can maintain 10 such instruction streams itself (so it has 10 instruction pointers). Each CU has four SIMD so it can keep 40 instruction streams in flight at once (each one made up of 64 threads, or instances, of the instruction stream which can have their own data but execute the same instruction).

However the SIMD don't decide what is executed next because the CU has shared resources the programs need to use which is why each CU has a scheduler deciding what to run next. The simplest part of this is deciding which SIMD unit to look at to get each instruction stream (it uses a simple round-robin system), after that it looks at all the wavefronts/instruction streams being executed and decides what to run next.

The choice is based upon the current state of the CU; for example if one wavefront wants to execute a scalar instruction but the scalar unit is currently busy then it won't get to execute. Same goes for local memory reads and writes as well as global reads and writes; if other SIMD wavefronts have taken up the resource then the work can't be carried out.

The reason this needs to be pretty quick is each clock cycle the scheduler has to look at the state of up to 10 wavefronts and decide which instructions to execute; this isn't something which is going to work very well if written in software as a single clock cycle would, at best, be enough to run one instruction.

So, if you want to think about it at the GPU level then if we take the R290X version of the GCN core; it can be running 44 CU * 4 SIMD * 10 waves of work at any given time; that work could be from one program or it could be from 1760 different programs/instruction streams. (Which equates to 112,640 instances of programs running at once) and every cycle 1/10th of those are looked at and work scheduled to run.
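If you want to sanity check those totals, the multiplication is short enough to write down. The constants are the R290X figures quoted above; the names are just mine.

```cpp
#include <cassert>

// Figures quoted in the post for the R290X version of GCN.
constexpr unsigned kComputeUnits = 44;
constexpr unsigned kSimdsPerComputeUnit = 4;
constexpr unsigned kWavesPerSimd = 10;
constexpr unsigned kThreadsPerWave = 64;

// Instruction streams (wavefronts) one CU can keep resident.
constexpr unsigned wavesPerComputeUnit() {
    return kSimdsPerComputeUnit * kWavesPerSimd;
}

// Instruction streams the whole GPU can keep resident.
constexpr unsigned wavesInFlight() {
    return kComputeUnits * wavesPerComputeUnit();
}

// Individual program instances (threads) across the whole GPU.
constexpr unsigned threadsInFlight() {
    return wavesInFlight() * kThreadsPerWave;
}
```

That gives 40 streams per CU, 1760 across the chip and 112,640 threads, matching the numbers in the text.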

#5166334 hyphotetical raw gpu programming

Posted by phantom on 11 July 2014 - 07:04 PM

You guys are wayyy over my head with this stuff.  I'm kinda with fir on this; I only have a vague notion of what a GPU does, but I figure it's like he says, a vast array of memory as data input, a similar vast array as output, and a set of processors that read and process instructions from yet another array of memory to transform the input to the output.  Is that not the case?

At a high level, yes that could be the case but that's taking the bird's-eye view of things :)

Do all the processing units always work in lock-step or can they be divided into subgroups each processing a different program on different input sets?

Yes and no.

This is where things get fun as it immediately depends on the architecture at hand. I'll deal with AMD's latest GCN because they have opened a lot of docs on how it works.

The basic unit of the GPU, the building block, is the "Compute Unit" or "CU" in their terminology.

The CU itself is made up of a scheduler, 4 groups of 16 SIMD units, a scalar unit, a branch/message unit, a local data store, 4 banks of vector registers, a bank of scalar registers, texture filter units, texture load/store units and an L1 cache.

The scheduler is where the work comes in and where things kick off being complicated right away as it can keep multiple program kernels in flight. A single scheduler can keep up to 2560 threads in flight at once and each cycle can issue up to 5 instructions to the various units from any of the kernels it has in flight.

The work itself is divided up into 'wavefronts'; these are groupings of 64 threads which will be executing in lock step.

So the work spread is up to 10 waves of 64 threads on each of the 4 SIMD units (40 waves in total).
Each of these waves could come from a different program.

Each clock cycle a SIMD is considered for execution, at which point each wave on that SIMD gets a chance to execute an instruction (at most 1) and up to 5 instructions can be issued (Vector ALU, Vector Memory read/write/atomic, Scalar (see below), branch, local data share, export or global data share, a special instruction). Note: there are more instruction types than can be issued at once, and only one of each type can be issued per clock.

(The scalar unit is an execution unit in its own right; the scheduler issues instructions to it but they can be ALU, memory or flow control instructions. Up to 1 per clock can be issued.)

The SIMD units aren't vectored however; to perform a vector operation on a SIMD takes 4 cycles. So if you were doing a vec4 + vec4 on SIMD0 it would take 4 cycles per component before the result was ready and the next instruction can be issued - the work is effectively issued as 4 add instructions across 64 threads run in groups of 16. (However during the remaining 3 cycles the scheduler will be considering SIMD1-3 for execution so work is still being done on the CU).
(For sanity's sake however we basically pretend that all 64 threads in a work group execute at the same time; it's basically the same thing from a logical point of view.)

So, in one CU, at any given time, up to 10 programs can be running per SIMD with 40 programs in flight in the CU managing up to 2560 threads of data. This is a theoretical maximum however as it depends on what resources the CU has; the vector register banks are statically allocated so if one program comes along and grabs all of them on one SIMD then no more work can be issued on it until it has been completed. This memory file is 64KB in size which means you have 16384 registers (64KB/4byte) per SIMD, however this is statically shared across all wavefronts so if, for example, you have a program where each thread requires 84 registers the SIMD can only maintain 3 wavefronts in flight as it doesn't have the resources for any more (3x64x84 = 16128, to issue another wavefront from the same kernel would require another 5376 registers it doesn't have space for). (In theory the SIMD could be handed off another program which only required 3 vgprs to work so another wavefront could be launched but in practice that is unlikely.)
(SGPR are also limited across the whole CU as the scalar unit is shared between all SIMDs.)
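The register budget works out as a simple division, sketched below. The function name is hypothetical and the model is deliberately naive: it uses only the numbers quoted above (64KB file, 4 byte registers, 64 threads per wavefront, at most 10 wavefronts per SIMD) and ignores any allocation granularity or per-thread limits the real hardware may impose, as well as SGPR and local memory limits.

```cpp
#include <algorithm>
#include <cassert>

constexpr unsigned kRegisterFileBytes = 64 * 1024;  // vector register file per SIMD
constexpr unsigned kBytesPerRegister = 4;
constexpr unsigned kRegistersPerSimd = kRegisterFileBytes / kBytesPerRegister;  // 16384
constexpr unsigned kThreadsPerWavefront = 64;
constexpr unsigned kMaxWavefrontsPerSimd = 10;

// How many wavefronts of one kernel fit on a SIMD, given the number of
// vector registers each thread needs. Simplified model, see the note above.
unsigned maxWavefrontsPerSimd(unsigned vgprsPerThread) {
    assert(vgprsPerThread > 0);
    const unsigned registersPerWavefront = vgprsPerThread * kThreadsPerWavefront;
    const unsigned limitedByRegisters = kRegistersPerSimd / registersPerWavefront;
    return std::min(limitedByRegisters, kMaxWavefrontsPerSimd);
}

// Registers left over on the SIMD once those wavefronts are resident.
unsigned spareRegisters(unsigned vgprsPerThread) {
    return kRegistersPerSimd -
           maxWavefrontsPerSimd(vgprsPerThread) * vgprsPerThread * kThreadsPerWavefront;
}
```

With 84 registers per thread this gives 3 wavefronts and 256 spare registers, which is why only a very small kernel could squeeze in alongside. Under this model a kernel has to stay at 25 registers or fewer per thread to reach the full 10 wavefronts.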

So, given an easy program flow which is only 64 threads in size.
- Program is handed off to the CU
- CU's scheduler assigns it to a SIMD unit
- Each clock cycle the scheduler looks at a SIMD unit and decides which instructions from which wavefront is executed.

If you have more than 64 threads in the group of work, then this would be broken up and spread across either different SIMDs or different wavefronts in the same SIMD. It will always reside on the same CU however; this is because of memory barriers etc needed to treat the execution as one group.
(The 64 thread limit is useful to know however because if you write code which fits into a wave front then you can assume all 64 threads are at the same place at the same time so you can drop atomic operations for operating on local memory stores etc).

There is also a lot not covered here as the GPU requires you manage the cache yourself for memory read/write operations and there is a lot of complex detail, most of which is hidden by the graphic/compute API of choice which will Do The Right Thing for you.

Of course a GPU isn't made up of just one CU; a R290X for example has 44 CUs which means it can have up to 112,640 work items in flight at once.

Pulling back out from the CU we arrive at the Shader Engine; this is a grouping of N CUs which contain the geometry processor, rasterizer and ROP/render backend units - the GP and Rasterizer push work into the CU; the ROP take 'exports' and do the various graphics blending operations etc to write data out.

Stepping back up from that again we come to the Global Data Store and L2 cache which is shared between all the Shader Engines.

Feeding all of this is the GPU front end which consists of a Graphics Command Processor and Asynchronous Compute Engines (ACEs); AMD GPUs have one GCP and up to 8 ACEs, all of which operate independently of each other. The GCP handles traditional graphics programming tasks (as well as compute), whereas the ACEs are only for compute work. While the GCP only handles the graphics queue the ACEs can handle multiple command queues (up to 8 each) meaning that you have 64+ ways of feeding commands into the GPU.

The ACEs can operate out-of-order internally (theoretically allowing you to do task graphs on the GPU) and per-cycle can create a workgroup and dispatch one wavefront from that workgroup to the CUs.

So, a compute flow would be;
- work is presented to GCP or ACE
- workgroup is created and wavefront dispatched to a CU
- CU associates wavefront with SIMD
- each clock cycle a CU looks at a wavefront on a SIMD and dispatches work from it.

Data fetches themselves in the CU are effectively 'raw' pointer based; typically some VGPR or SGPR are used to pass in tables of data, effectively base addresses, at which point the memory can be fetched. (There is a whole L1/L2 cache architecture in place).

There are probably other things I've missed (bank conflicts on Local data store springs to mind...) but keep in mind this is specific to AMD's GCN architecture (and if you want to know more/details then AMD's developer page is a good place to go; white papers and presentations can be found there - even I had to reference one to keep the numbers/details straight in my head).

NV is slightly different and the mobile architectures are going to be very different again (they work on a binned-tiled rendering system so their data flow is different), as are the older GPUs and in a few years probably the newer ones too.

#5166247 Giving interviews while already in a job

Posted by phantom on 11 July 2014 - 10:37 AM

Interviewing while working somewhere is pretty much the norm so don't worry about that; at most just don't go advertising the fact to where you currently work in case you can't find somewhere right away - if only to stay classy :)

As for why you are leaving; I've always found the truth can't hurt, depending on the truth - if you aren't happy somewhere then saying so with qualifiers is unlikely to do you any harm, just don't tear into the company you currently work for :) However after four years you can always go with the tried and tested "I've been in the job for a while now and I'm just looking for a new challenge and something more interesting to do outside of that area" - I pretty much used a combination of the two when interviewing for my current position (while at my previous job).

#5163903 very strange bug (when runing c basic arrays code)

Posted by phantom on 30 June 2014 - 01:35 PM

No, this isn't "propaganda" this is a basic skill requirement for any programmer.

Learn to use a debugger and you can solve problems like this in no time.

Your outright refusal to learn to use the basic tools of the trade is wasting your time and, more importantly, everyone else's, and if it wasn't for the fact I'm involved in this thread I'd close it right now, because you do not deserve the community's time and help when you refuse to pick up even a basic level of competence in the tools of the trade.

#5163899 very strange bug (when runing c basic arrays code)

Posted by phantom on 30 June 2014 - 01:27 PM

i forgot this, sad 8 is not much - but still there is this unalignment problem - this seem abnormal to not align this on stack

Maybe it is aligned.
Maybe it isn't.

Do you know what would tell you?


It REALLY isn't that hard...

#5163877 very strange bug (when runing c basic arrays code)

Posted by phantom on 30 June 2014 - 12:10 PM

No, it's not a matter of 'what values' it's a matter of YOU learning to use the RIGHT TOOLS to debug this.

Given the call stack, memory dumps and other information this would probably be trivial to work out - use the right tool and stop trying to get others to debug things blindly for you.

This kind of thing is PRECISELY why the tools exist.
(Same with using a real profiler to work out performance problems.)