e.g. if you've got a fixed GPU that you can talk to directly, then instead of calling graphics API functions at all, you can pre-compute the stream of packets of bytes that you would be sending to this hardware device and you can create a big buffer containing these bytes ahead of time, in a tool... then at runtime you can load that file of bytes, and start streaming them through to the GPU directly. It will behave as if you were calling all the right API functions, but with virtually zero CPU usage... That's only applicable if your rendering commands are static, so in one situation this might give a 100x saving, whereas in another situation it gives no savings.
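a minimal sketch of that idea, assuming an invented packet format and a memory-mapped FIFO register - none of this matches any real GPU, it's just to show the shape of "bake once, stream at runtime":

```cpp
#include <cstdint>
#include <vector>

// Hypothetical opcodes: the packet layout is invented for illustration;
// a real fixed GPU defines its own wire format.
enum : uint32_t { OP_SET_TEXTURE = 1, OP_DRAW = 2 };

// "Tool" side: bake the packet stream once, ahead of time,
// and write it out to a file.
std::vector<uint32_t> bakeCommandStream() {
    std::vector<uint32_t> stream;
    stream.push_back(OP_SET_TEXTURE);
    stream.push_back(7);                 // texture id (made up)
    stream.push_back(OP_DRAW);
    stream.push_back(36);                // vertex count (made up)
    return stream;
}

// "Runtime" side: no API calls at all, just push the baked words
// at the device, e.g. through a memory-mapped FIFO register.
void submit(const std::vector<uint32_t>& stream, volatile uint32_t* devicePort) {
    for (uint32_t word : stream)
        *devicePort = word;
}
```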
honestly, I'm glad I don't ever work on projects where stuff like this is necessary. I watched that keynote where John Carmack talked about the texture strategy they used to get Rage to run well on the consoles - he talked a lot about how he hoped the graphics card companies would release drivers giving closer access to the hardware on PCs - basically saying that it's so much easier to optimize code for the consoles because of how close the API is to the hardware
I personally do not enjoy this type of programming at all though - it's interesting to read about, but I hate the idea of writing code to correctly swap memory in and out of here and make sure the data is being sent as fast as possible there... gosh, what a headache that sounds like
APIs usually wrap simple hardware operations in complex layers of abstraction. it always sounds like 'high level APIs' make life easier, but in reality it's the opposite: hardware can do so much more, so much faster, if you talk directly to it, and it's way simpler. e.g. the current 'hipster' marketing buzz of hUMA and unified memory architecture and what not - that's just a given if you work directly on the hardware. if you want, you can have just one memory allocator for the whole thing. allocating a texture is as simple as
myTexture = new uint32_t[width*height];
and there it is (ok, in reality you have to allocate with some alignment etc., but I think you get my point here).
you want to fill it with data?
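since the texture is just a block of memory, filling it is an ordinary write - a sketch continuing the allocation above (the helper name is made up; alignment and tiling concerns are ignored):

```cpp
#include <cstdint>
#include <cstring>

// Fill every pixel of a raw uint32_t texture with one RGBA value.
void fillSolid(uint32_t* texture, int width, int height, uint32_t rgba) {
    for (int i = 0; i < width * height; ++i)
        texture[i] = rgba;
}

// or stream an existing image straight in:
//   memcpy(myTexture, sourcePixels, width * height * sizeof(uint32_t));
```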
say you do HDR tonemapping and you want to read back the downsampled average tone, and you don't care if it's from the previous frame or even frame n-2, as you don't want to stall on this call (e.g. if someone has 4x SLI, you don't want a stall even on the n-2 frame, as you'd effectively kill the SLI parallelization).
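one common way to get that behaviour is a small ring of readback buffers: the GPU writes its result into one slot per frame, and the CPU reads the slot that was finished two frames earlier, so it never waits. a sketch of just the index math (the constant and names are illustrative):

```cpp
#include <cstdint>

// Latency ring for reading back the average tone without stalling:
// the GPU writes slot (frame % kLatency); the CPU reads the slot that
// was written kLatency-1 frames earlier. kLatency = 3 means you always
// consume the n-2 result and never block on an in-flight frame.
constexpr int kLatency = 3;

int writeSlot(uint64_t frame) { return int(frame % kLatency); }
int readSlot(uint64_t frame)  { return int((frame + 1) % kLatency); }
```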
and there are tons of non-obvious things. e.g. a drawcall on PC goes through several security layers before your driver is even called. the driver then has to figure out which states you've changed, which memory areas you've touched that should be synced to the particular rendering device, and finally it has to queue up the work and eventually add some synchronization primitives, as you might want to lock some buffer that sits midway through the whole big commandbuffer it created.
on console, a drawcall at the lowest level is simply a few words written into the commandbuffer.
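in code, that's on the order of this - a hypothetical sketch, the opcode and packet layout are invented and every GPU defines its own:

```cpp
#include <cstdint>

// Append one draw packet to the command buffer the GPU is consuming.
// Returns the advanced write pointer so the caller keeps writing from there.
inline uint32_t* draw(uint32_t* cmd, uint32_t vertexCount, uint32_t firstVertex) {
    *cmd++ = 0x00000004u;   // invented DRAW opcode
    *cmd++ = vertexCount;
    *cmd++ = firstVertex;
    return cmd;
}
```

no validation, no state tracking, no kernel transition - just three stores.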
that's why an old PS2 can push more drawcalls than your super-high-end PC. on consoles, nobody is really drawcall-count limited, simply because the hardware consumes a commandbuffer fast enough that you'll be limited elsewhere unless you do ridiculous stuff like making a drawcall per triangle. on PC, and especially on phones (iOS/Android), you are frequently limited by drawcalls. a 333MHz PSP can draw more objects than your latest 2GHz quad-core cellphone with a GPU close to X360/PS3 performance.
APIs make sense for keeping stuff compatible, but I somehow doubt they make anything easier. in a lot of cases they have ridiculous limitations, and a lot of the time people just try to work around those. e.g. we had register limits for shaders that made it necessary to introduce pixel shader 2.0a and 2.0b, which was essentially one version for ATI and one for NV, as they could not work around the API limitation in the drivers the way they so frequently do in other cases.
it's no different nowadays: modern hardware supports 'bindless' resources, which means you can just set pointers to textures etc. and use those. nvidia supports some extensions which, in combination, allow you to draw the whole scene with very few drawcalls. but it's an extension; it will take time until directx maybe supports it, and in reality it's again just a workaround for the APIs. on console you won't need that kind of multidraw, as you can simply reach the HW limit by pushing individual drawcalls.
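the kind of combination being alluded to is likely bindless textures plus multi-draw indirect: on the CPU you fill an array of indirect draw commands, upload it to a buffer, and a single glMultiDrawElementsIndirect call replaces thousands of individual drawcalls. the struct layout below follows ARB_multi_draw_indirect; the scene and example values are made up:

```cpp
#include <cstdint>
#include <vector>

// CPU side of OpenGL's multi-draw-indirect path. This layout matches
// DrawElementsIndirectCommand from ARB_multi_draw_indirect.
struct DrawElementsIndirectCommand {
    uint32_t count;          // index count for this draw
    uint32_t instanceCount;  // usually 1 unless instancing
    uint32_t firstIndex;     // offset into the shared index buffer
    int32_t  baseVertex;     // offset added to each index
    uint32_t baseInstance;   // per-draw id, readable in the shader
};

// Build one command per object; the whole array is then uploaded to a
// GL_DRAW_INDIRECT_BUFFER and submitted with one API call.
std::vector<DrawElementsIndirectCommand> buildDraws() {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.push_back({36, 1, 0, 0, 0});     // e.g. one cube
    cmds.push_back({600, 1, 36, 24, 1});  // e.g. one grass clump
    return cmds;
}
```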
and if someone doesn't like to just send drawcalls, then I'd suggest that person shouldn't fiddle around with ogl/d3d at all; it's so much easier to get some engine that deals with that for you. you can still modify all aspects, but you don't have to, and at that point you wouldn't care what the engine does beneath. actually, you'd maybe still want it to run directly on hardware, because otherwise you build a level, it's very low-poly, yet it becomes slow, and you get told (even as an artist): "well, you cannot have more than 2500 visible objects on screen, it's slow. yes, I know you have all those tiny grass pieces that should render in a millisecond, but those are 2k drawcalls. go and combine them, but don't make them too big, we don't want to render all the invisible grass either..." huf, fun with that instead of building another fun map