OpenGL and DirectX use different coordinate systems (the handedness differs), so you can't use the same matrix for both APIs. At some point you need to convert between them. OpenGL has three texture coordinates (when using 3D textures), but you can ignore the third one when using only UV mapping, though you still need a 4x4 matrix. If DirectX only handles UV mapping, then a 3x3 matrix would be sufficient.
I'm not sure how this applies to a texture matrix, and anyway D3D has had the capability to use right-handed coordinates since at least D3D8, if not earlier, so handedness differences haven't been relevant for a long time elsewhere either...
When dealing with GL_TEXTURE matrices, keep in mind that each texture unit has its own matrix. You can select each unit in turn with glActiveTexture (GL_TEXTURE0 through GL_TEXTURE7, say) and set a matrix for every one of them, so if you do that, make sure you reset them afterwards or all your textures will get F'd.
I'll add: the reset is done with a glLoadIdentity when you're finished, making sure that the correct texture unit is active and that glMatrixMode is GL_TEXTURE.
One VBO per object is one valid way of doing it, if you don't have too many different object types, if you can sort by object type before drawing, and if you implement some state-change filtering on glBindBuffer. A few tens of buffer object changes per frame aren't going to show up significantly on any performance graph (you're more likely to bottleneck on transforms and fillrate); a few hundred are more likely to be a problem.
For sprites, however, and as you've deduced, it's not a good idea. 4 vertices per buffer isn't a great ratio, and if you're animating as well, you may even need to refill all of these buffers dynamically each frame. That's not the kind of use case that buffer objects were designed for.
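The "state-change filtering" mentioned above could look roughly like the following. This is only a sketch: BindCache, bind_cache_bind and fake_bind are made-up names, and the function pointer stands in for a real glBindBuffer call on a single target (such as GL_ARRAY_BUFFER) so that the snippet compiles without a GL context.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int BufferId;        /* stand-in for GLuint */
typedef void (*BindFn)(BufferId id);  /* stand-in for glBindBuffer on one target */

typedef struct {
    BufferId bound;       /* buffer most recently forwarded to the driver */
    int      valid;       /* 0 until the first bind has happened */
    BindFn   bind;        /* the "real" bind call */
    unsigned real_binds;  /* how many binds were actually forwarded */
} BindCache;

void bind_cache_init(BindCache *c, BindFn bind)
{
    assert(c != NULL && bind != NULL);
    c->bound = 0;
    c->valid = 0;
    c->bind = bind;
    c->real_binds = 0;
}

/* Binds `id` only if it differs from the cached binding.
   Returns 1 if the call was forwarded, 0 if it was filtered out. */
int bind_cache_bind(BindCache *c, BufferId id)
{
    if (c->valid && c->bound == id)
        return 0;              /* redundant state change, skip it */
    c->bind(id);
    c->bound = id;
    c->valid = 1;
    c->real_binds++;
    return 1;
}

/* Hypothetical placeholder; real code would call glBindBuffer here. */
void fake_bind(BufferId id) { (void)id; }
```

If the draw list is sorted by object type first, consecutive objects share a buffer and most calls fall into the early-return path. Note that the cache goes stale if anything else binds a buffer behind its back, so all binds would have to go through it.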
That's not to say that you shouldn't use buffer objects at all for drawing sprites, just that drawing them is a little bit more complex than "just use a VBO". You need to implement some buffer object streaming in order to do this and make it efficient. The link I've provided gives a good overview of the concept, several implementation possibilities, and useful links for further reading.
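To give a rough idea of what buffer object streaming involves, here is a sketch of the bookkeeping for one common variant: append each batch into one large buffer and "orphan" the storage when it fills up. It is not taken from the linked overview; StreamBuffer and stream_reserve are made-up names, and the GL calls that would do the actual work appear only in comments so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_FAIL ((size_t)-1)

typedef struct {
    size_t   capacity;  /* total size of the buffer object, in bytes */
    size_t   offset;    /* next free byte */
    unsigned orphans;   /* how many times the storage was re-specified */
} StreamBuffer;

void stream_init(StreamBuffer *s, size_t capacity)
{
    assert(s != NULL && capacity > 0);
    s->capacity = capacity;
    s->offset = 0;
    s->orphans = 0;
}

/* Reserves `bytes` and returns the byte offset to write them at,
   or STREAM_FAIL if the request can never fit. */
size_t stream_reserve(StreamBuffer *s, size_t bytes)
{
    size_t at;
    if (bytes > s->capacity)
        return STREAM_FAIL;
    if (s->offset + bytes > s->capacity) {
        /* Out of room: real code would re-specify ("orphan") the storage
           here, e.g. glBufferData(target, capacity, NULL, GL_STREAM_DRAW),
           so the driver can hand back fresh memory instead of stalling. */
        s->offset = 0;
        s->orphans++;
    }
    at = s->offset;
    /* Real code would now upload at this offset, e.g. with glBufferSubData
       or a mapped range, and draw using `at` as the base offset. */
    s->offset += bytes;
    return at;
}
```

With this scheme all sprites share one buffer, so the per-sprite cost is a small upload at an increasing offset rather than a separate buffer object each.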
If this sounds like it's going to be overkill for what you want to do, then another alternative is to just not use VBOs at all. Client-side vertex arrays are one option, but another is just using glBegin/glEnd. If you don't absolutely have to use VBOs then the latter can be viable, and you should definitely benchmark it and consider it as a possibility. You may read horrible things about it elsewhere, but remember - Quake used it in 1996, it didn't suffer from horrible performance problems on account of using it, and it can easily hit almost 1000fps on modern hardware. So - if vertex counts are low enough to begin with, if fillrate is going to be a bigger bottleneck anyway (as will be the case with sprites), and if you can already hit your performance target without needing to optimize further, glBegin/glEnd can be perfectly OK to use.
First of all, you're actually not handling your own matrices in much of this code. You're using glLoadIdentity/glPushMatrix/glPopMatrix/glTranslate - that's not handling your own, that's using the GL matrix stack.
Secondly, transforms specified by a matrix only apply to objects drawn after you specify those transforms. So your glMultMatrix call at the end is actually doing nothing. The correct sequence is to set a matrix, then draw an object. The object is drawn with the matrix you've just set applied. Setting a matrix after you draw an object has no effect on the object just drawn. Objects are drawn based on current state, and future state doesn't affect previously drawn objects.
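The ordering rule can be shown with a toy state machine. This is not OpenGL, just an illustration: State plays the role of the current matrix (reduced to a 2D translation) and draw_point is a made-up function that applies whatever state is current at the moment of the call.

```c
#include <assert.h>

typedef struct { float tx, ty; } State;  /* stand-in for the current matrix */
typedef struct { float x, y; } Drawn;    /* where a point ended up */

/* "Draws" a point using the state that is current right now. */
Drawn draw_point(const State *s, float x, float y)
{
    Drawn d;
    d.x = x + s->tx;
    d.y = y + s->ty;
    return d;
}

void ordering_demo(void)
{
    State s = { 0.0f, 0.0f };
    Drawn first = draw_point(&s, 1.0f, 1.0f);   /* drawn with the old state */
    s.tx = 5.0f;                                /* state set AFTER the draw */
    Drawn second = draw_point(&s, 1.0f, 1.0f);  /* only this one is affected */

    assert(first.x == 1.0f);
    assert(second.x == 6.0f);
}
```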
Thirdly, your mixture of row-major/column-major (RM/CM) storage and post/pre-multiplication is a recipe for disaster. You'll find things a lot easier to understand and debug in the future (and you'll have cleaner code) if you pick one convention and stick to it. Mixing multiple conventions means that you are going to get things wrong at some point - this isn't "if", it's "when". And when it happens you'll need to disentangle your mess and hope you remember which convention is used (and which is expected) at each part of your code in order to troubleshoot. If that sounds horrible, it's because it is. You say that you "need it somehow independent of the OpenGL style", which indicates to me that you don't fully understand what you're doing here.
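As an illustration of "pick one convention and stick to it", here is a minimal sketch that uses the same layout the fixed-function GL calls expect: column-major storage, column vectors, and a product a*b meaning "apply b first, then a". Mat4 and the mat4_* functions are made-up names for this example, not part of any library.

```c
#include <assert.h>

/* Column-major: the element at row r, column c lives at m[c * 4 + r]. */
typedef struct { float m[16]; } Mat4;

Mat4 mat4_identity(void)
{
    Mat4 out = {{ 1,0,0,0,  0,1,0,0,  0,0,1,0,  0,0,0,1 }};
    return out;
}

Mat4 mat4_translation(float x, float y, float z)
{
    Mat4 out = mat4_identity();
    out.m[12] = x;  /* the translation occupies the fourth column */
    out.m[13] = y;
    out.m[14] = z;
    return out;
}

Mat4 mat4_scale(float s)
{
    Mat4 out = mat4_identity();
    out.m[0] = out.m[5] = out.m[10] = s;
    return out;
}

/* Returns a * b. Applied to a column vector, b acts first, then a. */
Mat4 mat4_mul(Mat4 a, Mat4 b)
{
    Mat4 out;
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[k * 4 + r] * b.m[c * 4 + k];
            out.m[c * 4 + r] = sum;
        }
    return out;
}

/* Transforms the point (in[0], in[1], in[2], 1). */
void mat4_transform_point(const Mat4 *m, const float in[3], float out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = m->m[0 * 4 + r] * in[0] + m->m[1 * 4 + r] * in[1]
               + m->m[2 * 4 + r] * in[2] + m->m[3 * 4 + r];
}

void convention_demo(void)
{
    const float p[3] = { 1.0f, 1.0f, 1.0f };
    float q[3];
    /* T * S: scale first, then translate. */
    Mat4 ts = mat4_mul(mat4_translation(10.0f, 0.0f, 0.0f), mat4_scale(2.0f));
    mat4_transform_point(&ts, p, q);
    assert(q[0] == 12.0f && q[1] == 2.0f && q[2] == 2.0f);
}
```

A matrix built this way can be passed to glLoadMatrixf or glMultMatrixf unchanged, since those functions read 16 floats in column-major order. The point is not that this convention is better than the alternatives, only that every function in the module agrees on it.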
And finally - that glFlush at the end of your render function? Have you got a single-buffered context? If so, get rid of it and create a proper double-buffered one (no, it's not more complex), replacing your glFlush with the appropriate SwapBuffers call (see your API or framework documentation for this).