An object itself does not have bounds. If I attach a DirLight component to an object, what would the object's AABB be? Or I might attach multiple components to the same object, e.g. a RenderableMesh and a Collider. The Collider can be "bigger" than the actual mesh, which means the two components' bounds differ. That was the reason I decided to store the components and not the objects.
Since QuadTrees are used for collision detection (object-object, object-line, object-frustum, etc.), the collision bounds should be used to place the object in the tree. That means the Collider component provides the position and size used to place the object in the QuadTree.
Or do you want to use the tree for a different purpose?
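To make the "Collider bounds win" idea concrete, here is a minimal sketch. All the names (`AABB`, `GameObject`, `treeBounds`, the `std::optional` members standing in for attached components) are hypothetical, not from your codebase:

```cpp
#include <optional>

// Hypothetical minimal AABB.
struct AABB { float minX, minY, maxX, maxY; };

// Stand-in for a game object: each optional represents the bounds
// contributed by an attached component, if that component exists.
struct GameObject {
    std::optional<AABB> colliderBounds;  // from a Collider component
    std::optional<AABB> meshBounds;      // from a RenderableMesh component
};

// Bounds used for QuadTree placement: prefer the Collider's AABB,
// since the tree serves collision queries; fall back to the mesh.
// A DirLight-only object simply has no spatial bounds.
std::optional<AABB> treeBounds(const GameObject& go) {
    if (go.colliderBounds) return go.colliderBounds;
    return go.meshBounds;
}
```

An object with no spatial component just never enters the tree, which sidesteps the "what is a DirLight's AABB" question.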
As I understand QuadTrees, they contain object ids or references to game objects, not components. To me, putting a component into a QuadTree is like a grin without a cat. You can access any component through the game object and the appropriate container. That is one extra indirection compared to your solution, but I don't think it's much slower, especially because it doesn't need to store so much extra data.
Of course you can put some convenience methods in the QuadTree which return components, but it should use the game object and the component containers to do that.
A small optimization is to store the AABB alongside the game object reference in the QuadTree; it can simplify the code and make it a bit faster.
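A sketch of that entry layout, with hypothetical names (`QuadTreeEntry`, `query`; a flat node stands in for a real tree node):

```cpp
#include <cstdint>
#include <vector>

struct AABB { float minX, minY, maxX, maxY; };

using GameObjectId = std::uint32_t;

// Leaf entry: the object reference plus its cached bounds, so most
// queries never have to touch the component containers at all.
struct QuadTreeEntry {
    GameObjectId object;
    AABB bounds;  // copied in on insert; must be refreshed when the object moves
};

bool overlaps(const AABB& a, const AABB& b) {
    return a.minX <= b.maxX && b.minX <= a.maxX &&
           a.minY <= b.maxY && b.minY <= a.maxY;
}

// Query over one (flattened) node: returns ids only; callers go back
// through the game object and component containers for anything else.
std::vector<GameObjectId> query(const std::vector<QuadTreeEntry>& node,
                                const AABB& area) {
    std::vector<GameObjectId> hits;
    for (const auto& e : node)
        if (overlaps(e.bounds, area)) hits.push_back(e.object);
    return hits;
}
```

The cached AABB does mean you have to update the entry whenever the object moves, so it trades a little bookkeeping for faster queries.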
For the layers, one solution is to create a QuadTree for every layer plus a common QuadTree containing all of them, but that sounds like overkill. The best thing you can do here is measure which is faster in your case: a single QuadTree or separate per-layer QuadTrees.
On ARM processors the alignment of data is important: reads and writes are much faster at aligned addresses. So a 2-byte value should sit at an even address, and a 4-byte value at an address divisible by 4. There are instruction sequences for unaligned reads/writes, but they are different, longer, and slower. The compiler has to figure out which sequence to use, and it generally does a very good job, but occasionally it fails.
These failing cases always involve a reinterpret_cast; in your case that is the C-style (int*) cast. That kind of cast is typically needed when loading/saving data or sending/receiving it over the network. Elsewhere it is usually just bad design and should be avoided.
So I recommend concentrating all the code that uses reinterpret_cast into one or a few classes, and handling the problem there.
And you can create a function for this kind of reading/writing, similar to your doLittleBigEndianConversion. Like this:
float value = doLittleBigEndianConversion(doUnalignedReading(reinterpret_cast<int*>(ucharPtr + offset)));
In that function just use the memcpy trick or read the data byte by byte and assemble it with |.
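A sketch of both variants; `doUnalignedReading` and `readLE32` are names I made up to match the snippet above, assuming 32-bit reads from a byte buffer:

```cpp
#include <cstdint>
#include <cstring>

// Reads a 32-bit value from a possibly unaligned address.
// memcpy lets the compiler emit the correct code for the target:
// usually a single load on x86, and the proper unaligned-load
// sequence on ARM. No reinterpret_cast needed at the call site.
std::uint32_t doUnalignedReading(const unsigned char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// The byte-by-byte alternative, assembled with | (here assuming the
// buffer holds little-endian data). This also fixes the byte order,
// so it can replace the separate endian-conversion step.
std::uint32_t readLE32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

The memcpy version returns the value in host byte order (so you still apply your endian conversion afterwards), while the shift-and-or version interprets the bytes with a fixed byte order regardless of the host.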