Then the preferable format should be some kind of binary format (unlike COLLADA, which just points to the assets on disk) that encapsulates the assets' binaries? Am I grasping this wrong?
Yes, a binary format that packs the assets together is preferable.
The theory is that a packed file format contains 'chunks' of raw data (vertex data, audio buffers, etc.) which are laid out exactly as they will be needed in memory. Loading these chunks is a straightforward read/mmap operation, with no further processing required.
In order to know which chunks you need to read/mmap, you also need metadata (basically, an index to the packed file). The metadata is stored in its own chunks, which you read in, process, and then use to load the remaining chunks. The metadata chunks should generally be very small compared to the data chunks, so they are not always stored in binary - I've seen systems that store metadata in JSON.
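To make that concrete, here is a minimal sketch of the idea in Python. The layout is purely an assumption for illustration (a made-up `PACK` signature, a JSON index, then the raw chunks back to back), and `write_pack`, `Pack` and `chunk` are hypothetical names, not any real engine's format or API.

```python
import json
import mmap
import struct

MAGIC = b"PACK"                 # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sI")  # signature, byte length of the JSON index


def write_pack(path, chunks):
    """Write {name: bytes} as header + JSON index + raw chunk data."""
    index = {}
    offset = 0  # offsets are relative to the start of the data area
    for name, data in chunks.items():
        index[name] = {"offset": offset, "size": len(data)}
        offset += len(data)
    index_bytes = json.dumps(index).encode("utf-8")
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, len(index_bytes)))
        f.write(index_bytes)
        for data in chunks.values():
            f.write(data)


class Pack:
    """Memory-map a pack file and hand out chunks without copying them."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        magic, index_size = HEADER.unpack_from(self._map, 0)
        if magic != MAGIC:
            raise ValueError("not a pack file")
        index_end = HEADER.size + index_size
        # The small metadata chunk is the only part that gets parsed.
        self.index = json.loads(self._map[HEADER.size:index_end].decode("utf-8"))
        self._data_start = index_end

    def chunk(self, name):
        """Return a zero-copy view of the named chunk's raw bytes."""
        entry = self.index[name]
        start = self._data_start + entry["offset"]
        return memoryview(self._map)[start:start + entry["size"]]

    def close(self):
        # Any views returned by chunk() must be released before this call.
        self._map.close()
        self._file.close()
```

Usage would look something like this:

```python
write_pack("assets.pack", {
    "vertices": struct.pack("<3f", 0.0, 1.0, 2.0),
    "audio": bytes(range(16)),
})
pack = Pack("assets.pack")
vertices = pack.chunk("vertices")  # raw bytes, ready to hand to the consumer
```

A real format would probably also pad each chunk to whatever alignment its consumer needs, and might compress or version the index; none of that is shown here.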