- Textures are taking 90 seconds to load [900 textures totaling 1056 MB]
- 820 MB [509 files] of those textures are DDS: power-of-2 dimensions, DXT5 compressed
- 242 MB [392 files] are PNG: arbitrary, non-power-of-2 dimensions
It would be better not to load all of these in one monolithic block; if you can design the loader to bring them in asynchronously at a less critical stage, so much the better for your game.
That said, let's check some things:
Are you loading the DDS images directly into a single location in memory using the fastest operations and a single call? That is, if you are reading and parsing the DDS images as you go along, that is the wrong approach.

Since you are on Windows, that means using asynchronous IO and callbacks. Open the files with CreateFile using the FILE_FLAG_OVERLAPPED and FILE_FLAG_SEQUENTIAL_SCAN flags, and use ReadFile to read each entire file into a single buffer. Send all 509 file requests to the OS at once and it will intelligently reorder them to minimize disk seeks, picking up pieces of other files as the heads move across the platter. Each request reports back when its file is done.
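As a rough, portable sketch of that fire-everything-at-once pattern (using std::async and fread here purely for illustration; on Windows you would substitute the CreateFile/ReadFile overlapped calls described above, and the file paths are whatever your asset list contains):

```cpp
#include <cstdio>
#include <future>
#include <string>
#include <vector>

// Read one whole file into a single buffer with one read call,
// mirroring the "one buffer, one ReadFile per file" idea above.
// (On Windows, open with CreateFile + FILE_FLAG_OVERLAPPED |
// FILE_FLAG_SEQUENTIAL_SCAN and issue an overlapped ReadFile instead.)
static std::vector<char> ReadWholeFile(const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return {};
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<char> buffer(size > 0 ? static_cast<size_t>(size) : 0);
    if (!buffer.empty())
        std::fread(buffer.data(), 1, buffer.size(), f);
    std::fclose(f);
    return buffer;
}

// Issue every request up front, then collect results as they complete;
// the OS is free to reorder the actual disk work to minimize seeks.
std::vector<std::vector<char>> LoadAll(const std::vector<std::string>& paths) {
    std::vector<std::future<std::vector<char>>> pending;
    pending.reserve(paths.size());
    for (const auto& p : paths)
        pending.push_back(std::async(std::launch::async, ReadWholeFile, p));
    std::vector<std::vector<char>> results;
    results.reserve(pending.size());
    for (auto& f : pending)
        results.push_back(f.get());
    return results;
}
```

The key point is that no request waits for the previous one to finish; all 509 are in flight before you start collecting results.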
If you do that, loading the 820 MB of textures should approach whatever the ideal transfer speed is from your disk. For a spindle drive that may be anywhere from 15 to 40 seconds. If instead your machine has a quality modern SSD, the time could be about three seconds.
For your ~400 PNG files, the most direct answer is "Don't Do That!" DDS data can be used directly as loaded from disk, but PNG files need to be decoded, meaning you need time and space to process each one. Convert them to DDS as a build step if at all possible so you avoid the enormous runtime hit.
The exact decompression time will depend on the details of the image and the library.
If for some reason you absolutely must ship PNG files rather than DDS files, parallelism is your friend. Make sure every processor is working. Depending on the libraries you are using, that likely means running more than one image decoder per virtual processor. (For strictly processor-bound tasks a 1:1 ratio is often best, but that probably isn't what is taking place here.) The right ratio depends on how the libraries do their work internally: if they spend time in slow operations like blocking file reads and blocking memory allocations, consider going 2x, 3x, 4x or more decoders per processor, since they'll be spending so much time in those non-compute operations.
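A minimal sketch of that oversubscribed worker pool follows. The DecodePng stub is hypothetical; substitute your actual library's decode call (stb_image, libpng, etc.), and tune threadsPerCore to the 2x-4x range discussed above:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for a real PNG decoder; replace with your
// library's call. It only exists to make the pool structure concrete.
struct Image { std::vector<unsigned char> pixels; };
static Image DecodePng(int /*fileIndex*/) { return Image{ {0, 0, 0, 255} }; }

// Decode a batch of PNGs on a pool of workers pulling file indices
// from a shared atomic counter. Oversubscribing (threadsPerCore > 1)
// helps when decoders block on IO or allocations rather than compute.
std::vector<Image> DecodeAll(int fileCount, unsigned threadsPerCore = 2) {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    unsigned workers = std::max(1u, cores * threadsPerCore);
    std::vector<Image> images(fileCount);
    std::atomic<int> next{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Each worker grabs the next undecoded file until none remain.
            for (int i = next.fetch_add(1); i < fileCount; i = next.fetch_add(1))
                images[i] = DecodePng(i);
        });
    }
    for (auto& t : pool) t.join();
    return images;
}
```

The atomic counter gives you free load balancing: a worker stuck on a large image simply claims fewer files, while the others keep the cores busy.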
If at all possible, use DDS files loaded directly into memory and not parsed.