I'm currently building a little pet-project game engine so I can get a feel for LWJGL, and I'm doing what I can to parallelize it properly.
Right now it's divided into 1 Display thread and 3 Logic threads on my quad core (it's actually set up to always use 1 Display thread and N cores - 1 Logic threads). Under heavy load, if I add 1 more Logic thread to try to soak up the leftover time on the core that the Display thread is using, it messes with the Phaser just enough to drop my fps while only marginally increasing my utilization. Under heavy load I get about 3 cores working solidly and just marginal usage on the core running the display.
The Display code cannot run on different threads, so I'm not using a thread pool for it (if I do, I'll just lose the display context). That is fine, since all it does is send the GL commands to the graphics card. It runs in parallel with the Logic threads, and I get around needing to sync/lock the memory by creating two sets of memory (one read and one write) that swap at the end of each cycle. The threads use a Phaser so that no individual thread gets too far ahead of the others, and so that one thread doesn't swap the memory ahead of the others and start writing over what they are still reading.
Basically, instead of writing
x += y;
I write
Write.x = Read.x + Read.y;
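The read/write swap described above can be sketched roughly as follows. This is only a minimal illustration, not the engine's actual code: the class `DoubleBufferDemo`, the `State` type, and the split of work between the two threads are all made up for the example. It assumes the swap is done in the Phaser's `onAdvance` hook, which runs once per phase in the last thread to arrive, before the others are released.

```java
import java.util.concurrent.Phaser;

// Hypothetical sketch: two logic threads share a read buffer and a write buffer
// that are swapped once per cycle when both threads have arrived at the Phaser.
final class DoubleBufferDemo {
    static final class State { double x; double y; }

    private volatile State read = new State();
    private volatile State write = new State();

    // Two parties (the two logic threads below). onAdvance swaps the buffers.
    private final Phaser phaser = new Phaser(2) {
        @Override
        protected boolean onAdvance(int phase, int registeredParties) {
            State tmp = read;
            read = write;
            write = tmp;
            return false; // false = do not terminate the Phaser
        }
    };

    DoubleBufferDemo(double x0, double y0) {
        read.x = x0;
        read.y = y0;
    }

    // Runs the given number of cycles and returns the final x.
    double run(int cycles) {
        Thread a = new Thread(() -> {
            for (int i = 0; i < cycles; i++) {
                write.x = read.x + read.y;      // instead of x += y
                phaser.arriveAndAwaitAdvance(); // wait for the other thread, then swap
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < cycles; i++) {
                write.y = read.y;               // carry y forward unchanged
                phaser.arriveAndAwaitAdvance();
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return read.x;
    }
}
```

Each thread writes only its own fields, so no locking is needed inside a cycle; every value that must survive a cycle has to be copied into the write buffer, as thread `b` does for `y`.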
Currently it appears to beat the single-threaded version, so I think it is going in the right direction, but I'm keeping a single-threaded test around just to make sure for now.
I now want to add a texture loader. Because GL commands can only be run through the Display object, and the Display cannot be accessed from the Logic threads without losing the Display context, I am in part confined to running many of the commands on the Display thread. This is fine, since right now the Display thread doesn't do much but draw objects to the screen anyway. However, loading a texture requires that I discover I need a texture, read the image in from the hard drive, construct a texture object, and then load/push it out to the graphics card through the GL commands.
I figure the Logic threads will do the discovery part, either as an on-demand feature or by queuing up load requests at the beginning of a level.
I think I should then try a low-priority worker thread that periodically looks at a ConcurrentLinkedQueue to see if there are any load requests. It will then read in the image file and send a task request to the Display thread through another ConcurrentLinkedQueue. The Display thread would then finish up and load the texture with the remaining GL commands and binding.
What I'm hoping is that the low-priority worker thread doesn't interfere in the same way that adding a 4th Logic thread does on my quad core. I only need 1, since the optimal IO usage with most hard drives is 1 file being read en masse. There might be some side effects on the display, like texture pop-in if it needs more textures while a level is loading, or it could just display the loading screen if there are any pending textures. I figure I'd use my Logic 0 thread to control whether or not the texture worker is even running, since there would be no reason for it to exist as a thread unless there is a reason to expect incoming work, and Logic 0 is where I've been putting control processes.
Does a parallel texture loader like this already exist, and if not, why not? I'd rather not reinvent one if one already exists, and if there is a known pitfall to this method it would be good to know. I might have to sacrifice 1 Logic thread to get this to work right, and I'd rather not if it can be avoided, but if I do it's not too big of a loss. It might seem like a lot on my quad, but it should be a minimal loss on an 8-core Bulldozer.
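A rough sketch of that two-queue hand-off might look like the following. Everything named here (`TextureLoadPipeline`, `PendingUpload`, `Uploader`, `drainUploads`, `demo`) is hypothetical, and the actual GL upload is left behind a stand-in interface because it depends on the engine. It assumes the worker simply polls and sleeps when idle, and that the Display thread drains a bounded number of uploads per frame.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: logic threads -> requests -> worker -> uploads -> display thread.
final class TextureLoadPipeline {
    // Raw file contents waiting for the display thread to push them to the GPU.
    static final class PendingUpload {
        final Path source;
        final byte[] data;
        PendingUpload(Path source, byte[] data) { this.source = source; this.data = data; }
    }

    // Stand-in for the GL calls, which must only be invoked on the display thread.
    interface Uploader { void upload(PendingUpload pending); }

    private final ConcurrentLinkedQueue<Path> requests = new ConcurrentLinkedQueue<>();
    private final ConcurrentLinkedQueue<PendingUpload> uploads = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;

    // Called from any logic thread.
    void request(Path file) { requests.add(file); }

    Thread startWorker() {
        Thread worker = new Thread(this::workerLoop, "texture-loader");
        worker.setPriority(Thread.MIN_PRIORITY);
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    void stop() { running = false; }

    private void workerLoop() {
        while (running) {
            Path file = requests.poll();
            try {
                if (file == null) {
                    Thread.sleep(10); // nothing to do, check again shortly
                } else {
                    uploads.add(new PendingUpload(file, Files.readAllBytes(file)));
                }
            } catch (IOException e) {
                System.err.println("Could not read " + file + ": " + e.getMessage());
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    // Called once per frame on the display thread; returns how many uploads ran.
    int drainUploads(int maxPerFrame, Uploader uploader) {
        int done = 0;
        PendingUpload pending;
        while (done < maxPerFrame && (pending = uploads.poll()) != null) {
            uploader.upload(pending);
            done++;
        }
        return done;
    }

    // Small self-check: loads a 4-byte temp file and returns the bytes "uploaded".
    static int demo() {
        try {
            Path tmp = Files.createTempFile("texture", ".bin");
            Files.write(tmp, new byte[] {1, 2, 3, 4});
            TextureLoadPipeline pipeline = new TextureLoadPipeline();
            pipeline.startWorker();
            pipeline.request(tmp);
            int[] uploadedBytes = {0};
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (uploadedBytes[0] == 0 && System.nanoTime() < deadline) {
                pipeline.drainUploads(2, p -> uploadedBytes[0] += p.data.length);
                Thread.sleep(1);
            }
            pipeline.stop();
            Files.deleteIfExists(tmp);
            return uploadedBytes[0];
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Capping the uploads per frame is one way to keep a burst of finished loads from stalling a single frame. A blocking queue on the request side would avoid the polling sleep, at the cost of departing from the ConcurrentLinkedQueue design described above.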
I wrote a long reply, but accidentally hit 'back' and it got deleted... As a side note, never use frames-per-second numbers to measure performance; use milliseconds-per-frame instead.
Your frame timings actually seem pretty good (until the addition of the 4th thread, but that's overburdening a quad core anyway). Your numbers for 2 and 3 threads are almost at the ideal timings (1/2 and 1/3 of the 1-thread time). However, before you get too worried about the 4th thread not improving performance, you should time how long the display thread takes to perform one frame. If your display thread is taking ~27ms per frame, then it will act as a bottleneck for the other threads once they reach that speed.
Also, I wouldn't worry about the performance impact of a background loading thread. This thread will not be performing CPU-intensive tasks and will spend most of its time sleeping anyway.
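To illustrate why, here is a small sketch of the arithmetic. The class `FrameTime` and the fps figures are made up for the example and are not taken from your numbers. The same added cost per frame shows up as a very different fps drop depending on where you start.

```java
// Hypothetical helper: converts fps figures into per-frame cost in milliseconds.
final class FrameTime {
    static double msPerFrame(double fps) {
        return 1000.0 / fps;
    }

    // Extra milliseconds per frame implied by going from fpsBefore to fpsAfter.
    static double addedCostMs(double fpsBefore, double fpsAfter) {
        return msPerFrame(fpsAfter) - msPerFrame(fpsBefore);
    }

    public static void main(String[] args) {
        // A drop of 450 fps and a drop of 3.75 fps are the same cost (about 1.11 ms).
        System.out.printf("900 -> 450 fps:  %.2f ms%n", addedCostMs(900.0, 450.0));
        System.out.printf("60 -> 56.25 fps: %.2f ms%n", addedCostMs(60.0, 56.25));
    }
}
```

Because fps is the reciprocal of frame time, it is not linear in the work done, which is what makes differences in fps misleading.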
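One way to take that measurement is sketched below, assuming you can wrap the body of one display frame in a `Runnable`. The class `FrameTimer` and its methods are hypothetical names for the example, not part of LWJGL. It is meant to be used from the display thread only, so it does no synchronization.

```java
// Hypothetical sketch: accumulates per-frame timings for a single thread.
final class FrameTimer {
    private long frames;
    private long totalNanos;
    private long worstNanos;

    // Times one frame's worth of work using the monotonic nanosecond clock.
    void time(Runnable frame) {
        long start = System.nanoTime();
        frame.run();
        record(System.nanoTime() - start);
    }

    // Records an already measured frame duration in nanoseconds.
    void record(long nanos) {
        frames++;
        totalNanos += nanos;
        worstNanos = Math.max(worstNanos, nanos);
    }

    double averageMs() {
        return frames == 0 ? 0.0 : totalNanos / 1e6 / frames;
    }

    double worstMs() {
        return worstNanos / 1e6;
    }
}
```

If the average reported for the display thread is close to the frame time you see with 3 logic threads, the display thread is the limit and a 4th logic thread cannot help.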