I see two approaches to frustum culling that are the most efficient:
* Brute force frustum culling on GPU feeding indirect draws.
* Hierarchical frustum AND occlusion culling on a single CPU thread.
Which is better depends on the scene; if you have both indoor and outdoor scenes, probably a combination of coarse occlusion culling and fine-grained frustum culling.
Occlusion culling is difficult and may require a low-poly version of the scene, but even if it's not suitable for multithreading, it's very efficient.
E.g. I use a system whose software rasterizer generates spans instead of pixels, so it's much more efficient than stupid GPU occlusion tests.
If the camera is inside a room, the system quickly detects that neither further occlusion testing nor further rendering behind the walls is necessary.
If the camera is outdoors on the street, the system is expensive but still a win, because it culls the interiors of buildings except for what I can see through the windows.
However, for current games a system like mine is probably overkill because houses don't have any interior, distant houses may be just a cheap low poly mesh, etc.
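To give an idea of what "spans instead of pixels" can mean, here is a minimal sketch of a coverage buffer for a single scanline. It is not my actual system: the names `Span` and `ScanlineCoverage` are made up for illustration, and it assumes occluders are inserted front to back so no depth values need to be stored.

```cpp
#include <cassert>
#include <vector>

// Half-open horizontal interval [x0, x1) on one scanline (hypothetical type).
struct Span { int x0, x1; };

// Covered intervals of one scanline, kept sorted, disjoint and merged.
class ScanlineCoverage {
public:
    // True if [x0, x1) is completely hidden by spans inserted so far.
    // Touching spans are always merged, so a covered interval lies in one span.
    bool is_covered(int x0, int x1) const {
        for (const Span& s : spans_)
            if (s.x0 <= x0 && x1 <= s.x1) return true;
        return false;
    }

    // Insert an occluder span and merge it with overlapping or touching spans.
    void add(int x0, int x1) {
        assert(x0 < x1);
        std::vector<Span> merged;
        Span cur{x0, x1};
        bool placed = false;
        for (const Span& s : spans_) {
            if (s.x1 < cur.x0) {
                merged.push_back(s);              // entirely left of the new span
            } else if (cur.x1 < s.x0) {
                if (!placed) { merged.push_back(cur); placed = true; }
                merged.push_back(s);              // entirely right of the new span
            } else {
                if (s.x0 < cur.x0) cur.x0 = s.x0; // overlap or touch: grow the new span
                if (s.x1 > cur.x1) cur.x1 = s.x1;
            }
        }
        if (!placed) merged.push_back(cur);
        spans_.swap(merged);
    }

    // Early-out condition: the whole scanline of the given width is occluded.
    bool fully_covered(int width) const {
        return spans_.size() == 1 && spans_[0].x0 <= 0 && spans_[0].x1 >= width;
    }

private:
    std::vector<Span> spans_;
};
```

A wall that fills the screen collapses every scanline to one span, so a "nothing more can be visible" check costs a handful of comparisons instead of a pass over all pixels.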
But the point is, being un-threadable is not necessarily a bad thing - there are always enough other threadable things to run at the same time.
Hierarchical methods are often un-threadable because they rely on early termination, but they are perfectly work efficient for the same reason.
Don't fall into the trap of believing hardware is so fast nowadays that stupid brute force is better just because we have multi-core CPUs and GPUs - that's a dead end.
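As a rough illustration of that early termination, here is a sketch of frustum culling over a bounding volume hierarchy. The `Plane`, `AABB` and `Node` layouts and the function names are assumptions for this example, not taken from any particular engine; planes are assumed to point inward, so a point is inside when `n.p + d >= 0`.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Plane { float nx, ny, nz, d; };    // inside when nx*x + ny*y + nz*z + d >= 0
struct AABB  { float min[3], max[3]; };
struct Node  {
    AABB bounds;
    int  left, right;   // child indices into the node array, -1 if absent
    int  object;        // object id for leaves, -1 for inner nodes
};

enum class Side { Outside, Intersecting, Inside };

// Classify a box against all six planes using its nearest and farthest corners.
Side classify(const AABB& b, const std::array<Plane, 6>& frustum) {
    bool fully_inside = true;
    for (const Plane& p : frustum) {
        const float n[3] = { p.nx, p.ny, p.nz };
        float far_dist = p.d, near_dist = p.d;
        for (int i = 0; i < 3; ++i) {
            far_dist  += n[i] * (n[i] >= 0 ? b.max[i] : b.min[i]);
            near_dist += n[i] * (n[i] >= 0 ? b.min[i] : b.max[i]);
        }
        if (far_dist < 0)  return Side::Outside;   // whole box behind this plane
        if (near_dist < 0) fully_inside = false;   // box straddles this plane
    }
    return fully_inside ? Side::Inside : Side::Intersecting;
}

// Accept a whole subtree without any further plane tests.
void collect_all(const std::vector<Node>& nodes, int index, std::vector<int>& out) {
    if (index < 0) return;
    const Node& n = nodes[index];
    if (n.object >= 0) out.push_back(n.object);
    collect_all(nodes, n.left, out);
    collect_all(nodes, n.right, out);
}

void cull(const std::vector<Node>& nodes, int index,
          const std::array<Plane, 6>& frustum, std::vector<int>& out) {
    if (index < 0) return;
    const Node& n = nodes[index];
    switch (classify(n.bounds, frustum)) {
    case Side::Outside: return;                              // early out: skip subtree
    case Side::Inside:  collect_all(nodes, index, out); return; // early out: take subtree
    case Side::Intersecting:
        if (n.object >= 0) out.push_back(n.object);          // conservative for leaves
        cull(nodes, n.left, frustum, out);
        cull(nodes, n.right, frustum, out);
    }
}
```

Both early outs depend on the result of the parent test, which is exactly what makes the traversal hard to split across threads and at the same time lets it skip most of the scene.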
but I feel that having each worker thread writing pairs of {object_type, object_pointer} into a single array would be inefficient because of locking.
Why don't you use one array per thread?
After the work is done, do a prefix sum over the resulting counts; then you can do a multi-threaded copy to fuse everything into one big array.
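Something like the following sketch, assuming plain `std::thread` workers and a caller-supplied visibility test. `VisibleEntry` mirrors your {object_type, object_pointer} pair; `cull_parallel` and `is_visible` are hypothetical names, and a real engine would use its own job system instead of spawning threads per call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct VisibleEntry {
    int         object_type;
    const void* object_pointer;
};

template <class IsVisible>
std::vector<VisibleEntry> cull_parallel(const std::vector<VisibleEntry>& objects,
                                        std::size_t num_threads, IsVisible is_visible)
{
    assert(num_threads > 0);
    const std::size_t n = objects.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    // Phase 1: every thread culls its own slice into its own array, no locks.
    std::vector<std::vector<VisibleEntry>> per_thread(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, n);
            const std::size_t end   = std::min(begin + chunk, n);
            for (std::size_t i = begin; i < end; ++i)
                if (is_visible(objects[i])) per_thread[t].push_back(objects[i]);
        });
    }
    for (std::thread& w : workers) w.join();
    workers.clear();

    // Phase 2: exclusive prefix sum over the counts gives each thread's offset.
    std::vector<std::size_t> offsets(num_threads + 1, 0);
    for (std::size_t t = 0; t < num_threads; ++t)
        offsets[t + 1] = offsets[t] + per_thread[t].size();

    // Phase 3: every thread copies into its own disjoint range of the result.
    std::vector<VisibleEntry> fused(offsets[num_threads]);
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::copy(per_thread[t].begin(), per_thread[t].end(),
                      fused.begin() + static_cast<std::ptrdiff_t>(offsets[t]));
        });
    }
    for (std::thread& w : workers) w.join();
    return fused;
}
```

The prefix sum itself is serial here, but it only touches one count per thread, so it costs next to nothing. Because each slice is contiguous, the fused array also keeps the original object order.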