Usually, if you can, sort by shader.
To be brutally honest, I don't think anyone will tell you "4000 render calls with shader changes, bindings and matrix operations is just fine".
It's not. It's not even "not fine".
So clean your code up a little, figure out what goes where, and try sorting by shader.
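To make the sorting idea concrete, here is a rough sketch. I don't know what your renderer looks like, so the DrawCall struct, its fields, and both function names are made up for illustration. The point is just that once calls with the same shader sit next to each other, you only need to bind a shader when it actually changes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw record; the field names are assumptions, not your API.
struct DrawCall {
    unsigned shaderId;  // which shader program this object needs
    unsigned meshId;    // which mesh to draw
};

// Reorder the list so that calls sharing a shader are adjacent.
// stable_sort keeps the original relative order within each shader group.
inline void sortByShader(std::vector<DrawCall>& calls) {
    std::stable_sort(calls.begin(), calls.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.shaderId < b.shaderId;
                     });
}

// Count how many shader binds are needed if the list is drawn in order
// and the shader is only rebound when it differs from the previous call.
inline std::size_t countShaderBinds(const std::vector<DrawCall>& calls) {
    std::size_t binds = 0;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        if (i == 0 || calls[i].shaderId != calls[i - 1].shaderId) {
            ++binds;
        }
    }
    return binds;
}
```

In a real render loop you would do the actual bind where the counter is incremented. If you also have many textures, sorting by shader first and texture second is the same idea with a two-part comparison.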
Then start rendering larger objects in fewer calls where possible.
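One way to get fewer calls (an assumption on my part, it only really suits geometry that doesn't move every frame) is to merge small meshes that share a shader into one big mesh and draw that once. A minimal sketch, with a made-up Mesh struct and appendToBatch helper:

```cpp
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical CPU-side mesh: positions plus triangle indices.
struct Mesh {
    std::vector<Vec3> positions;
    std::vector<std::uint32_t> indices;
};

// Append `src`, moved by `offset`, onto the end of `batch`.
// Indices are rebased so they still point at the right vertices.
inline void appendToBatch(Mesh& batch, const Mesh& src, Vec3 offset) {
    const auto base = static_cast<std::uint32_t>(batch.positions.size());
    for (const Vec3& p : src.positions) {
        batch.positions.push_back({p.x + offset.x, p.y + offset.y, p.z + offset.z});
    }
    for (std::uint32_t i : src.indices) {
        batch.indices.push_back(base + i);
    }
}
```

You would build the batch once, upload it as a single buffer, and draw it with one call instead of one call per object. The translation is baked into the vertices, so no per-object matrix is needed for those objects.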
Also, if you are going the large-number-of-objects route, extend your matrix library to do a lot fewer multiplications where they aren't needed.
This isn't hard to do, and if you already know how the matrices work it should take you five minutes to write translateXZ/XY/ZY, since translations are the most common call.
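Here is roughly what I mean by translateXZ. This is a sketch assuming a column-major 4x4 float matrix (OpenGL-style layout, element index = column * 4 + row) and a translation applied by post-multiplying, M = M * T. The Mat4 struct and function names are placeholders for whatever your library uses. Multiplying by a translation matrix only changes the last column, so there is no reason to run a full 4x4 multiply:

```cpp
// Hypothetical column-major 4x4 matrix: m[column * 4 + row].
struct Mat4 {
    float m[16];
};

inline Mat4 identity() {
    Mat4 r{};  // all zeros
    r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
    return r;
}

// Equivalent to M = M * translation(tx, 0, tz).
// Only the fourth column changes:
//   col3 += tx * col0 + tz * col2
// That is 8 multiplications instead of the 64 in a general 4x4 multiply.
inline void translateXZ(Mat4& M, float tx, float tz) {
    for (int row = 0; row < 4; ++row) {
        M.m[12 + row] += tx * M.m[row] + tz * M.m[8 + row];
    }
}
```

translateXY and translateZY follow the same pattern, just using the other pair of columns. If your library stores matrices row-major or pre-multiplies, the indices change but the idea is identical.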
Also, if you start profiling and find out that the bottleneck is elsewhere, you can post again with the problem area and maybe we can figure something out there as well.
You never know.