I can pretty much guarantee where the slowdown is. It's not in the matrix multiplication, it's not in binding UBOs to the pipeline. The OP is doing a separate UBO update for each object drawn. That's potentially tens, hundreds or thousands of UBO updates per frame.
The slowdown is in GL's buffer object API, which simply can't sustain updates at that frequency. Any profile is going to show a huge amount of time spent in the driver waiting for buffer object API calls to finish, waiting on CPU/GPU synchronization, and waiting on GL client/server synchronization.
The solution is to not use small UBOs and to not update per object. Instead, create a single UBO large enough to hold the data for all objects, work out everything that needs updating ahead of time, do one big UBO update per frame (preferably via glBufferSubData), and then make one glBindBufferRange call per object as you draw. That runs fast, and in the absence of persistent mapping it's the only way to get performance out of UBOs.
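To make that concrete, here's a rough sketch in C of the CPU-side packing step, not a drop-in implementation. The ObjectUniforms struct, the helper names align_up and pack_objects, the staging buffer and binding point 0 are all made up for illustration. One detail worth knowing: the offset passed to glBindBufferRange for a uniform buffer has to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, so each object's slot gets padded to that stride. The GL calls are left in a comment so the snippet compiles without GL headers or a context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-object data: one 4x4 model matrix (64 bytes). */
typedef struct {
    float model[16];
} ObjectUniforms;

/* Round size up to the next multiple of alignment. */
static size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

/* Copy every object's data into one staging buffer, one slot per object,
 * with slots spaced stride bytes apart. Returns the number of bytes to upload. */
static size_t pack_objects(unsigned char *staging, size_t capacity,
                           const ObjectUniforms *objects, size_t count,
                           size_t stride)
{
    size_t total = count * stride;
    assert(stride >= sizeof(ObjectUniforms));
    assert(total <= capacity);
    for (size_t i = 0; i < count; ++i)
        memcpy(staging + i * stride, &objects[i], sizeof objects[i]);
    return total;
}

/* Per frame, with a GL context current (assumed names: ubo, staging,
 * capacity, objects, count, draw_object):
 *
 *   GLint align;
 *   glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
 *   size_t stride = align_up(sizeof(ObjectUniforms), (size_t)align);
 *   size_t total  = pack_objects(staging, capacity, objects, count, stride);
 *
 *   // One upload for the whole frame.
 *   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
 *   glBufferSubData(GL_UNIFORM_BUFFER, 0, (GLsizeiptr)total, staging);
 *
 *   // One cheap rebind per object, no buffer update.
 *   for (size_t i = 0; i < count; ++i) {
 *       glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo,
 *                         (GLintptr)(i * stride), sizeof(ObjectUniforms));
 *       draw_object(i);   // hypothetical: issues the draw call for object i
 *   }
 */
```

The UBO itself would be allocated once up front (glBufferData with a size of at least count * stride), so the per-frame work is a single glBufferSubData plus the per-object rebinds.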