I'm surprised VAOs should be slower: not only is it fewer API calls, but the driver can also validate the various buffers upfront and cache that validation. This is admittedly less expensive than in the case of an FBO (which is why it's faster to switch between two FBOs than to add/remove attachments on a single one), but it still necessarily means touching fewer objects spread out in memory, and thus fewer cache misses.
Depends on Valve's usage, to be honest. E.g. a common enough scenario is to keep the same vertex format and layout but change the buffers. Without GL_ARB_vertex_attrib_binding it's not possible to do this without respecifying the entire VAO, so not only is there no caching going on in this scenario, there's also the extra overhead of VAO respecification and revalidation (at which point you may as well not be using VAOs at all).
I highly doubt that Valve are using GL_ARB_vertex_attrib_binding as many AMD cards, and all Intel cards, don't support it, and Valve's products must run on that hardware.
I'd also draw your attention to their earlier observation (in the same presentation) about GL being chatty but efficient, and not to judge a piece of code by its number of calls. It's easy enough to conceive of a single API call that does a lot more work than multiple calls, so it really depends on the amount of work that each API call has to do. If, as I suspect, most vendors implement VAOs primarily as a user-mode software wrapper, with lazy state changes calling into kernel mode to flush changed VAO states to the hardware when a draw call is made, the API overhead of a single call versus multiple calls should really be minimal.
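To make the lazy-flush guess concrete, here's a toy model in C. It is only a sketch of the idea, not how any actual driver works: `LazyContext`, `lazy_bind_vao` and `lazy_draw` are made-up names, and `flush_count` just stands in for the expensive user-to-kernel transition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a driver that defers VAO state changes until draw time. */
typedef struct {
    unsigned bound_vao;   /* what the application last asked for (user mode) */
    unsigned hw_vao;      /* what was last flushed to the "hardware" */
    unsigned flush_count; /* number of simulated expensive flushes */
} LazyContext;

/* Binding only records the request; no expensive work happens here. */
static void lazy_bind_vao(LazyContext *ctx, unsigned vao)
{
    assert(ctx != NULL);
    ctx->bound_vao = vao;
}

/* The draw call is where changed state actually gets flushed, and only if it changed. */
static void lazy_draw(LazyContext *ctx)
{
    assert(ctx != NULL);
    if (ctx->hw_vao != ctx->bound_vao) {
        ctx->hw_vao = ctx->bound_vao;
        ctx->flush_count++;
    }
}
```

In this model, binding three different VAOs in a row and then drawing costs one flush, and redundant binds between draws cost nothing. If real drivers behave anything like this, the call count on its own tells you very little.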
That said, it surprises me they're discouraging MapBuffer, too. In my experience, MapBufferRange is just about the same as CopyBufferSubData, with the difference that you can offload the copy to another thread. And if the GPU sync really bites you as they suggest, there's still MAP_UNSYNCHRONIZED_BIT which you can use as described by Hrabcak and Masserann in Cozzi/Riccio's book. That not only avoids synchronization and lets you offload the copy to another thread, but it also avoids having the driver perform memory allocation and reclamation work.
Surely the Valve guys would know about that technique?
Valve definitely know about this technique because it's the way D3D buffer updates work, so they've been using it in D3D for over 10 years now. It's very straightforward to port D3D discard/no-overwrite code to MapBufferRange (the API calls used match up very well), so they must have another reason for not using MapBufferRange. Again, I'd suggest that the reason is that GL_ARB_map_buffer_range may not be available on all of their target hardware. Raw MapBuffer (i.e. without "Range") has several problems, so BufferSubData is definitely to be preferred over that in cases where MapBufferRange isn't available.
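For anyone who hasn't seen the discard/no-overwrite pattern, here's a rough sketch of the bookkeeping in C. It's my own illustration, not Valve's code: `StreamBuffer`, `MapRequest` and `stream_reserve` are hypothetical names, and the flag constants are local stand-ins given the same values as GL_MAP_WRITE_BIT, GL_MAP_INVALIDATE_BUFFER_BIT and GL_MAP_UNSYNCHRONIZED_BIT so the snippet compiles without GL headers. I'm assuming the usual mapping of "discard" to invalidating the whole buffer and "no-overwrite" to an unsynchronized map.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins mirroring the values of the corresponding GL_MAP_* bits. */
enum {
    MAP_WRITE_BIT             = 0x0002,
    MAP_INVALIDATE_BUFFER_BIT = 0x0008,
    MAP_UNSYNCHRONIZED_BIT    = 0x0020
};

typedef struct {
    size_t capacity; /* total size of the buffer object in bytes */
    size_t cursor;   /* first byte not yet handed out since the last discard */
} StreamBuffer;

typedef struct {
    size_t   offset; /* offset to pass to the map call */
    unsigned flags;  /* access flags to pass to the map call */
} MapRequest;

/* Decide where the next 'size' bytes go and which access flags to map with. */
static MapRequest stream_reserve(StreamBuffer *sb, size_t size)
{
    MapRequest req;
    assert(size <= sb->capacity);
    if (sb->cursor + size > sb->capacity) {
        /* Out of room: "discard". Ask for fresh storage and restart at 0. */
        sb->cursor = 0;
        req.flags = MAP_WRITE_BIT | MAP_INVALIDATE_BUFFER_BIT;
    } else {
        /* Still room: "no-overwrite". We promise not to touch bytes in use. */
        req.flags = MAP_WRITE_BIT | MAP_UNSYNCHRONIZED_BIT;
    }
    req.offset = sb->cursor;
    sb->cursor += size;
    return req;
}
```

The returned offset and flags would then go into something like glMapBufferRange(target, req.offset, size, req.flags), followed by the write and an unmap. The unsynchronized path is only safe because each region is written once between discards, which is exactly the promise D3D's no-overwrite lock asks for.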