Thanks for the tips, mhagain. I will try glMapBufferRange, and I would consider upgrading to OpenGL 4.3 if glInvalidateTexSubImage would alleviate this problem. The implementation is very specialized and doesn't need to work on a wide variety of (older) hardware.
I actually came to this point from using raw glTexSubImage2D calls. They are unfortunately slower than using the PBOs. What takes about 11-12 ms through glMapBuffer + glTexSubImage2D is closer to 18 ms when skipping the PBOs and just making straight glTexSubImage2D calls.
I am using GL_BGRA for exactly the reasons you described. In particular, I read that PBOs hate GL_RGB. Interestingly enough, though, I tried GL_RGB with the straight glTexSubImage2D calls (my source video frame data is 24-bit, so I figured I'd try it to avoid the client-side conversion to 32-bit). In my tests, glTexSubImage2D with a 32-bit GL_BGRA image was actually 25% slower than with the 24-bit GL_RGB image, so I guess it was limited more by transfer bandwidth than by pixel format conversion speed?
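To put rough numbers on the bandwidth guess, here is a small sketch in C that compares the bytes per frame for the two formats. The 1920x1080 frame size is only an assumption for illustration (my post does not depend on it), and the helper names `row_stride` and `frame_bytes` are made up. The alignment parameter mirrors GL_UNPACK_ALIGNMENT, which defaults to 4 and pads each row of client data to a multiple of that many bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one row of client pixel data, rounded up to the unpack
 * alignment (GL_UNPACK_ALIGNMENT defaults to 4). Hypothetical helper. */
static size_t row_stride(size_t width, size_t bytes_per_pixel, size_t alignment)
{
    assert(alignment > 0);
    size_t raw = width * bytes_per_pixel;
    return (raw + alignment - 1) / alignment * alignment;
}

/* Total bytes handed to glTexSubImage2D for one full frame. */
static size_t frame_bytes(size_t width, size_t height,
                          size_t bytes_per_pixel, size_t alignment)
{
    return row_stride(width, bytes_per_pixel, alignment) * height;
}

/* Assumed 1920x1080 frame:
 *   24-bit GL_RGB : frame_bytes(1920, 1080, 3, 4) = 6,220,800 bytes
 *   32-bit GL_BGRA: frame_bytes(1920, 1080, 4, 4) = 8,294,400 bytes
 * The 32-bit frame is one third larger. */
```

If the upload were purely bandwidth-bound, the 32-bit path would be about 33% slower, which is in the same ballpark as the 25% I measured, so the guess seems at least plausible.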
As an update to this: after doing some additional reading, I wonder if I would benefit from having more than two PBOs. Would creating 3-6 of them in a circular buffer pattern reduce the chances of a sync stall, since any buffer being mapped for writing would not have been used for 2-5 frames?
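The rotation I have in mind can be sketched as plain index arithmetic in C. This is not tested against a real driver. The names `PboSlots`, `pbo_slots_for_frame` and `frames_since_last_use` are hypothetical, and the GL calls appear only in comments to show where each index would be used.

```c
#include <assert.h>

/* Which PBO plays which role on a given frame. Hypothetical type. */
typedef struct {
    unsigned upload_index; /* bind this PBO and call glTexSubImage2D from it */
    unsigned write_index;  /* bind this PBO, glMapBuffer it, write the next frame */
} PboSlots;

/* Rotate through `count` PBOs. With count == 2 this is the usual
 * ping-pong scheme; larger counts widen the gap before a buffer is reused. */
static PboSlots pbo_slots_for_frame(unsigned long frame, unsigned count)
{
    assert(count >= 2);
    PboSlots slots;
    slots.upload_index = (unsigned)(frame % count);
    slots.write_index  = (unsigned)((frame + 1) % count);
    return slots;
}

/* Frames elapsed since the buffer now being mapped was last used as
 * the source of a texture upload. */
static unsigned frames_since_last_use(unsigned count)
{
    assert(count >= 2);
    return count - 1;
}
```

With 3-6 buffers, `frames_since_last_use` gives the 2-5 frame gap mentioned above. The cost is extra latency between filling a buffer and seeing it on screen, plus the extra buffer memory, and whether the wider gap actually removes the stall still depends on the driver.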