What you are seeing is the correct behaviour; there's nothing wrong with your code. A camera will start to roll even if you only input a sequence of pitch and yaw operations, because each rotation is applied in the camera's local coordinate system rather than the global coordinate system.
Imagine holding a camera: point it straight ahead, pitch it upward 90 degrees so it points at the sky, then yaw it 90 degrees to the right. You'll find that it has in a sense "rolled" by 90 degrees, in that the bottom of the camera no longer faces the ground. That's the same effect you're seeing.
If you want to rule out roll entirely, then store running totals of pitch degrees and yaw degrees, and rebuild your view matrix each frame from a pitch rotation followed by a yaw rotation. This gives the more typical behaviour you'd expect from a first-person camera.
Model Matrix (transforms object space to world space)
View Matrix (transforms world space to eye space)
Projection Matrix (transforms eye space to screen space)
The model matrix is just the first of those three. The old fixed-function pipeline combined the model and view matrices into a single modelview matrix because it never needed them separately. If your deferred renderer defines "position" in world space, then you need the model matrix alone to do that transform. If you multiply a vertex by the modelview matrix you'll get its position in eye space, which probably isn't what you want.
All the built-in gl_* matrices (gl_ModelViewMatrix and friends) are deprecated now anyway; you're supposed to manage all the matrices yourself on the client side and upload them to the shader as mat4 uniforms (e.g. via glUniformMatrix4fv).