I'll try to disentangle this a bit for you.
- Suppose you have a microwave. You would typically have a file where the model is saved. In that file the origin (in object space) could be placed at the lower left front corner, for example, or the middle of the lower surface of its bounding box, or simply the average of all its vertices, or wherever you like.
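As a small sketch of that idea (the vertex data and the choice of origin are hypothetical, using numpy): re-centering a model so its origin sits at the middle of the lower surface of its bounding box:

```python
import numpy as np

# Hypothetical object-space vertices of a tiny box-shaped "microwave" model.
verts = np.array([
    [2.0, 1.0, 5.0], [4.0, 1.0, 5.0], [2.0, 3.0, 5.0], [4.0, 3.0, 5.0],
    [2.0, 1.0, 7.0], [4.0, 1.0, 7.0], [2.0, 3.0, 7.0], [4.0, 3.0, 7.0],
])

lo, hi = verts.min(axis=0), verts.max(axis=0)
# New origin: middle of the lower surface of the bounding box
# (center in x and z, minimum in y).
origin = np.array([(lo[0] + hi[0]) / 2, lo[1], (lo[2] + hi[2]) / 2])
recentered = verts - origin
print(recentered.min(axis=0))  # bottom face now touches y = 0
```

The same trick with `verts.mean(axis=0)` instead would put the origin at the average of all points.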
- Now you place the microwave onto a table, the table stands in a kitchen, the kitchen is part of a house, and the house sits somewhere in the world. That gives a whole hierarchy of matrices to apply to convert the points of the microwave into world space (that's what people who build a scene graph do). More likely, though, you would simplify this by combining all of them into a single model matrix per object, which the GPU can use for the transform, or by placing the microwave and every other object directly in world space, with the origin in the middle or in one corner of your map, or wherever you like.
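A minimal sketch of collapsing such a hierarchy into one model matrix (all the offsets are made up, and I use only translations with the column-vector convention to keep it short):

```python
import numpy as np

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix (column-vector convention)."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# Hypothetical hierarchy: house in the world, kitchen in the house,
# table in the kitchen, microwave on the table.
world_from_house   = translation(100.0, 0.0, 50.0)
house_from_kitchen = translation(5.0, 0.0, 2.0)
kitchen_from_table = translation(1.0, 0.0, 1.0)
table_from_micro   = translation(0.0, 0.9, 0.0)  # table height

# Collapse the whole chain into a single model matrix once on the CPU...
model = world_from_house @ house_from_kitchen @ kitchen_from_table @ table_from_micro

# ...so a point in microwave object space reaches world space in one step.
p_object = np.array([0.0, 0.0, 0.0, 1.0])
p_world = model @ p_object
print(p_world[:3])  # [106.    0.9  53. ]
```

In a real scene graph the matrices would also contain rotations and scales, but the composition works exactly the same way.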
- Now you want to look at the world, so you place a camera somewhere in world space. From the camera's position and orientation you calculate the view matrix, which transforms world space into view space, where the camera is always at the origin.
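A sketch of building such a view matrix in the style of `gluLookAt` (the camera placement is hypothetical; this follows the usual OpenGL convention where the camera looks down -z in view space):

```python
import numpy as np

def look_at(eye, target, up):
    """View matrix: transforms world space into view space."""
    eye, target, up = map(np.asarray, (eye, target, up))
    f = target - eye
    f = f / np.linalg.norm(f)                     # forward
    s = np.cross(f, up); s = s / np.linalg.norm(s)  # right
    u = np.cross(s, f)                            # corrected up
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = s, u, -f
    view[:3, 3] = -view[:3, :3] @ eye  # rotate, then move the eye to the origin
    return view

# Hypothetical camera 10 units back on +z, looking at the world origin.
view = look_at([0.0, 0.0, 10.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0])

# The camera's own position maps to the origin of view space.
cam = np.array([0.0, 0.0, 10.0, 1.0])
print(view @ cam)  # [0. 0. 0. 1.]
```

Anything the camera looks at ends up in front of it, i.e. at negative z in view space.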
- As a further simplification, the view matrix and the model matrix are often multiplied together once on the CPU to get the modelview matrix, and only this product is handed to the GPU.
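The pre-multiplication itself is just this (toy translation matrices, column-vector convention, all values hypothetical):

```python
import numpy as np

def translation(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

model = translation(106.0, 0.9, 53.0)    # object -> world (hypothetical)
view  = translation(-100.0, 0.0, -50.0)  # world -> view (hypothetical)

# Multiply once on the CPU...
modelview = view @ model

# ...and per vertex the GPU applies one matrix instead of two.
p = np.array([0.0, 0.0, 0.0, 1.0])
assert np.allclose(modelview @ p, view @ (model @ p))
print((modelview @ p)[:3])  # [6.  0.9 3. ]
```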
- Then you build the projection matrix, which squeezes the visible world into a small cube and creates the illusion of farther away objects being smaller by squeezing them more.
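A sketch of a `gluPerspective`-style projection, plus the perspective divide that does the actual "squeezing more with distance" (the field of view and the two test points are made up):

```python
import numpy as np

def perspective(fovy_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix."""
    f = 1.0 / np.tan(np.radians(fovy_deg) / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = 2.0 * far * near / (near - far)
    m[3, 2] = -1.0
    return m

proj = perspective(90.0, 1.0, 0.1, 100.0)

def to_ndc(p):
    """Project, then divide by w (the perspective divide)."""
    q = proj @ p
    return q[:3] / q[3]

# Two points of the same height in view space, one near and one far.
near_pt = np.array([0.0, 1.0, -2.0, 1.0])
far_pt  = np.array([0.0, 1.0, -20.0, 1.0])
print(to_ndc(near_pt)[1], to_ndc(far_pt)[1])  # the far point ends up smaller
```

The divide by w is what shrinks distant geometry: the projection matrix copies -z into w, so the farther away a point is, the more it gets divided down.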
- Finally you see one slice of that cube stretched onto the available screen space (that's what the viewport settings are for).
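That last stretch can be sketched as well; this maps normalized device coordinates (the [-1, 1] cube) to window pixels the way `glViewport(x0, y0, width, height)` sets it up (the 800x600 viewport is hypothetical):

```python
def viewport_transform(ndc_x, ndc_y, x0, y0, width, height):
    """Map NDC ([-1, 1] in x and y) to window coordinates."""
    win_x = x0 + (ndc_x + 1.0) * width / 2.0
    win_y = y0 + (ndc_y + 1.0) * height / 2.0
    return win_x, win_y

# Hypothetical 800x600 viewport starting at (0, 0).
center = viewport_transform(0.0, 0.0, 0, 0, 800, 600)
corner = viewport_transform(-1.0, -1.0, 0, 0, 800, 600)
print(center)  # (400.0, 300.0) -- the middle of the screen
print(corner)  # (0.0, 0.0)     -- the lower left pixel
```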