What I'm still a _little_ hazy about is that you said the projection transform produces clip space. I thought there had to be a clip matrix to transform from projection space to clip space? I was wondering where that happens, because I don't remember setting a clip matrix anywhere. Does DX do it under the hood? Or is it done by some covert function like SetViewPort or something?
I believe the perspective matrix in DirectX is already defined so that any point that falls inside the view frustum ends up, after the division by w, in the range ( -1, -1, 0 ) to ( 1, 1, 1 ). So no additional transform is necessary to map the canonical view volume into clip space, as they are pretty much the same.
This need not be true for every perspective matrix, though. For instance, the simplest example of such a matrix can look something like this:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 1 0
Multiplying a point P( x, y, z, 1 ) by this matrix yields P'( x, y, z, z ), and after homogenization (division by the w-component) is applied, this becomes P''( x/z, y/z, 1, 1 ). P'' is the projection of point P onto the plane z=1, with perspective foreshortening applied (note that x/z and y/z get smaller as the distance from the viewer increases).
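To make the arithmetic above concrete, here is a minimal sketch in plain Python. The helper names (`mat_vec`, `homogenize`, `project`) are made up for illustration, and the point is treated as a column vector multiplied on the right of the matrix, which is the convention the example matrix implies.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def homogenize(v):
    """Divide every component by w (the last component)."""
    w = v[3]
    return [c / w for c in v]

# The simple perspective matrix from the post: it copies z into w.
SIMPLE_PERSPECTIVE = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]

def project(x, y, z):
    """P(x, y, z, 1) -> P'(x, y, z, z) -> P''(x/z, y/z, 1, 1)."""
    return homogenize(mat_vec(SIMPLE_PERSPECTIVE, [x, y, z, 1]))

near = project(2.0, 1.0, 2.0)  # [1.0, 0.5, 1.0, 1.0]
far = project(2.0, 1.0, 4.0)   # [0.5, 0.25, 1.0, 1.0]
```

The same (x, y) lands closer to the origin when z doubles, which is the foreshortening, and z itself always collapses to 1, so depth information is lost.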
Although the basic idea behind the perspective matrix in DirectX is the same, its layout is a bit more complicated. It scales the point's x and y coordinates to the [-1, 1] range. The z-coordinate is also preserved (unlike the example above) and mapped into the [0, 1] range. You can read about DirectX (and OpenGL) projection matrices in greater detail, for example, in this article (section on Perspective projection).
Also, I just worked out a little math yesterday, and I came up with this matrix to get from world to clip space directly. Just as a side note, I'm working in 2D, so Z is always 1. If I have a vector in world space with x, y coords, then in clip space x' = (x/screen_width) * 2 and y' = (y/screen_height) * 2, which would make the vector lie in the 0 to 2 range. Then I subtract 1 from each to put it into canonical space. So the final matrix comes out to
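As a rough illustration of that layout, here is a sketch of a left-handed, field-of-view based perspective matrix of the kind D3DX builds, as far as I understand it. It is written for column vectors to stay consistent with the simple example above, so it is the transpose of the row-vector form DirectX documents. The function names and the parameters `fov_y`, `aspect`, `zn` and `zf` are my own choices for the sketch.

```python
import math

def perspective_fov_lh(fov_y, aspect, zn, zf):
    """Left-handed perspective matrix for column vectors.

    fov_y  - vertical field of view in radians (assumed parameter)
    aspect - view width divided by view height
    zn, zf - near and far plane distances
    """
    y_scale = 1.0 / math.tan(fov_y / 2.0)
    x_scale = y_scale / aspect
    q = zf / (zf - zn)
    return [
        [x_scale, 0.0, 0.0, 0.0],
        [0.0, y_scale, 0.0, 0.0],
        [0.0, 0.0, q, -zn * q],   # maps z in [zn, zf] to [0, 1] after the divide
        [0.0, 0.0, 1.0, 0.0],     # copies z into w, as in the simple matrix
    ]

def project(m, x, y, z):
    """Transform a point and apply the division by w."""
    v = [x, y, z, 1.0]
    cx, cy, cz, cw = (sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
    return (cx / cw, cy / cw, cz / cw)
```

With a 90 degree field of view and a square aspect ratio, a point on the near plane comes out with z = 0, a point on the far plane with z = 1, and a point on the edge of the frustum with x or y equal to -1 or 1.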
2/scr_width 0 -1
0 2/scr_height -1
0 0 1
will this work ok?
Actually, it depends. How do you define your world space, i.e. what units do you use and which direction are the x and y axes pointing? The decisions are more or less arbitrary and will depend mostly on your personal taste, but sticking to them consistently matters a lot. The matrix you've constructed should work if your world units are pixels (sprite positions and scr_width, scr_height are measured in pixels), and you have the X-axis pointing to the right and the Y-axis pointing up. If you prefer working in screen space, with the Y-axis pointing down, then your matrix will draw everything upside down, since y = 0 (the top of the screen) ends up at -1 (the bottom of the canonical view volume). If you only mirror the y-coordinates and keep the -1 translation, the entire screen is translated out of the canonical view volume, so nothing will be visible. To fix that you need to mirror the y-coordinates and translate by +1, not -1, on the y axis:
2/scr_width 0 -1
0 -2/scr_height 1
0 0 1
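To check the two matrices above by hand, here is a small sketch in plain Python. The function names are made up for illustration, points are treated as column vectors (x, y, 1), and the 800 x 600 screen size in the usage note is just an assumed example.

```python
def world_to_clip_y_up(scr_width, scr_height):
    """World units are pixels, X points right, Y points up."""
    return [
        [2.0 / scr_width, 0.0, -1.0],
        [0.0, 2.0 / scr_height, -1.0],
        [0.0, 0.0, 1.0],
    ]

def world_to_clip_y_down(scr_width, scr_height):
    """World units are pixels, X points right, Y points down."""
    return [
        [2.0 / scr_width, 0.0, -1.0],
        [0.0, -2.0 / scr_height, 1.0],
        [0.0, 0.0, 1.0],
    ]

def transform(m, x, y):
    """Apply a 3x3 matrix to the 2D point (x, y, 1); return (x', y')."""
    v = (x, y, 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(2))
```

For an 800 x 600 screen, the Y-up matrix sends (0, 0) to roughly (-1, -1) and (800, 600) to roughly (1, 1). The Y-down matrix sends (0, 0) to roughly (-1, 1), the top-left corner, and (800, 600) to roughly (1, -1), the bottom-right corner.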
edit: A: no it won't work! the model is not in screen_space. Q: what if I make sure that the model never exceeds screen space dimensions in the world, as in the world coordinates will always be greater than 0 and less than screen_height, screen_width? I want to transform a 2D sprite in a 2D software rasterizer, so I really don't need the projection, camera space, etc.? Or would it still be better to use them?
Not sure if I understand you correctly. You don't have to limit your model coordinates to screen_width, screen_height, unless the model is supposed to always stay visible on the screen, I guess. Also, if you position your models in, say, meters as opposed to pixels, that's perfectly fine too, you'll just need a slightly different matrix to transform them to normalized coordinates. As far as the camera is concerned, again, it's totally up to you whether to use one or not. I'd say if your screen doesn't move at all, you don't need a camera. Otherwise having a camera can be convenient, as you will always have a point of reference "attached" to the visible part of your world and will be able to transform your scene into normalized view space based on that point of reference. If I'm not much mistaken, for a 2D camera you'll only need to specify the position (the center of the view) and the view width and height. Then to transform a point from world to clip space you just need to translate it by -cameraX, -cameraY and then scale by 2/cameraViewWidth, 2/cameraViewHeight. You can think of it as a more general case of the matrix you've constructed, where the camera is positioned at (scr_width/2, scr_height/2) and its view dimensions are (scr_width, scr_height).
Well, hopefully it helps or at least doesn't confuse you even more. Please correct me if something I've said is utter nonsense, as I can't say I'm 100% confident with the topic either.
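The camera transform described above could be sketched like this in plain Python. It assumes a Y-up world, a camera position that marks the center of the view, and column vectors (x, y, 1). The names `camera_to_clip` and `transform` are hypothetical, not from any library.

```python
def camera_to_clip(camera_x, camera_y, view_width, view_height):
    """Translate by (-camera_x, -camera_y), then scale by 2/view size.

    Both steps are folded into a single 3x3 matrix.
    """
    sx = 2.0 / view_width
    sy = 2.0 / view_height
    return [
        [sx, 0.0, -camera_x * sx],
        [0.0, sy, -camera_y * sy],
        [0.0, 0.0, 1.0],
    ]

def transform(m, x, y):
    """Apply a 3x3 matrix to the 2D point (x, y, 1); return (x', y')."""
    v = (x, y, 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(2))
```

Placing the camera at (scr_width/2, scr_height/2) with view dimensions (scr_width, scr_height) gives translation terms of -1 in both rows, which reproduces the Y-up matrix from the question. Moving the camera only changes the translation column, so scrolling the view does not require touching the sprite coordinates.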