An optimization you can make when doing forward rendering is to calculate the TBN matrix in the vertex shader and use it to pre-transform the light vector into tangent space before sending the transformed light vector to the pixel shader. Then in the pixel shader you read the normal from the normal map and perform a dot product with the incoming light vector. Since the incoming light vector has already been transformed into tangent space, you don't need to modify it or do anything unusual with the normal from the normal map. You will need to re-normalize the incoming light vector in the pixel shader, since interpolation may shorten it.
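A minimal sketch of that split in plain Python, standing in for shader code (the variable names and the example vectors here are illustrative, not from any real shader):

```python
import math

def normalize(v):
    # Rescale a 3-vector to unit length.
    l = math.sqrt(sum(c * c for c in v))
    return tuple(c / l for c in v)

def to_tangent_space(v, tangent, bitangent, normal):
    # Project a world-space vector onto the T, B and N axes, which takes
    # it into tangent space (assuming an orthonormal basis).
    return (sum(a * b for a, b in zip(v, tangent)),
            sum(a * b for a, b in zip(v, bitangent)),
            sum(a * b for a, b in zip(v, normal)))

# "Vertex shader" work, done once per vertex: pre-transform the
# world-space light vector into tangent space.
tangent, bitangent, normal = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
light_ts = to_tangent_space((0.0, 0.0, 1.0), tangent, bitangent, normal)

# "Pixel shader" work, done once per pixel: interpolation may have
# shortened the vector, so re-normalize, then dot it directly against
# the decoded normal-map normal -- no further transforms needed.
light_ts = normalize(light_ts)
n_map = (0.0, 0.0, 1.0)
diffuse = max(0.0, sum(a * b for a, b in zip(light_ts, n_map)))
```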
And no, you don't sample the normal map in the vertex shader.
Regarding your posted list of steps:
1. From the vertex shader 3 vertices(with normals) are outputted.
No. A vertex shader operates on only a single vertex at a time. The values for each vertex are calculated by the shader and output to an intermediate stage (primitive assembly and rasterization). Once the 3 vertices of a triangle have been processed by the shader, this intermediate stage calculates interpolated versions of those three sets of outputs, and hands the interpolated values to the pixel shader as the input data for the current pixel.
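The interpolation that happens between the two shader stages can be sketched as a simple barycentric blend of the three per-vertex outputs (the weights and normals below are just example values):

```python
def interpolate(vals, bary):
    # Blend three per-vertex outputs with barycentric weights, as the
    # rasterizer does before invoking the pixel shader for a pixel.
    w0, w1, w2 = bary
    return tuple(w0 * a + w1 * b + w2 * c for a, b, c in zip(*vals))

# Three per-vertex normals emitted by the vertex shader...
normals = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
# ...blended with the barycentric weights of one pixel inside the triangle.
n_interp = interpolate(normals, (0.5, 0.25, 0.25))
print(n_interp)  # (0.0, 0.5, 0.5) -- note it is no longer unit length
```

Note that the blended result is generally shorter than unit length, which is why interpolated normals and light vectors get re-normalized in the pixel shader.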
2. In the pixel shader; an interpolated normal (N1) from these 3 vertex normals is passed to the PS.
Basically, yes.
3. A sampled Normal (N2) from the normal map is retrieved.
Yes. Note that the normal stored in a tangent-space normal map is encoded, so this step will need to decode the normal. In particular, the x and y components need to be multiplied by 2 and then have 1 subtracted, to map them from the [0,1] texture space to the [-1,1] coordinate space. The z component, or blue channel, is typically stored at or near 1, so once the x/y components are decoded the whole vector needs to be normalized to unit length. This decoded normal represents the surface normal of the fragment in tangent space.
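The decode step in plain Python (the input color here is the classic "flat" normal-map blue, roughly RGB (128, 128, 255) scaled to [0,1]):

```python
import math

def decode_normal(rgb):
    # Map each channel from texture space [0,1] to vector space [-1,1],
    # then re-normalize: 8-bit quantization and the z channel stored
    # near 1 leave the raw vector slightly off unit length.
    v = tuple(c * 2.0 - 1.0 for c in rgb)
    l = math.sqrt(sum(c * c for c in v))
    return tuple(c / l for c in v)

# The flat normal-map color decodes to (approximately) the straight-up
# tangent-space normal (0, 0, 1).
n = decode_normal((128 / 255, 128 / 255, 1.0))
```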
4. Using (N1) and (N2) we create a transform matrix (TBN).
No, the TBN matrix is constructed from the vertex normal, the tangent and the bitangent. These three vectors form a "miniature" 3D coordinate space, if you will, where the normal corresponds to the local Z axis of the space, and the tangent and bitangent correspond to the X and Y axes. These three vectors need to be perpendicular (orthogonal) to one another, just as the global X, Y and Z axes are perpendicular to one another. Typically, a tangent vector for each vertex is calculated in a pre-process pass (during model export at asset creation, in a later build step, or when the model is loaded into the game) and passed to the vertex shader as an attribute along with the normal and vertex position. The bitangent can be calculated in the shader by taking the cross product of the normal and the tangent.
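Constructing the basis from the two per-vertex attributes is just a cross product (the example normal and tangent below are illustrative):

```python
def cross(a, b):
    # Standard 3D cross product.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

# Per-vertex attributes: the normal and the precomputed tangent.
normal  = (0.0, 0.0, 1.0)
tangent = (1.0, 0.0, 0.0)

# The bitangent completes the orthogonal basis.
bitangent = cross(normal, tangent)

# The rows of the TBN matrix are the tangent-space X, Y and Z axes.
tbn = (tangent, bitangent, normal)
```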
Once the TBN matrix is calculated, it is used to transform the light vector, which is originally in world space. After the transformation, the light vector points relative to this miniature coordinate space (called tangent space). Since the decoded normal also lies in tangent space, a dot product between the light vector and the decoded normal gives the correct shading for the pixel. This shading is then applied to the diffuse color to get the final diffuse value for the fragment.
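Putting the last two steps together in plain Python (the TBN basis, light direction and albedo color below are made-up example values):

```python
def mat3_mul_vec(m, v):
    # Row-major 3x3 matrix times a column vector.
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))

# Rows of the TBN matrix are the tangent, bitangent and vertex normal;
# multiplying by it takes a world-space vector into tangent space.
tbn = ((1.0, 0.0, 0.0),   # tangent
       (0.0, 1.0, 0.0),   # bitangent
       (0.0, 0.0, 1.0))   # vertex normal

light_world   = (0.0, 0.6, 0.8)   # unit-length world-space light direction
light_tangent = mat3_mul_vec(tbn, light_world)

# Lambertian term against the decoded normal-map normal, then applied
# to the sampled diffuse (albedo) color to get the final diffuse value.
n_decoded = (0.0, 0.0, 1.0)
n_dot_l   = max(0.0, sum(a * b for a, b in zip(light_tangent, n_decoded)))
albedo    = (0.9, 0.2, 0.2)
final     = tuple(c * n_dot_l for c in albedo)
```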