There's an even faster way if you're doing triangle rasterization. You can split the triangle into two edge pairs (based on the vertex with the middle Y coordinate). For instance, in your original image, you would split the triangle (v0,v1,v2) horizontally through v0 to produce two triangles that can be rasterized in one pass each. There are two stages to the interpolation: edge interpolation which proceeds from bottom to top and horizontal interpolation from left to right.

First you calculate the derivative for each interpolated quantity with respect to each vertical pixel. This allows you to incrementally compute the interpolated values at each edge point, starting from the bottom of each edge. Interpolating both vertical edges of a triangle at once, you then perform a similar incremental interpolation across each scan line. This means that most pixels on big triangles only require an add operation for each pixel/interpolated value to compute the next one. You can't get any faster than that.

I got halfspace rejection with block rastererization (see here) running about 3-4 times faster than my previous scanline-based code (~1.6-2M cycles down to about 0.5M per triangle). I never considered multithreading my scanline code, but halfspaces completely decouple interpolation across the triangle (eg rounding errors from deltas) from position information, which is perfect for multithreading. The only trick is getting interpolated vertex attributes, which this thread is about. Since edge derivatives aren't available to me, I can't exploit them the way you're describing (my scanline code did, though).

EDIT: note that of the 0.5M cycles only about 0.06-0.1 are spent interpolating across the triangle and attribute interpolation. The rest of the overhead is from type conversions and shading the fragment itself.

Also - my test triangles are relatively small, but I'm keeping them constant size for comparison.