this optimization was actually common back when you were doing your own software rasterization, but nowadays only some specialized libs on consoles do it, and only for rare cases. in most cases you are not vertex bound but pixel bound (if not, then you're doing something wrong ;) ).
some libraries also calculate the coverage of a triangle and reject it if it touches no pixel or has zero area in screen space.
while a GPU can process bazillions of vertices and fragments in parallel, triangle setup is a fairly serial process. if you are unlucky, your GPU ends up with bubbles in the pipeline because triangle setup is busy rejecting backface triangles, neither feeding the fragment units nor freeing buffer space for new vertices. e.g. if you have a sphere with 1 million triangles on screen and you rotate it so that the back faces are processed first, there will likely be bubbles. if you rotate it so that the front faces are processed first, the back face processing later on may be hidden, because the fragment units are still busy with the front faces.
an optimization of your idea is called "clustered backface culling": you basically sort your triangles by orientation into bins. later you just check the orientation of a bin and can reject a whole bunch of faces in one go. of course that's an approximation, some backfaces will be left over, but culling 80% of them with 1% of the work is a good trade-off.
http://zach.in.tu-clausthal.de/teaching/cg1_0607/literatur/clustered_backface_culling.pdf