The problem here is that you're only measuring the front end of the graphics pipeline instead of doing a proper end-to-end measurement. You count the number of vertices you have, you see that you can cut them to a quarter, and you expect a huge win, but all too often it isn't one.
There's a LOT that goes on beyond mere vertex submission. Measuring only the number of vertices and expecting that to be the primary determinant of performance ignores all of that other work.
For, say, a quad that occupies a 64x64 region of the screen, you have 4 vertices but 4096 pixels/fragments. That alone should tell you that the pixel/fragment side is where the real gains are to be made.
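To put numbers on that quad, here's a quick back-of-the-envelope sketch. The function name is made up for illustration, and it assumes the quad is fully on screen with no overdraw, no multisampling and no early rejection, so each covered pixel costs exactly one fragment shader invocation.

```python
def quad_shader_invocations(width_px, height_px, vertex_count=4):
    """Return (vertex invocations, fragment invocations) for one quad.

    Assumption: every covered pixel runs the fragment shader exactly once.
    """
    fragments = width_px * height_px
    return vertex_count, fragments


vertices, fragments = quad_shader_invocations(64, 64)
# Fragment shader runs per vertex shader run for this quad
print(vertices, fragments, fragments // vertices)  # 4 4096 1024
```

So even if you eliminated the vertex work for this quad entirely, you'd have removed 4 shader invocations out of roughly 4100.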
In the case of the GS stage, just having it active with nothing more than a simple pass-through shader will give you a 5% to 20% performance hit. So you need to be absolutely certain that the operations you're doing in there (specifically moving per-vertex work to per-primitive) are going to give you back more than 5% to 20% in order for it to be viable. And as I've discussed above, the vertex overhead is going to be so low that this is hugely unlikely in your use case.
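You can rough out the break-even point with a toy model. Everything here is an assumption for illustration: it treats the 5% to 20% hit as a fixed fraction of the original frame time, treats vertex cost as scaling linearly with vertex count, and the function and parameter names are my own. It's no substitute for actually profiling.

```python
def relative_frame_time(vertex_share, vertices_kept, gs_overhead):
    """Frame time with the GS enabled, relative to 1.0 without it.

    vertex_share:  fraction of the original frame time spent on vertex work
    vertices_kept: fraction of vertices still submitted (0.25 = cut to a quarter)
    gs_overhead:   cost of having the GS active, as a fraction of frame time
    """
    return (1.0 - vertex_share) + vertex_share * vertices_kept + gs_overhead


def break_even_vertex_share(vertices_kept, gs_overhead):
    """Vertex share of frame time above which the GS starts to pay off."""
    return gs_overhead / (1.0 - vertices_kept)


for overhead in (0.05, 0.20):
    share = break_even_vertex_share(0.25, overhead)
    print(f"GS overhead {overhead:.0%}: vertex work must exceed {share:.1%} of frame time")
# GS overhead 5%: vertex work must exceed 6.7% of frame time
# GS overhead 20%: vertex work must exceed 26.7% of frame time
```

Under those assumptions, cutting vertices to a quarter only wins if vertex processing was already eating somewhere between about 7% and 27% of your frame, which isn't the case when the fragment side dominates.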
The exceptions are cases where you absolutely must use the GS, such as multiple viewports, stream out, adjacency, calculating per-face normals, or anything else that needs to operate on an entire primitive. That's where the GS is useful. But using it to generate additional vertices on the fly - not so much.