This is how I've done it in the past:
Grass "clumps" are placed in the world individually or via a spray brush in the world editor. The brush and placement tool have various options to make this easy, including an 'align to terrain' behavior and controls over how sizes, orientations and texture variations are handled. This process generates a massive list of positions, scales and orientations (8 floats per clump). There are millions of grass clumps, so storing all of that raw won't do...
At export time the global list of grass clumps is partitioned into a regular 3D grid. Each partition of the grid holds a list of the clumps it owns and quantizes each clump's position, scale and orientation into two 32-bit values. The first value contains four 8-bit integers: the first three represent a position normalized w.r.t. the partition's extents, and the fourth is a uniform scale factor. The second 32-bit value is a quaternion with its components quantized to 8-bit integers (0 = -1 and 255 = 1).
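A minimal sketch of that packing step, assuming a simple axis-aligned partition struct and a known maximum scale to normalize against (both names here are illustrative, not from the original system):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical partition bounds; the real editor/exporter would own these.
struct Partition { float minX, minY, minZ, sizeX, sizeY, sizeZ; };

// Quantize a float in [0,1] to an 8-bit integer.
static uint8_t q8(float t) {
    return (uint8_t)std::lround(std::clamp(t, 0.0f, 1.0f) * 255.0f);
}

// First 32-bit value: position normalized to the partition's extents,
// plus a uniform scale normalized to some known maximum.
uint32_t packPosScale(const Partition& p,
                      float x, float y, float z,
                      float scale, float maxScale) {
    uint8_t qx = q8((x - p.minX) / p.sizeX);
    uint8_t qy = q8((y - p.minY) / p.sizeY);
    uint8_t qz = q8((z - p.minZ) / p.sizeZ);
    uint8_t qs = q8(scale / maxScale);
    return qx | (qy << 8) | (qz << 16) | ((uint32_t)qs << 24);
}

// Second 32-bit value: quaternion components mapped from [-1,1] to [0,255].
uint32_t packQuat(float qx, float qy, float qz, float qw) {
    auto q = [](float c) { return (uint32_t)q8(c * 0.5f + 0.5f); };
    return q(qx) | (q(qy) << 8) | (q(qz) << 16) | (q(qw) << 24);
}
```

Note the asymmetry is deliberate: position error is bounded by partition size, so smaller partitions buy you precision without more bits per clump.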
At runtime the contents of nearby partitions are decompressed on the fly into data that's amenable to hardware instancing. This was a long time ago, so it was important that the vertex shader didn't spend too much time unpacking data. These days, with bit operations available on the GPU, you might be able to consume the compressed data directly in the shader and skip managing compressed and decompressed chunks at runtime entirely; if you do still decompress on the CPU, use a background thread.
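The decode side is just the inverse of the packing. A sketch of the CPU-side decompression into per-instance data (the `Instance` layout and parameter names are illustrative; on the GPU the same byte extraction maps to shader bitfield/unpack operations):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical per-instance layout handed to the hardware-instancing path.
struct Instance { float x, y, z, scale; float qx, qy, qz, qw; };

// Inverse of the export-time packing: two 32-bit values back to floats.
Instance unpackClump(uint32_t posScale, uint32_t quat,
                     float minX, float minY, float minZ,
                     float sizeX, float sizeY, float sizeZ,
                     float maxScale) {
    auto byteF = [](uint32_t v, int shift) { return float((v >> shift) & 0xFF); };
    Instance out;
    out.x = minX + (byteF(posScale, 0)  / 255.0f) * sizeX;
    out.y = minY + (byteF(posScale, 8)  / 255.0f) * sizeY;
    out.z = minZ + (byteF(posScale, 16) / 255.0f) * sizeZ;
    out.scale = (byteF(posScale, 24) / 255.0f) * maxScale;
    // Map quaternion bytes from 0..255 back to -1..1, then renormalize
    // to repair the small error introduced by 8-bit quantization.
    float qx = byteF(quat, 0)  / 255.0f * 2.0f - 1.0f;
    float qy = byteF(quat, 8)  / 255.0f * 2.0f - 1.0f;
    float qz = byteF(quat, 16) / 255.0f * 2.0f - 1.0f;
    float qw = byteF(quat, 24) / 255.0f * 2.0f - 1.0f;
    float invLen = 1.0f / std::sqrt(qx*qx + qy*qy + qz*qz + qw*qw);
    out.qx = qx * invLen; out.qy = qy * invLen;
    out.qz = qz * invLen; out.qw = qw * invLen;
    return out;
}
```

The renormalization step matters: a quantized quaternion is no longer unit length, and using it unnormalized would subtly scale the instanced mesh.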
It worked pretty well and was extended to support arbitrary meshes, so the initial concept of a grass clump evolved to include pebbles, flowers, sticks, debris etc. Any mesh that was static and replicated many times throughout the world was a good candidate for this system, as long as its vertex count was low enough to justify the loss of the post-transform cache caused by the type of HW instancing being used.