Let's back up a bit here. Our ultimate goal for our baked lightmap/probe data was to store a representation of the incoming radiance on the sphere or hemisphere surrounding a point, so that we can integrate our diffuse and specular BRDFs against that radiance in order to determine the outgoing reflectance for the viewing direction. In simpler terms, we need to store the incoming lighting in a way that makes it convenient to compute the visible light coming out. Spherical Gaussians (SGs) are nice for this purpose, since a small number of them can potentially represent the radiance field as long as the radiance is somewhat low-frequency. They can do this because they have a "width", which means that the sum of them can potentially cover the entire surface area of a hemisphere or sphere. SGs are also convenient because it's possible to come up with a reasonable closed-form approximation for the integral of (BRDF * SG). This ultimately means that computing the outgoing lighting for a pixel is an O(N) operation, where N is the number of SGs per sample point.
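To make that concrete, here is a minimal sketch of what an SG looks like and why the O(N) claim holds. The representation and function names here are my own illustration (not code from the post): an SG is typically defined as G(v) = a * exp(λ(μ·v − 1)), with unit axis μ, sharpness λ, and amplitude a, and it has a well-known closed-form integral over the sphere.

```python
import math

# Hypothetical minimal SG representation (names are my own, for illustration).
# An SG is G(v) = amplitude * exp(sharpness * (dot(axis, v) - 1)), where
# `axis` is a unit vector pointing at the lobe center and `sharpness`
# controls the lobe's "width".

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sg_eval(amplitude, axis, sharpness, v):
    """Evaluate one Spherical Gaussian lobe in unit direction v."""
    return amplitude * math.exp(sharpness * (dot(axis, v) - 1.0))

def sg_integral(amplitude, sharpness):
    """Closed-form integral of an SG over the whole sphere:
    2*pi * (amplitude / sharpness) * (1 - exp(-2 * sharpness))."""
    return 2.0 * math.pi * (amplitude / sharpness) * (1.0 - math.exp(-2.0 * sharpness))

def reconstruct_radiance(sgs, v):
    """Sum the per-sample SG lobes to approximate radiance arriving from
    direction v -- an O(N) loop, N = number of SGs per sample point."""
    return sum(sg_eval(a, axis, lam, v) for (a, axis, lam) in sgs)
```

Note how a sharper lobe (larger λ) integrates to a smaller total, which is what lets a handful of wide lobes account for the full sphere of incoming radiance.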

This is all somewhat separate from the issue of how the data is actually stored in textures so that it can be accessed at runtime. In our game, we stored the data in 2D textures for static meshes (just like standard lightmaps) and in 3D textures for our probe grids. In practice we actually keep an array of textures, where each texture contains just 1 SG for each texel. Storing as textures means that we can use texture filtering hardware to interpolate spatially between neighboring texels, either along the surface for static meshes or in 3D space for dynamic meshes. Textures also allow you to use block compression formats like BC6H, which can let you save memory.
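The payoff of the one-SG-per-texture layout is that a single filtered fetch from each texture in the array spatially blends that SG's parameters between neighboring texels. Here is a small software sketch (my own illustration, not the game's code) of what the hardware bilinear filter effectively computes for one SG "layer", using a scalar amplitude per texel for brevity (real data would be an RGB amplitude, and 3D probe grids would do the same thing trilinearly):

```python
import math

def bilinear_fetch(tex, u, v):
    """Emulate a hardware bilinear fetch from one SG layer.
    `tex` is a 2D list of per-texel SG amplitudes, (u, v) in texel space."""
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0
    x1 = min(x0 + 1, len(tex[0]) - 1)
    y1 = min(y0 + 1, len(tex) - 1)
    # Blend the four neighboring texels: spatial interpolation of the
    # baked SG data, which the GPU's texture units give you for free.
    top = tex[y0][x0] * (1.0 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1.0 - fx) + tex[y1][x1] * fx
    return top * (1.0 - fy) + bot * fy
```

With mini environment maps packed into an atlas you couldn't lean on this: the filter would blend across unrelated texels, so you'd have to fetch each sample point separately and interpolate manually.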

So to finally get to your question: you're asking why you wouldn't store the baked data in lots of mini 3x3 or 4x4 environment maps. Let's look at this in terms of the two aspects I just mentioned: how the data is stored, and what you're actually storing. Having a set of environment maps allows you to interpolate in the angular domain, since each texel essentially represents a patch of surface area on a sphere or hemisphere. However if we're using SGs, then we don't actually need angular interpolation. Instead we consider each SG individually, which requires no interpolation. We do, however, want to interpolate spatially, since we'll want to shade pixels at a rate that's likely to be greater than the density of the baked texels. As I mentioned earlier, 2D or 3D textures work well for spatial interpolation since we can use hardware filtering, whereas if you used miniature environment maps you would have to manually interpolate between sample points.

Now to get to the other part: what you actually store in the texture. You seem to be suggesting an approach where the environment map contains some kind of single radiance value in each texel, instead of a distribution of radiance like you have with an SG. The problem here is: how do you integrate against your BRDF(s)? If you consider each texel to contain the radiance for an infinitely small solid angle, then you can essentially treat each texel as a directional light and iterate over each. However this is not great, since there will be "holes" in the hemisphere that aren't covered by your sparse radiance samples. So instead of doing that, you might try to pre-integrate your BRDF against a whole bunch of radiance samples, and store that result per-texel (I believe that this is what you're suggesting when you say "approximate a cone"). This is pretty much exactly the approach used by most games for their specular IBL probes. The big catch with pre-filtering is that you can't actually pre-integrate a specular BRDF against a radiance field and store the result in a 2D texture, since the view dependence adds too many dimensions. So you're forced to pre-integrate the NDF assuming that V == N, sample the environment map based on the peak specular direction, and apply the Fresnel/visibility terms after the fact. Separating the NDF from Fresnel/visibility leads to error, and unfortunately this error gets worse as your roughness increases. The pre-integration also assumes a fixed roughness, and so you need to store a mip chain with different roughness values that you interpolate between. This doesn't sound particularly appealing to me due to the error from pre-integration and mip interpolation, and you also lose the benefit of being able to spatially interpolate between your sample points. On top of all of this you still need to handle your diffuse BRDF, which requires an entirely different pre-integration.
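The structure of that pre-filtered approach can be sketched in a few lines. This is my own rough illustration of the general "pre-filter + apply Fresnel/visibility afterwards" pattern described above, not code from any particular engine; `sample_env` and `env_brdf_lut` are hypothetical stand-ins for a pre-filtered cubemap fetch and a 2D lookup table indexed by (N·V, roughness):

```python
# Rough sketch (my own, hypothetical) of pre-filtered specular IBL.
# The NDF was pre-integrated into the environment map assuming V == N,
# so at runtime we sample by roughness + reflection direction and then
# bolt the Fresnel/visibility terms on after the fact.

def prefiltered_specular(sample_env, env_brdf_lut, reflect_dir,
                         n_dot_v, roughness, f0):
    # 1) Fetch the pre-filtered environment map at the mip level that
    #    corresponds to `roughness`, in the peak specular direction.
    prefiltered = sample_env(reflect_dir, roughness)
    # 2) Apply the separately pre-integrated Fresnel/visibility terms,
    #    typically stored as a (scale, bias) pair in a 2D LUT.
    scale, bias = env_brdf_lut(n_dot_v, roughness)
    return prefiltered * (f0 * scale + bias)
```

The error the post describes lives in step 2: because the (scale, bias) pair was integrated independently of the actual radiance distribution, the factorization drifts from the true integral as roughness grows.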

TL;DR - our approach allows us to use hardware filtering to interpolate samples, and we can use an analytical approximation for both diffuse and specular without having to rely on pre-integration.