I would explore the octree approach, though it comes with its own challenge: representing the octree in a structure that is efficient to store and read, since the number of octree nodes can vary greatly from one chunk to another. Still, I do see their usefulness for storing multi-dimensional bitmasks.
For me it is easier to understand just subdividing each chunk into fixed-size smaller sections and then determining whether each sub-section is completely "empty" or "full". It is the same idea as the octree optimization, but with a static number of divisions.
For example, if your chunks are 64^3 blocks each, you can subdivide them into 512 sections of 8^3 blocks. For each section, examine the blocks several at a time. Assuming you use one bit to represent an empty or full block, you can test 32 blocks at once with a 4-byte int: a word equal to 0 means 32 empty blocks, and a word that yields 0 when XORed with 2^32 - 1 means 32 full blocks. Do this for all 16 words of the 8^3 section. If every word is all zeros, or every word is all ones, the section is completely empty or completely solid, so you can store it as just one bit.
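A minimal sketch of that uniformity check, assuming the 512 occupancy bits of an 8^3 section are packed into sixteen 32-bit words (the names `SectionState` and `classify_section` are my own, not from any particular engine):

```cpp
#include <cstddef>
#include <cstdint>

enum class SectionState { Empty, Full, Mixed };

// Decide whether an 8x8x8 section (512 bits = sixteen 32-bit words)
// is uniformly empty, uniformly full, or mixed.
SectionState classify_section(const uint32_t words[16]) {
    uint32_t all_and = 0xFFFFFFFFu; // stays all-ones only if every bit is set
    uint32_t all_or  = 0u;          // stays zero only if every bit is clear
    for (size_t i = 0; i < 16; ++i) {
        all_and &= words[i];
        all_or  |= words[i];
    }
    if (all_or == 0u)           return SectionState::Empty;
    if (all_and == 0xFFFFFFFFu) return SectionState::Full;
    return SectionState::Mixed;
}
```

Accumulating an AND and an OR across the words does both tests in one pass; `words[i] ^ 0xFFFFFFFFu == 0` is equivalent to the `all_and` check.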
For each sub-section, if dense storage is required, store one bit for every block; if simple storage is possible, store exactly one bit that accounts for all of its blocks. Then keep 512 flag bits per chunk, one per sub-section, telling you which storage kind that sub-section uses: 1 = dense storage, 0 = simple storage (you can reverse the flag meanings if you want).
So laying it out further you can write a flat structure for each chunk like this:
[512 bit flags] [dense storage bits] [simple storage bits]
The total number of storage bits will vary per chunk depending on how complex its topology is: something that looks like Swiss cheese needs more space than a completely solid or empty chunk. When reading, each flag bit tells you whether to read ahead into the dense storage or the simple storage to get the voxel information. As the number of simple bits may not be divisible by 8, you may want to pad that data at the end.
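As a rough size sketch of that layout, assuming 64^3 chunks split into 512 sections of 8^3 (`chunk_bytes` is an illustrative helper of my own):

```cpp
#include <cstddef>

// [512 flag bits] [dense storage bits] [simple storage bits]
// Serialized size in bytes for one chunk, given how many of its
// 512 sections need dense storage.
size_t chunk_bytes(size_t dense_sections) {
    size_t simple_sections = 512 - dense_sections;
    return 64                          // 512 flag bits = 64 bytes
         + dense_sections * 64         // 8^3 = 512 bits = 64 bytes per dense section
         + (simple_sections + 7) / 8;  // 1 bit per simple section, padded to a byte
}
```

A fully uniform chunk serializes to 128 bytes (64 bytes of flags plus 64 bytes of simple bits), while the worst case with all 512 sections dense takes 64 + 512 * 64 = 32,832 bytes, only 64 bytes more than the raw 64^3 bitmask.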
Recently I have been experimenting with voxel meshes for my project, and this chunk subdivision works well enough to keep the world files small. I can probably get down to somewhere in the order of a few dozen megabytes for a 4096x4096x256 world, whereas a worst-case or completely uncompressed world of that size requires 512 MB. I suspect that's where octrees can optimize this further, but as I said before, fixed-size divisions are easier for me to implement.
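The 512 MB worst-case figure checks out at one occupancy bit per block; a quick sanity check (the function name is mine):

```cpp
#include <cstdint>

// One occupancy bit per block for a 4096 x 4096 x 256 world.
uint64_t worst_case_mib() {
    uint64_t blocks = 4096ull * 4096ull * 256ull; // 2^32 blocks
    uint64_t bytes  = blocks / 8;                 // 2^29 bytes
    return bytes / (1024 * 1024);                 // in MiB
}
```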
Personally, I'm now more trying to figure out how to reduce the blockiness, specifically how to implement marching cubes efficiently. I assume you shouldn't have to brute-force march every single set of corners in the world, or that would be billions of operations!