    mov bx, Variable
    . . .
    add ax, bx

you would rather write
    lea di, m1          ; load the address of the instruction at label "m1"
    add di, 1           ; skip the opcode byte to reach the 16-bit constant
                        ; (assumes the assembler emits the 05 iw form of "add ax, imm16")
    mov bx, Variable
    mov [di], bx        ; overwrite the constant (assumes DS == CS, as in a .COM program)
    . . .
m1: add ax, 1234h       ; placeholder, will be replaced by the code above

I used this a lot in the DOS days for rasterization. For example, if you calculate the edges of a triangle with y = m*x + b, then m and b are constant for the whole edge, yet they occupy otherwise precious register space (and you had just ax, bx, cx, dx, di and si, besides the stack etc.). Since they don't change, you can instead patch those values straight into the instructions in the binary as constants.
The next step that comes to mind: if you have some inner loop that you would otherwise need to rewrite 100 times for various cases (and some guys do that, e.g. http://www.amazon.com/Tricks-Programming-Gurus-Advanced-Graphics-Rasterization/dp/0672318350/ ), you can just add some jumps and modify their destination offsets. Static jumps are executed in a different part of the CPU than all the math etc. and are essentially free, as there is no misprediction. That way you can switch textures, blending etc. of the rasterizer on and off with just a few lines of code.
As mentioned above, there are a few guys who write a runtime compiler for that, but that's the crazy banana version for when you really, really want the best performance, and it's rather meant for complex shader cases where you would otherwise end up with a crazy amount of jumps. For simple cases (<= Direct3D 6), modifying some constants was good enough to get the needed performance. It also made no sense to copy code chunks around, as the copy would cost you time and the result would barely run faster than a modified jump (aka jump table) into that code area.
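Portable C can't patch a jump target, so the nearest analog is a pair of function pointers that get re-pointed when a feature is toggled. Here's a minimal sketch under that assumption. All names (`raster_state`, `set_features`, `draw_span`, the 8-bit "pixels") are made up for illustration, and the calls are indirect, so they are not the free static jumps described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two patchable "jump slots": where the source colour comes from and how it is written. */
typedef uint8_t (*fetch_fn)(uint8_t color, uint8_t texel);
typedef uint8_t (*write_fn)(uint8_t src, uint8_t dst);

static uint8_t fetch_flat(uint8_t color, uint8_t texel)    { (void)texel; return color; }
static uint8_t fetch_texture(uint8_t color, uint8_t texel) { (void)color; return texel; }
static uint8_t write_replace(uint8_t src, uint8_t dst)     { (void)dst; return src; }
static uint8_t write_blend(uint8_t src, uint8_t dst)       { return (uint8_t)((src + dst) / 2); }

struct raster_state {
    fetch_fn fetch;
    write_fn write;
};

/* Toggling a feature only re-points a slot; the inner loop itself is never rewritten. */
void set_features(struct raster_state *rs, int texturing, int blending)
{
    assert(rs != NULL);
    rs->fetch = texturing ? fetch_texture : fetch_flat;
    rs->write = blending ? write_blend : write_replace;
}

/* One inner loop that serves every feature combination. */
void draw_span(const struct raster_state *rs, uint8_t *dst,
               const uint8_t *tex, size_t n, uint8_t color)
{
    assert(rs != NULL && rs->fetch != NULL && rs->write != NULL);
    for (size_t i = 0; i < n; ++i) {
        uint8_t src = rs->fetch(color, tex[i]);
        dst[i] = rs->write(src, dst[i]);
    }
}
```

Call `set_features(&rs, 1, 1)` once per draw call and then `draw_span(&rs, ...)` per scanline; switching texturing or blending off is a single pointer assignment.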
Today it's a bit dangerous: caches and pipelines assume that code is static. Even with plain data you can run into hazards in multithreaded applications, and that's even more dangerous for code segments. Still, it's not impossible. I think pretty much every OS allows you to unlock segments for writing/modifying, and if you know the CPU architecture, you can enforce the syncs that are needed.
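Here's a minimal C sketch of that patch, protect and sync sequence, assuming Linux on x86-64 with GCC or Clang. The function name `patched_add`, the byte template and the 4096-byte page size are assumptions for illustration, and on any other platform the function simply returns -1. Rather than unlocking an existing code segment, it patches a freshly mapped page and then flips it to executable, which is the simpler variant of the same idea.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* asks glibc to expose MAP_ANONYMOUS in strict -std modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__linux__) && defined(__x86_64__)
#include <sys/mman.h>
#endif

/* Computes x + constant by patching the constant into machine code at run time.
   Returns 0 on success, -1 if the platform is unsupported or a system call fails. */
int patched_add(int x, int32_t constant, int *out)
{
    assert(out != NULL);
#if defined(__linux__) && defined(__x86_64__) && defined(__GNUC__) && defined(MAP_ANONYMOUS)
    /* x86-64 System V machine code for: int f(int x) { return x + 0; } */
    static const unsigned char code_template[] = {
        0x89, 0xF8,                   /* mov eax, edi                 */
        0x05, 0x00, 0x00, 0x00, 0x00, /* add eax, imm32 (placeholder) */
        0xC3                          /* ret                          */
    };
    const size_t imm_offset = 3;      /* 2 bytes of mov + 1 opcode byte of add */
    const size_t page = 4096;         /* assumed page size */

    /* Start with a writable, non-executable page. */
    unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memcpy(buf, code_template, sizeof code_template);
    memcpy(buf + imm_offset, &constant, sizeof constant);  /* patch the immediate */

    /* Flip the page to read + execute before running it. */
    if (mprotect(buf, page, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, page);
        return -1;
    }
    /* The sync step: matters on architectures with separate instruction caches. */
    __builtin___clear_cache((char *)buf, (char *)buf + sizeof code_template);

    int (*fn)(int);
    memcpy(&fn, &buf, sizeof fn);     /* data pointer -> function pointer */
    *out = fn(x);

    munmap(buf, page);
    return 0;
#else
    (void)x;
    (void)constant;
    return -1;
#endif
}
```

On a supported system, `patched_add(40, 2, &out)` should return 0 and leave 42 in `out`. To patch again later, you would `mprotect` the page back to `PROT_READ | PROT_WRITE`, make sure no other thread is executing it, rewrite the bytes, and repeat the protect and sync steps.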
The craziest thing I've done with SMC was for my raytracer, where I basically 'dumped' the BSP tree as assembly code. Instead of a tiny loop that steps unpredictably down either side of the BSP tree, the 'unrolled' code was processed in a very similar way from ray to ray (every ray starts at the same place, most will be split by the same node as the previous ray, and most will take the same branch down to the same leaf as the previous ray).
Sadly, it only worked out for a small BSP. Before I had even run out of L1 instruction cache, I had somehow run out of the space that the branch prediction can cover, and then the performance dropped dramatically, below the version with the tiny loop. The next, even more 'crazy' step would be to evaluate every frame the most likely walking path through the BSP and dump a code tree that aligns with what the static branch prediction would guess. But I didn't do that, as my way of SMC was to dump a C++ file and invoke the cl.exe of Visual Studio to generate a binary lib that I parsed and copied into my binary, which is OK at load time, but not if you have 16 ms.
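The "dump the tree as source code" step could look roughly like the C sketch below. It is not the original raytracer code: the node layout (`bsp_node`), the buffer type (`code_buf`) and the generated function name `locate_leaf` are all hypothetical, and it emits a simple point-location function rather than a full ray traversal. Internal nodes are assumed to have both children.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical BSP node: a leaf has no children, an internal node has both. */
struct bsp_node {
    int axis;                       /* 0 = x, 1 = y, 2 = z */
    float split;                    /* position of the splitting plane */
    const struct bsp_node *front;   /* side with p[axis] >= split */
    const struct bsp_node *back;    /* side with p[axis] <  split */
    int leaf_id;                    /* only meaningful for leaves */
};

struct code_buf {
    char text[2048];
    size_t len;
};

/* Appends formatted text, silently truncating when the buffer is full. */
static void emitf(struct code_buf *cb, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(cb->text + cb->len, sizeof cb->text - cb->len, fmt, ap);
    va_end(ap);
    if (n > 0) {
        size_t room = sizeof cb->text - cb->len - 1;
        cb->len += (size_t)n < room ? (size_t)n : room;
    }
}

/* Every tree node becomes one if/else, so the tree shape ends up in the code itself. */
static void emit_node(struct code_buf *cb, const struct bsp_node *n, int depth)
{
    if (n->front == NULL && n->back == NULL) {
        emitf(cb, "%*sreturn %d;\n", depth * 4, "", n->leaf_id);
        return;
    }
    emitf(cb, "%*sif (p[%d] < %.9g) {\n", depth * 4, "", n->axis, (double)n->split);
    emit_node(cb, n->back, depth + 1);
    emitf(cb, "%*s} else {\n", depth * 4, "");
    emit_node(cb, n->front, depth + 1);
    emitf(cb, "%*s}\n", depth * 4, "");
}

/* Writes the source of "int locate_leaf(const float *p)" for the given tree. */
void dump_bsp_as_c(struct code_buf *cb, const struct bsp_node *root)
{
    assert(cb != NULL && root != NULL);
    cb->len = 0;
    cb->text[0] = '\0';
    emitf(cb, "int locate_leaf(const float *p)\n{\n");
    emit_node(cb, root, 1);
    emitf(cb, "}\n");
}
```

The resulting text in `cb.text` would then be written to a file and handed to the compiler, which is the slow part that rules this out as a per-frame step.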