Multiply operations potentially need a 16-bit result, unless you know your operands have a small magnitude. Numbering bit positions from 0 at the least significant bit, if the positions of the left-most 1 in each operand sum to 6 or less, an unsigned 8-bit multiply can't overflow (e.g. 00001111 x 00001111 = 225 won't overflow). If they sum to exactly 7, the outcome depends on the lower bits: 10000000 x 00000001 won't overflow, 00010000 x 00001111 = 240 won't overflow, and even 00010001 x 00001111 = 255 just fits -- but 00010010 x 00001111 = 270 does overflow. If they sum to 8 or more, the multiply always overflows. To be suitable for larger operands, you need to provide more than 8 result bits -- at least temporarily, if you know you can get the result back in range by the end of the algorithm. These extra bits take the form of the most significant result bits.
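As a rough illustration of keeping the extra most significant bits around, here is a minimal C sketch for unsigned 8-bit operands. The function names are made up for this example, and treating the second operand as a weight out of 256 is just one assumed way of getting the result back in range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Full product of two unsigned 8-bit values.
   The largest case, 255 * 255 = 65025, always fits in 16 bits. */
static uint16_t mul_u8_wide(uint8_t a, uint8_t b) {
    return (uint16_t)((uint16_t)a * (uint16_t)b);
}

/* True when the product would also fit in 8 bits,
   i.e. the high byte of the wide result is zero. */
static bool mul_u8_fits(uint8_t a, uint8_t b) {
    return (mul_u8_wide(a, b) >> 8) == 0;
}

/* Multiply in 16 bits, then bring the result back into 8 bits by
   keeping only the high byte. This treats `weight` as weight/256
   (an assumption of this sketch) and truncates instead of rounding. */
static uint8_t mul_u8_high(uint8_t value, uint8_t weight) {
    return (uint8_t)(mul_u8_wide(value, weight) >> 8);
}
```

For instance, mul_u8_fits(16, 15) is true because 240 fits in 8 bits, while mul_u8_high(200, 128) scales 200 by one half and gives 100.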
Division is similar, but you need at least one extra result bit on top of the 16-bit result that multiplication needs, and all the extra bits take the form of the least significant (fractional) result bits.
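One way to picture those extra least significant bits is to pre-shift the numerator so the quotient comes out in fixed point. This is only a sketch: the name div_u8_fixed and the frac_bits parameter are hypothetical, and how many fractional bits you really need depends on your algorithm.

```c
#include <stdint.h>

/* Unsigned 8-bit division that keeps `frac_bits` fractional bits.
   The result is a fixed-point number: the integer part sits above
   the low `frac_bits` bits, the fraction occupies those low bits.
   Assumptions of this sketch: b != 0 and frac_bits <= 24, so the
   shifted numerator still fits in 32 bits. The fraction is truncated. */
static uint32_t div_u8_fixed(uint8_t a, uint8_t b, unsigned frac_bits) {
    return ((uint32_t)a << frac_bits) / b;
}
```

With 8 fractional bits, div_u8_fixed(1, 2, 8) yields 128, which is 0.5 scaled by 256.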
One technique you could try is to transform the algorithm algebraically to see if you can bend it into a form that's more suitable for your hardware -- e.g. replace divisions with multiplications by the inverse (which can win if you only need to calculate the inverse once and it's used multiple times, or if you can accept a faster approximation -- a trick Quake used, IIRC). There are other old-school tricks that could help -- for example, multiplying and dividing by powers of two can be replaced with shifts, which are often but not always faster than a multiply, and usually faster and never slower than a divide.
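To make the inverse trick concrete, here is a hedged C sketch for unsigned 8-bit values. The helper names are invented for illustration. It assumes the divisor is nonzero and reused many times, so the one real divide in recip_u16 is paid once. With a rounded-up 16-bit reciprocal and 8-bit dividends, the multiply-and-shift should match truncating division; wider operands would need a wider reciprocal.

```c
#include <stdint.h>

/* Precompute once per divisor: ceil(2^16 / d), i.e. 1/d in 0.16
   fixed point, rounded up. Assumes d != 0. */
static uint32_t recip_u16(uint8_t d) {
    return (65536u + d - 1u) / d;
}

/* x / d using a multiply and a shift instead of a divide.
   `recip` must come from recip_u16(d). */
static uint8_t div_by_recip(uint8_t x, uint32_t recip) {
    return (uint8_t)((x * recip) >> 16);
}

/* Powers of two need no multiply or divide at all (unsigned values). */
static uint16_t mul_by_8(uint8_t x) { return (uint16_t)(x << 3); }
static uint8_t  div_by_4(uint8_t x) { return (uint8_t)(x >> 2); }
```

In a loop you would call recip_u16 once outside the loop and div_by_recip for every element. Note that the shift shortcut for division only matches truncating division directly for unsigned values; negative signed values round differently under an arithmetic shift.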
If your hardware really lacks SIMD (the ARM equivalent -- I'm assuming ARM, since it's a smartphone -- would be NEON, or VFP on very low-end devices; NEON has integer as well as floating-point instructions, while VFP is floating-point only), it might also have no hardware divide instruction or only a slow one, in which case eliminating divides alone would be a huge gain. If your hardware does turn out to support NEON (and maybe VFP), you might find that it's actually faster to use floating-point math instead to take advantage of those instruction units, converting from and to your 8.8.8.8-bit format at the beginning and the end of your algorithm (or doing away with it in your filtering pipeline entirely).
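The convert-at-the-edges idea might look like the following in plain scalar C. This is a sketch only: it assumes 8-bit integer channels mapped onto the range 0.0 to 1.0, the function names are hypothetical, and it does not use NEON intrinsics (a vectorizing compiler or hand-written intrinsics would be needed to actually use the SIMD units).

```c
#include <stdint.h>

/* Convert an 8-bit channel to a float in [0, 1] on the way in. */
static float u8_to_float(uint8_t v) {
    return (float)v * (1.0f / 255.0f);
}

/* Convert back on the way out, clamping out-of-range results and
   rounding to the nearest 8-bit value. */
static uint8_t float_to_u8(float f) {
    if (f <= 0.0f) return 0;
    if (f >= 1.0f) return 255;
    return (uint8_t)(f * 255.0f + 0.5f);
}
```

Everything between the two conversions can then use ordinary float multiplies and divides without worrying about intermediate overflow, with the clamp in float_to_u8 catching anything that ends up out of range.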