Add minor msac optimizations
This requires asm changes because refill no longer shifts in 1's, but it shaves a few instructions off common code paths. That is probably still worth it, considering those functions are called billions of times while decoding an average video.
Overall decoding performance on x86 seems to improve by around 0.2% in my testing.
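
As a rough illustration of where instructions can be saved, here is a minimal sketch in C contrasting a shift that fills the vacated low bits with 1's against a plain shift. The helper names and the 64-bit window type are assumptions made for this example and are not taken from the patch; `d` is assumed to be less than 64.

```c
#include <stdint.h>

/* Hypothetical helper (illustration only): shift the window left by d bits
 * and fill the d vacated low bits with 1's. Needs an add and a subtract
 * around the shift. */
static inline uint64_t shift_in_ones(uint64_t dif, unsigned d) {
    return ((dif + 1) << d) - 1;
}

/* Hypothetical helper (illustration only): plain left shift, so the d
 * vacated low bits are 0. A single shift instruction. */
static inline uint64_t shift_in_zeros(uint64_t dif, unsigned d) {
    return dif << d;
}
```

The two results differ only in the low `d` bits, so any code that consumes the window has to agree on which padding convention is in use. That is consistent with the asm needing to change alongside the C code.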
Edited by Henrik Gramner