Improve density of group context setting macros
Makes weak ARM cores a little happier. Saw an improvement on a Pi 4. Mostly noise on the x86 CPUs I tested with an earlier version of this change. Not sure whether continuing this past 128 bits would also be an improvement.
Had to use clz to coerce clang into being smarter. Using an xor instead of a subtract seems to have made clang even smarter, for some reason.
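For context, here is a rough sketch of the general kind of clz rewrite where a subtract and an xor are interchangeable. This is not the code from this change: `msb_index_sub` and `msb_index_xor` are hypothetical names, and it assumes a 32-bit `unsigned int` and a GCC/clang-style `__builtin_clz`. Whether the real macros take this exact shape is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the highest set bit, written with a subtract. */
static inline unsigned msb_index_sub(uint32_t x) {
    assert(x != 0); /* __builtin_clz(0) is undefined */
    return 31u - (unsigned)__builtin_clz(x);
}

/* Same value written with an xor. For 0 <= n <= 31, 31 - n == 31 ^ n,
 * because 31 has all five low bits set, so the subtract never borrows. */
static inline unsigned msb_index_xor(uint32_t x) {
    assert(x != 0);
    return 31u ^ (unsigned)__builtin_clz(x);
}
```

Both forms return the same result for every nonzero input, so any difference is purely in what the compiler generates for them.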
Need to write up more, but it's late and I'm tired.