2021.05.04 21:02 "Re: [Tiff] SIMD optimizations", by Larry Bank

There's no need to write asm code to make it fast. The problem is the awful way that the libtiff G4 encoder abuses memory. Clean C code will compile into a good result for both G4 encoding and decoding. The decode side of the libtiff G4 codec isn't terrible because it's not doing anything awful with memory. The right way to count runs of 1-bit pixels is to use the CLZ (count leading zeros) instruction on native-width integers. GCC provides a builtin for it (__builtin_clz and its l/ll variants) and generates an efficient fallback on the few systems that lack the instruction.
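
To make the CLZ idea concrete, here is a minimal sketch of run counting on a packed 1-bit row. It is not libtiff code. The names load_be64 and run_length are made up for illustration, and it assumes MSB-first bit order (the leftmost pixel is the high bit of each byte), a 64-bit word, and GCC or Clang for __builtin_clzll. Real code would load whole aligned words instead of assembling them byte by byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read 8 bytes starting at `byte` as one big-endian
 * word, substituting zeros for anything past the end of the row. */
static uint64_t load_be64(const uint8_t *row, size_t byte, size_t nbytes)
{
    uint64_t w = 0;
    for (size_t i = 0; i < 8; i++) {
        w <<= 8;
        if (byte + i < nbytes)
            w |= row[byte + i];
    }
    return w;
}

/* Hypothetical helper: length of the run of pixels equal to `color`
 * (0 or 1) that starts at pixel `start`, never reading past `width`. */
static size_t run_length(const uint8_t *row, size_t width, size_t start, int color)
{
    size_t nbytes = (width + 7) / 8;
    size_t pos = start;

    while (pos < width) {
        unsigned skip = (unsigned)(pos % 8);   /* pixels already consumed in this byte */
        unsigned valid = 64 - skip;            /* real pixels left after the shift */
        uint64_t w = load_be64(row, pos / 8, nbytes) << skip;

        if (color)
            w = ~w;                            /* turn a run of ones into leading zeros */

        /* __builtin_clzll is undefined for 0, so handle that case by hand. */
        unsigned n = w ? (unsigned)__builtin_clzll(w) : 64;
        if (n > valid)
            n = valid;                         /* ignore zeros shifted in at the bottom */

        pos += n;
        if (n < valid)
            break;                             /* the run ended inside this word */
    }

    if (pos > width)
        pos = width;                           /* clamp padding bits past the row end */
    return pos - start;
}
```

For a row with bytes 0x1F 0xFF 0xE0 and a width of 24, run_length(row, 24, 0, 0) gives 3 and run_length(row, 24, 3, 1) gives 16. Each loop iteration consumes up to 64 pixels with one CLZ instead of testing them bit by bit or going through a byte lookup table.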

Larry B.