2021.05.02 09:48 "Re: [Tiff] SIMD optimizations", by Even Rouault
That sounds interesting. I've given your branch a try.
- The build error is due to a missing -mssse3 flag, which gcc requires. Without it, on x86_64 the build falls back to baseline SSE2, which is the only SIMD instruction set guaranteed to be available at runtime.
- ideally we should have a generic build that contains the base SSE2 code path (or just plain x86 for 32-bit x86) and the SSSE3-optimized code path, but only takes the latter if SSSE3 is available at runtime (for detection of instruction sets, I have code for that in https://github.com/OSGeo/gdal/blob/master/gdal/port/cpl_cpu_features.cpp / https://github.com/OSGeo/gdal/blob/master/gdal/port/cpl_cpu_features.h that's compatible with the libtiff license). The SSSE3 code should be put in a separate .c file that would be the only one built with -mssse3. I'm not familiar with the situation on the Arm / NEON side, but I presume it might be similar (or perhaps we could just live with an explicit turn-neon-simd flag at build time)
- we'd probably want autoconf support. (That's where maintaining several build systems is painful...)
- there is a bug in the SSSE3 code path. The pixel values I get after decompression are altered. The tiny attached file can demonstrate that.
- valgrind isn't happy either (not on the attached file, but on a slightly larger one: 400x200, stripped with strip height of 6, 3 interleaved channels of 8-bit each)
==2742662== Invalid read of size 16
==2742662== at 0x4890D68: _mm_loadu_si128 (emmintrin.h:703)
==2742662== by 0x4890D68: horAcc8 (tif_predict.c:376)
- the libtiff test suite is seriously lacking (but we already knew that), as "make check" is happy despite those bugs. We should definitely add tests to it that verify the modified code paths behave correctly. For ARM that would be a bit tricky as our CI is x86 only. I guess we could use a gcc cross-compiler and qemu user-mode emulation to build & test for ARM as well from x86 machines.
- regarding performance gains, I only got a 5% improvement on a large 8000x8000 image: LZW compressed, predictor 2, 512x512 tiles, 4 interleaved channels, 16-bit, compression ratio (uncompressed_size/compressed_size) of 1.6. That seems a bit too low to justify all the above efforts.
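
To make the runtime-dispatch and testing points above more concrete, here is a rough sketch. It is not the code from the branch nor from cpl_cpu_features. All the names (acc8_scalar, acc8_ssse3, pick_acc8, acc8_self_check) are made up for illustration, and it only handles the simplest case of a single 8-bit channel. To keep it in one file it uses the gcc/clang target("ssse3") function attribute and __builtin_cpu_supports() instead of a separate .c file built with -mssse3, which is an assumption of the sketch and not what I'd suggest for the real build. The SIMD loop only loads full 16-byte blocks and leaves the remainder to scalar code, which is one way to avoid reads past the end of the buffer. I have not checked whether that is the actual cause of the valgrind report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <tmmintrin.h> /* SSSE3 intrinsics, including _mm_shuffle_epi8 */
#define HAVE_SSSE3_PATH 1
#endif

/* Portable reference: undo predictor 2 for one 8-bit channel (stride 1). */
static void acc8_scalar(uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        p[i] = (uint8_t)(p[i] + p[i - 1]);
}

#ifdef HAVE_SSSE3_PATH
/* Hypothetical SSSE3 variant. The attribute lets this one function use
 * SSSE3 even though the rest of the file is built for the baseline. */
__attribute__((target("ssse3")))
static void acc8_ssse3(uint8_t *p, size_t n)
{
    const __m128i last = _mm_set1_epi8(15); /* shuffle mask: broadcast byte 15 */
    __m128i carry = _mm_setzero_si128();
    size_t i = 0;

    /* Only touch complete 16-byte blocks: i + 16 <= n. */
    for (; i + 16 <= n; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(p + i));
        /* In-register inclusive prefix sum over 16 bytes (log steps). */
        x = _mm_add_epi8(x, _mm_slli_si128(x, 1));
        x = _mm_add_epi8(x, _mm_slli_si128(x, 2));
        x = _mm_add_epi8(x, _mm_slli_si128(x, 4));
        x = _mm_add_epi8(x, _mm_slli_si128(x, 8));
        x = _mm_add_epi8(x, carry); /* add the running total of earlier blocks */
        _mm_storeu_si128((__m128i *)(p + i), x);
        carry = _mm_shuffle_epi8(x, last);
    }
    /* Scalar tail for the remaining 0..15 bytes. */
    for (; i < n; i++)
        if (i > 0)
            p[i] = (uint8_t)(p[i] + p[i - 1]);
}
#endif

typedef void (*acc8_fn)(uint8_t *, size_t);

/* Choose the implementation at runtime, not at build time. */
static acc8_fn pick_acc8(void)
{
#ifdef HAVE_SSSE3_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        return acc8_ssse3;
#endif
    return acc8_scalar;
}

/* The kind of check the test suite lacks: the dispatched path must be
 * byte-for-byte identical to the scalar reference. Returns 1 on match. */
static int acc8_self_check(size_t n)
{
    uint8_t a[512], b[512];
    if (n > sizeof a)
        return 0;
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (uint8_t)(i * 37u + 11u);
    acc8_scalar(a, n);
    pick_acc8()(b, n);
    return memcmp(a, b, n) == 0;
}
```

Calling acc8_self_check() with lengths around the block size (15, 16, 17) and with a length that is not a multiple of 16 is the interesting part, since those are the cases where a tail-handling bug would show up.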
My software is free, but my time generally not.