Why is my SIMD implementation of octal string parsing unable to beat the compiler's?


To learn more about SIMD and intrinsics, I figured I'd try to write an octal string parser. For this example, consider the strings to be:

ASCII.

Fixed Length of 12.

Not null-terminated.

Zero-padded on the left.

Guaranteed to never be malformed.

As a start, I wrote an implementation that used neither vectorization nor intrinsic functions:

```cpp
uint64_t parseOctalToBase10(const char message[12]) {
    uint64_t size = 0;
    for (int i = 0; i < 12; i++) {
        uint64_t value = message[i];
        value -= '0';
        size |= value << (3 * ((12 - 1) - i));
    }
    return size;
}
```

Using Clang version 19.1.7 on Linux, I compiled like this:

clang++ -std=c++26 -O3 -mavx512vl -mavx512bw -mbmi2 -S example.cpp -o example.s

This gave the following assembly:

```asm
_Z18parseOctalToBase10PKc:              # @_Z18parseOctalToBase10PKc
        .cfi_startproc
# %bb.0:
        vpmovsxbq       (%rdi), %zmm0
        vpsllvq         .LCPI0_0(%rip), %zmm0, %zmm0
        vpaddq          .LCPI0_1(%rip), %zmm0, %zmm0
        movsbq          8(%rdi), %rax
        shlq            $9, %rax
        addq            $-24576, %rax           # imm = 0xA000
        movsbq          9(%rdi), %rcx
        shlq            $6, %rcx
        addq            $-3072, %rcx            # imm = 0xF400
        movsbq          10(%rdi), %rdx
        leaq            -384(,%rdx,8), %rdx
        orq             %rcx, %rdx
        orq             %rax, %rdx
        movsbq          11(%rdi), %rcx
        addq            $-48, %rcx
        orq             %rdx, %rcx
        vextracti64x4   $1, %zmm0, %ymm1
        vporq           %zmm1, %zmm0, %zmm0
        vextracti128    $1, %ymm0, %xmm1
        vpor            %xmm1, %xmm0, %xmm0
        vpshufd         $238, %xmm0, %xmm1      # xmm1 = xmm0[2,3,2,3]
        vpor            %xmm1, %xmm0, %xmm0
        vmovq           %xmm0, %rax
        orq             %rcx, %rax
        vzeroupper
        retq
```

Next, I moved on to create a SIMD version:

```cpp
inline uint64_t parseOctalToBase10SIMD(const char message[12]) {
    const uint64_t EXTRACTION_MASK_LOW  = 0x0707070707070707;
    const uint64_t EXTRACTION_MASK_HIGH = 0x0000000007070707;
    __mmask16 loadMask = 0b0000111111111111;
    __m128i shuffleMask = _mm_setr_epi8(11, 10, 9, 8, 7, 6, 5, 4,
                                        3, 2, 1, 0, 15, 14, 13, 12);
    __m128i subtrahend = _mm_set1_epi8('0');

    __m128i vector = _mm_maskz_loadu_epi8(loadMask, message);
    vector = _mm_shuffle_epi8(vector, shuffleMask);
    vector = _mm_sub_epi8(vector, subtrahend);

    uint64_t low = _mm_cvtsi128_si64(vector);
    low = _pext_u64(low, EXTRACTION_MASK_LOW);

    vector = _mm_srli_si128(vector, sizeof(uint64_t));
    uint64_t high = _mm_cvtsi128_si64(vector);
    high = _pext_u64(high, EXTRACTION_MASK_HIGH);

    return (high << 24) | low;
}
```

Compiled with the same command as before, it gave this assembly:

```asm
_Z22parseOctalToBase10SIMDPKc:          # @_Z22parseOctalToBase10SIMDPKc
        .cfi_startproc
# %bb.0:
        movw            $4095, %ax              # imm = 0xFFF
        kmovd           %eax, %k1
        vmovdqu8        (%rdi), %xmm0 {%k1} {z}
        vpshufb         .LCPI1_0(%rip), %xmm0, %xmm0 # xmm0 = xmm0[11,10,9,8,7,6,5,4,3,2,1,0,15,14,13,12]
        vpaddb          .LCPI1_1(%rip), %xmm0, %xmm0
        vmovq           %xmm0, %rax
        movabsq         $506381209866536711, %rcx # imm = 0x707070707070707
        pextq           %rcx, %rax, %rcx
        vpextrq         $1, %xmm0, %rax
        movl            $117901063, %edx        # imm = 0x7070707
        pextq           %rdx, %rax, %rax
        shlq            $24, %rax
        orq             %rcx, %rax
        retq
```

Problem

I ran a benchmark to evaluate the performance of the two methods. I created 100,000,000 octal strings initialized with random values, then looped through all of them with each of the implementations I wrote.

Here are the performance metrics I collected from 10 iterations:

| Implementation | Average Completion Duration (μs) | σ |
| --- | --- | --- |
| Compiler Optimized | 131836 | ±2488 |
| Personal | 161733 | ±15987 |

The compiler's optimizations have beaten me: its average completion time is significantly lower, and the deviation in completion time is much greater for my own implementation.

Discussion

I am not quite sure why the compiler wins. I looked through uops.info to see if any of my instructions had higher µop counts, and I didn't see any standouts. Is it throughput? My version's instruction count is lower.

When designing the function, I struggled to keep the data in vector form for long. Converting from ASCII to integers was about the extent of what I was able to do while vectorized. Having the data in a vector and being able to pull out a high and a low 64-bit integer was helpful when passing them on to `pextq`. I was aiming to find a way to pack all the bits into the low end of the vector, but couldn't think of a better approach. I used `_mm_maskz_loadu_epi8` because the message's alignment isn't guaranteed and the data length isn't 16 bytes; I wonder if this could be a problem. That said, I don't want to put alignment or size constraints on the messages in any revised implementation.

The high standard deviation makes me think (with very limited knowledge) that it may also be a caching problem.

Any insight would be greatly appreciated.
