Why is my SIMD implementation of octal string parsing unable to beat the compiler's?


To learn more about SIMD and intrinsics, I figured I'd try to write an octal string parser. For this example, consider the strings to be:

ASCII.

Fixed Length of 12.

Not null-terminated.

Zero-padded on the left.

Guaranteed to never be malformed.

As a start, I wrote an implementation that used neither vectorization nor intrinsic functions:

```cpp
uint64_t parseOctalToBase10(const char message[12]) {
    uint64_t size = 0;
    for (int i = 0; i < 12; i++) {
        uint64_t value = message[i];
        value -= '0';
        size |= value << (3 * ((12 - 1) - i));
    }
    return size;
}
```

Using Clang version 19.1.7 on Linux, I compiled like this:

clang++ -std=c++26 -O3 -mavx512vl -mavx512bw -mbmi2 -S example.cpp -o example.s

This gave the following assembly:

```asm
_Z18parseOctalToBase10PKc:              # @_Z18parseOctalToBase10PKc
        .cfi_startproc
# %bb.0:
        vpmovsxbq       (%rdi), %zmm0
        vpsllvq         .LCPI0_0(%rip), %zmm0, %zmm0
        vpaddq          .LCPI0_1(%rip), %zmm0, %zmm0
        movsbq          8(%rdi), %rax
        shlq            $9, %rax
        addq            $-24576, %rax           # imm = 0xA000
        movsbq          9(%rdi), %rcx
        shlq            $6, %rcx
        addq            $-3072, %rcx            # imm = 0xF400
        movsbq          10(%rdi), %rdx
        leaq            -384(,%rdx,8), %rdx
        orq             %rcx, %rdx
        orq             %rax, %rdx
        movsbq          11(%rdi), %rcx
        addq            $-48, %rcx
        orq             %rdx, %rcx
        vextracti64x4   $1, %zmm0, %ymm1
        vporq           %zmm1, %zmm0, %zmm0
        vextracti128    $1, %ymm0, %xmm1
        vpor            %xmm1, %xmm0, %xmm0
        vpshufd         $238, %xmm0, %xmm1      # xmm1 = xmm0[2,3,2,3]
        vpor            %xmm1, %xmm0, %xmm0
        vmovq           %xmm0, %rax
        orq             %rcx, %rax
        vzeroupper
        retq
```

Next, I moved on to create a SIMD version:

```cpp
inline uint64_t parseOctalToBase10SIMD(const char message[12]) {
    const uint64_t EXTRACTION_MASK_LOW  = 0x0707070707070707;
    const uint64_t EXTRACTION_MASK_HIGH = 0x0000000007070707;
    __mmask16 loadMask = 0b0000111111111111;
    __m128i shuffleMask = _mm_setr_epi8(11, 10, 9, 8, 7, 6, 5, 4,
                                        3, 2, 1, 0, 15, 14, 13, 12);
    __m128i subtrahend = _mm_set1_epi8('0');

    __m128i vector = _mm_maskz_loadu_epi8(loadMask, message);
    vector = _mm_shuffle_epi8(vector, shuffleMask);
    vector = _mm_sub_epi8(vector, subtrahend);

    uint64_t low = _mm_cvtsi128_si64(vector);
    low = _pext_u64(low, EXTRACTION_MASK_LOW);

    vector = _mm_srli_si128(vector, sizeof(uint64_t));
    uint64_t high = _mm_cvtsi128_si64(vector);
    high = _pext_u64(high, EXTRACTION_MASK_HIGH);

    return (high << 24) | low;
}
```

Compiled with the same command as before, it gave this assembly:

```asm
_Z22parseOctalToBase10SIMDPKc:          # @_Z22parseOctalToBase10SIMDPKc
        .cfi_startproc
# %bb.0:
        movw            $4095, %ax              # imm = 0xFFF
        kmovd           %eax, %k1
        vmovdqu8        (%rdi), %xmm0 {%k1} {z}
        vpshufb         .LCPI1_0(%rip), %xmm0, %xmm0 # xmm0 = xmm0[11,10,9,8,7,6,5,4,3,2,1,0,15,14,13,12]
        vpaddb          .LCPI1_1(%rip), %xmm0, %xmm0
        vmovq           %xmm0, %rax
        movabsq         $506381209866536711, %rcx # imm = 0x707070707070707
        pextq           %rcx, %rax, %rcx
        vpextrq         $1, %xmm0, %rax
        movl            $117901063, %edx        # imm = 0x7070707
        pextq           %rdx, %rax, %rax
        shlq            $24, %rax
        orq             %rcx, %rax
        retq
```

Problem

I ran a benchmark to evaluate the performance of the two methods. I created 100,000,000 octal strings initialized with random values, then looped through all of them with each of the implementations I wrote.

Here are the performance metrics I collected from 10 iterations:

| Implementation | Average Completion Duration (μs) | σ |
| --- | --- | --- |
| Compiler Optimized | 131836 | ±2488 |
| Personal | 161733 | ±15987 |

The compiler's optimizations have beaten me: its average completion time is significantly lower, and the deviation in completion time is much greater for my own implementation.

Discussion

I am not quite sure why the compiler wins. I looked through uops.info to see if any of my instructions had higher µop counts, and I didn't see any standouts. Is it throughput? My version's instruction count is lower.

When designing the function, I struggled to keep the data in vector form for long. Converting from ASCII to integers was about the extent of what I was able to do while vectorized. Having the data in a vector and being able to pull out a high and a low 64-bit integer was helpful when passing them on to `pextq`. I was aiming to find a way to pack all the bits into the low end of the vector, but couldn't think of a better approach. I used `_mm_maskz_loadu_epi8` because the message's alignment isn't guaranteed and the data length isn't 16 bytes; I wonder if this could be a problem. That said, I don't want to put alignment or size constraints on the messages in any revised implementation.

The high standard deviation makes me think (with very limited knowledge) that it may also be a caching problem.

Any insight would be greatly appreciated.
