PmovIntrinsicBug  
Issues/bugs when using PMOVSX/PMOVZX/INSERTPS
Updated Aug 13, 2012 by Jake.Stine

There is an inherent design flaw in Intel's intrinsic wrapper for these instructions, which at best forces us to use sub-optimal code generation and at worst causes applications to crash randomly depending on compiler, compiler options, and surrounding code. (ouch!)

The only C signatures available for the pmovsx/zx and insertps instructions are of the form:

__m128i _mm_cvtepi8_epi16( __m128i src );
__m128i _mm_cvtepi8_epi32( __m128i src );
__m128  _mm_insert_ps( __m128 dest, __m128 src, const int imm );
// etc.

These forms assume the source operand is already in a simd128 register. However, one of the most common uses of these instructions is to load packed 8/16-bit data from memory and unpack it into SIMD registers, and since the instructions accept m32/m64 memory references directly, unaligned sequential unpacking is not a concern. A typical use, as seen from an assembly POV:

    mov       rax,  &packed_array_of_byte_data
Loop:
    pmovsxbw  xmm1, [rax]        ; sign-extend 8 packed bytes into 8 words
    // ... perform operations on xmm1 ...
    add       rax,  8
    dec       [counter]
    jnz       Loop

Translated to sse-intrinsic form:

u8*     packed_array_of_byte_data;
for( int i=0; i<count; ++i) {
    __m128  xmm1;
    i_pmovsxbw( xmm1, packed_array_of_byte_data );
    // ... perform operations on xmm1 ...
    packed_array_of_byte_data+=8;
}

So far so good. Looks painless enough. But when we translate to the intel-intrinsic form, we see a problem:

u8*     packed_array_of_byte_data;
for( int i=0; i<count; ++i) {
    __m128i xmm1 = _mm_cvtepi8_epi16( *(__m128i*)packed_array_of_byte_data );	// typecast implies 128-bit alignment
    // ... perform operations on xmm1 ...
    packed_array_of_byte_data+=8;
}

Notice the typecast, which is needed because _mm_cvtepi8_epi16(__m128i val) cannot accept a pointer or u64 value. On every current-generation compiler that typecast implies 128-bit alignment, but the source data is not aligned. In release/O2 builds this Just Works Anyway(tm) most of the time, because the compiler folds the pointer into pmovsxbw's memory operand, which supports unaligned accesses. Problems typically only arise when the compiler generates non-optimized code, which keeps __m128 values on the stack (from which debug value inspection is more reliable): the dereference then becomes an explicit movdqa load, and movdqa faults on an unaligned address (a general-protection exception) as soon as the pointer has been advanced by 8 and is no longer 16-byte aligned -- that is, on the second read. Therefore the safe implementation alternative is this:

u8*     packed_array_of_byte_data;
for( int i=0; i<count; ++i) {
    __m128i xmm1 = _mm_cvtepi8_epi16( _mm_loadu_si128((__m128i*)packed_array_of_byte_data) );
    // ... perform operations on xmm1 ...
    packed_array_of_byte_data+=8;
}

... which adds an extra unaligned load instruction to the algorithm. If the fix is only applied to debug builds, that isn't a performance issue. Problem solved! Right? Unfortunately, there are some rare circumstances where optimized compiler output will pre-load the input value into an xmm register instead of folding the pointer directly into the pmovsxbw instruction. This seems to happen about 30-40% of the time in Visual Studio -- enough that our sse_intrinsics.h library must assume the use of the extra movdqu instruction in all Visual Studio build targets, release or debug.

GCC: I haven't personally witnessed unaligned exceptions in optimized builds generated by GCC, and therefore sse_intrinsics.h does not enable the fix for GCC builds (yet).
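
To illustrate that policy, a wrapper along these lines would do the job. This is only a sketch -- the name i_pmovsxbw_loadu and the exact #ifdef gating are illustrative, not the actual sse_intrinsics.h code:

#include <smmintrin.h>	// SSE4.1: _mm_cvtepi8_epi16 (_mm_loadu_si128 is SSE2)

// Sketch: sign-extend 8 packed bytes at 'src' into 8 words.  Under Visual
// Studio we always pay for an explicit unaligned load, because the optimizer
// sometimes pre-loads the operand with movdqa.  Under GCC the dereference has
// (so far) always been folded into pmovsxbw's unaligned-safe memory operand.
static inline __m128i i_pmovsxbw_loadu( const void* src )
{
#if defined(_MSC_VER)
    return _mm_cvtepi8_epi16( _mm_loadu_si128( (const __m128i*)src ) );
#else
    return _mm_cvtepi8_epi16( *(const __m128i*)src );
#endif
}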

Vectorized Loop Alternative

A final alternative, where vectorized loop unrolling is applicable, is to structure the loop as follows:

u8*     packed_array_of_byte_data;
for( int i=0; i<count/2; ++i) {
    __m128  src128, xmm1, xmm2;
    i_movdqa( src128, packed_array_of_byte_data );	// aligned 16-byte load
    i_pmovsxbw( xmm1, src128 );		// sign-extend the lower 8 bytes
    i_psrldq( xmm2, src128, 8 );	// move upper 64 bits into lower 64 bits.
    i_pmovsxbw( xmm2, xmm2 );		// sign-extend the (former) upper 8 bytes

    // ... perform operations on xmm1 and xmm2...
    packed_array_of_byte_data+=16;	// 16 bytes consumed per iteration
}
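
For reference, the same idea written with the raw Intel intrinsics might look roughly like this (a sketch; it assumes <smmintrin.h> is included and that packed_array_of_byte_data starts out 16-byte aligned, as the aligned load requires):

u8*     packed_array_of_byte_data;
for( int i=0; i<count/2; ++i) {
    __m128i src128 = _mm_load_si128( (const __m128i*)packed_array_of_byte_data );	// movdqa
    __m128i xmm1   = _mm_cvtepi8_epi16( src128 );				// lower 8 bytes -> 8 words
    __m128i xmm2   = _mm_cvtepi8_epi16( _mm_srli_si128( src128, 8 ) );	// upper 8 bytes -> 8 words

    // ... perform operations on xmm1 and xmm2...
    packed_array_of_byte_data+=16;
}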

Note that this alternative -- while probably better than using movdqu on Core2/Phenom architectures -- is still not much of a consolation prize, since the efficient alternative (if it didn't trip the compiler's alignment assumptions) would be this:

u8*     packed_array_of_byte_data;
for( int i=0; i<count/2; ++i) {
    __m128  src128, xmm1, xmm2;
    i_pmovsxbw( xmm1, *(__m128*)(packed_array_of_byte_data+0) );
    i_pmovsxbw( xmm2, *(__m128*)(packed_array_of_byte_data+8) );

    // ... perform operations on xmm1 and xmm2...
    packed_array_of_byte_data+=16;	// two 8-byte reads per iteration
}

... all modern CPUs will coalesce this into a single 128-bit fetch, such that the second 8-byte read executes with a latency of 0 or 1 cycles, making this the clear winner -- if only the Intel intrinsics had been given a form that accepts a memory operand directly, instead of forcing a cast and the alignment assumption that comes with it.

INSERTPS Alternative

The best way to work around the INSERTPS problem for now is to use PINSRD instead. This may cause some performance loss on Nehalem if the xmm register is subsequently used in floating-point operations (data crossing from the integer to the floating-point domain incurs a bypass delay there), but that penalty is smaller than the cost of the other alternatives.
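
For the common case of inserting a 32-bit value from memory into a chosen lane, the substitution might look like this. This is a sketch only: the macro name is made up for illustration, and the lane index must be a compile-time constant because PINSRD takes an immediate operand:

#include <smmintrin.h>

// Sketch: insert the 32-bit float at 'srcptr' into lane LANE of 'dst' using
// PINSRD (_mm_insert_epi32) instead of INSERTPS.  The casts only reinterpret
// the register bits; the potential Nehalem cost is the int->float bypass
// delay when 'dst' is next used in a floating-point operation.
#define i_insert_float_pinsrd( dst, srcptr, LANE ) \
    _mm_castsi128_ps( _mm_insert_epi32( _mm_castps_si128( dst ), \
                                        *(const int*)(srcptr), (LANE) ) )

// usage (illustrative):  vec = i_insert_float_pinsrd( vec, &someFloat, 2 );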
