C++ inline function wrapping a single vmovups in GCC inline assembly


I'm trying to work around an apparent bug in the Clang compiler, where using the AVX intrinsic _mm256_loadu_ps results in unnecessary instructions being output in the assembly. In particular, it first does a vmovups of the first half of the input vector into an xmm register, then joins the second half to the first using a vinsertf128 instruction, slowing down the program a bit. Instead I'd expect a single vmovups instruction into a compiler-allocated ymm register.

I've been comfortable with SSE/AVX intrinsics, but now that I need to drop down to inline assembly I'm lost.

I'd like an inline function that does the same as the following, except that the vmovups should be in inline assembly.

inline __m256 v8floadu(const float* pf) {
    return _mm256_loadu_ps(pf);
}

Here's what I've tried so far, but it doesn't work (it seems to move *pf as a single float onto the stack, and then load from that space):

inline __m256 v8floadu(const float* pf) {
    __m256 m;
    __asm__("vmovups %1, %0" : "=x" (m) : "xm" (pf));
    return m;
}

Thanks in advance.

By passing the pointer as the input argument, you're loading the value of the pointer itself rather than what it points to. You need to pass the value you want to load.

__m256 v8floadu(const float* pf) {
    __m256 m;
    __asm__("vmovups %1, %0" : "=x" (m) : "m" (*pf));
    return m;
}
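One caveat with the "m" (*pf) operand: it tells the compiler that the asm reads a single float, while vmovups into a ymm register reads 32 bytes. A possible variant, sketched below, makes the memory operand cover all eight floats so the optimizer knows the whole range is read. The V8fBlock struct and the copy8 helper are hypothetical names introduced only for this illustration, and the target("avx") attribute is an assumption added so the sketch builds without -mavx; drop it if you already compile with AVX enabled.

```cpp
#include <immintrin.h>

// Hypothetical helper type: describes the full 32 bytes that vmovups reads,
// so the "m" operand is not limited to a single float.
struct V8fBlock { float lane[8]; };

// Assumption: target("avx") enables AVX for this function only.
__attribute__((target("avx")))
inline __m256 v8floadu(const float* pf) {
    __m256 m;
    __asm__("vmovups %1, %0"
            : "=x" (m)                                        // any vector register
            : "m" (*reinterpret_cast<const V8fBlock*>(pf)));  // all 8 floats at pf
    return m;
}

// Hypothetical usage helper: loads 8 floats with v8floadu and stores them
// unaligned to dst, so callers built without AVX never handle __m256 directly.
__attribute__((target("avx")))
inline void copy8(const float* src, float* dst) {
    _mm256_storeu_ps(dst, v8floadu(src));
}
```

Calling copy8(src + 1, dst) with src pointing into a float array exercises the unaligned case, since src + 1 is generally not 32-byte aligned.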
