c - Fastest way to work with unaligned data on a word-aligned processor?

July 15, 2011

I'm working on a project for an ARM Cortex-M0, which does not support unaligned (not 4-byte-aligned) access, and I'm trying to optimize the speed of operations on unaligned data.

I'm storing Bluetooth Low Energy access addresses (48-bit) as 6-byte arrays in packed structs that act as packet buffers. Because of the packing, the BLE addresses do not necessarily start at a word-aligned address, and I'm running into complications when optimizing my access functions for these addresses.

The first, and most obvious, approach is a loop operating on each byte in the array individually. Checking whether two addresses are the same could, for instance, be done like this:

uint8_t ble_adv_addr_is_equal(uint8_t* addr1, uint8_t* addr2)
{
  /* Compare the two 6-byte addresses one byte at a time. */
  for (uint32_t i = 0; i < 6; ++i)
  {
    if (addr1[i] != addr2[i])
      return 0;
  }
  return 1;
}

I'm doing a lot of comparisons in my project, and I wanted to see if I could squeeze some more speed out of this function. I realised that for aligned addresses, I could read them as uint64_t and compare them with 48-bit masks applied, i.e.

(*(uint64_t*)&addr1[0] & 0xffffffffffff) == (*(uint64_t*)&addr2[0] & 0xffffffffffff)

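Similar operations could be done for writing, and this works well for the aligned versions. However, since my addresses aren't always word-aligned (or even half-word-aligned), I would have to do some extra tricks to make this work.

First off, I came up with this unoptimized nightmare of a compiler macro: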

#define addr_aligned(_addr) (uint64_t)(((*((uint64_t*)(((uint32_t)_addr) & ~0x03)) >> (8*(((uint32_t)_addr) & 0x03))) & 0x000000ffffffff)\
    | (((*((uint64_t*)(((uint32_t)_addr+4) & ~0x03))) << (32-8*(((uint32_t)_addr) & 0x03)))) & 0x00ffff00000000)

It basically shifts the entire address so that it starts at the previous word-aligned memory position, regardless of the offset. For instance:

    0       1       2       3
|-------|-------|-------|-------|
|.......|.......|.......|<addr0>|
|<addr1>|<addr2>|<addr3>|<addr4>|
|<addr5>|.......|.......|.......|

becomes

    0       1       2       3
|-------|-------|-------|-------|
|<addr0>|<addr1>|<addr2>|<addr3>|
|<addr4>|<addr5>|.......|.......|
|.......|.......|.......|.......|

And I can safely do a 64-bit comparison of two addresses, regardless of their actual alignment:

addr_aligned(addr1) == addr_aligned(addr2)

Neat! But this operation takes 71 lines of assembly when compiled with ARM-MDK, compared to 53 when doing the comparison in a simple loop (I'm just going to disregard the additional time spent in branch instructions here), and ~30 when unrolled. Also, it doesn't work for writes, as the alignment only happens in the registers, not in memory. Unaligning again would require a similar operation, and the whole approach generally seems to suck.

Is an unrolled for-loop working on each byte individually really the fastest solution for cases like this? Does anyone have experience with similar scenarios, and feel like sharing some of their wizardry here?

Update

OK, because your data has no alignment whatsoever, you need to either read all the data in, byte by byte, into properly aligned buffers and then do really fast 64-bit compares, or, if you won't be using the data after the compares, just read in the data as bytes and do 6 compares, in which case calling memcmp() might be a better option.
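The two options above could be sketched roughly as follows. This is only an illustration under some assumptions: the names ble_addr_load48, ble_addr_equal_copy and ble_addr_equal_memcmp are hypothetical (they are not from the question), and memcpy is used for the byte-by-byte copy into the aligned 64-bit value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a 6-byte address from any alignment into an
 * aligned 64-bit value. The two unused bytes stay zero, so no 48-bit
 * mask is needed when comparing two loaded values. */
static uint64_t ble_addr_load48(const uint8_t *addr)
{
    uint64_t value = 0;
    memcpy(&value, addr, 6);  /* memcpy makes no alignment assumptions */
    return value;
}

/* Option 1: load into aligned storage, then do a 64-bit compare.
 * This mainly pays off when a loaded value is reused several times. */
static int ble_addr_equal_copy(const uint8_t *addr1, const uint8_t *addr2)
{
    return ble_addr_load48(addr1) == ble_addr_load48(addr2);
}

/* Option 2: one-off comparison, leaving the byte handling to memcmp(). */
static int ble_addr_equal_memcmp(const uint8_t *addr1, const uint8_t *addr2)
{
    return memcmp(addr1, addr2, 6) == 0;
}
```

If the same address is compared against many others, it probably makes sense to call ble_addr_load48() once, keep the uint64_t around, and compare those values directly.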

For data that is at least 16-bit aligned:

  /* Compare the 6-byte addresses as three 16-bit half-words. */
  u16 *src1 = (u16 *)addr1;
  u16 *src2 = (u16 *)addr2;
  for (int i = 0; i < 3; ++i)
  {
    if (src1[i] != src2[i])
      return 0;
  }
  return 1;

This will be twice as fast as byte comparisons and might be the best you can reasonably do, as long as your data is at least 2-byte aligned. I'd also expect the compiler to remove the for loop completely and just use conditionally executed if statements instead.

Trying for 32-bit aligned reads will not be faster unless you can guarantee that source 1 and source 2 are similarly aligned, i.e. (addr1 & 0x03) == (addr2 & 0x03). If this is the case, then you can read in a 32-bit value and then a 16-bit one (or vice versa, depending on the starting alignment) and remove 1 more compare.

As 16-bit is your shared base, you can start there, and the compiler should generate nice LDRH-type opcodes.
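As a rough sketch of that idea, the function below picks the 32-bit plus 16-bit path only when both pointers share the same offset within a word, and otherwise falls back to three 16-bit compares. The name ble_addr_equal_aligned16 is hypothetical, and the code assumes both addresses are at least 2-byte aligned. The pointer casts follow the style of the snippet above and rely on the compiler tolerating this kind of type punning.

```c
#include <stdint.h>

/* Hypothetical helper. Assumes addr1 and addr2 are at least 2-byte aligned. */
static int ble_addr_equal_aligned16(const uint8_t *addr1, const uint8_t *addr2)
{
    uintptr_t offset1 = (uintptr_t)addr1 & 0x03;
    uintptr_t offset2 = (uintptr_t)addr2 & 0x03;

    if (offset1 == offset2) {
        if (offset1 == 0) {
            /* Both word aligned: one 32-bit compare, then one 16-bit compare. */
            return *(const uint32_t *)addr1 == *(const uint32_t *)addr2
                && *(const uint16_t *)(addr1 + 4) == *(const uint16_t *)(addr2 + 4);
        }
        /* Both at offset 2: 16-bit compare first, the rest is word aligned. */
        return *(const uint16_t *)addr1 == *(const uint16_t *)addr2
            && *(const uint32_t *)(addr1 + 2) == *(const uint32_t *)(addr2 + 2);
    }

    /* Different offsets: fall back to three 16-bit compares. */
    const uint16_t *half1 = (const uint16_t *)addr1;
    const uint16_t *half2 = (const uint16_t *)addr2;
    return half1[0] == half2[0] && half1[1] == half2[1] && half1[2] == half2[2];
}
```

Whether the extra alignment check is worth it depends on how often the two addresses actually share an offset. If that is rare, the plain 16-bit loop is likely the simpler choice.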
